Theoretical Research

Evaluating Chinese Answers' Quality in the Community QA System: A Case Study of Zhihu (中文问答社区答案质量的评价研究:以知乎为例)

  • Wang Wei (王伟),
  • Ji Yuqiang (冀宇强),
  • Wang Hongwei (王洪伟),
  • Zheng Lijuan (郑丽娟)
  • 1. College of Business Administration, Huaqiao University, Quanzhou 362021;
    2. School of Economics and Management, Tongji University, Shanghai 200092;
    3. School of Business, Liaocheng University, Liaocheng 252000
Wang Wei (ORCID: 0000-0001-5981-7312), lecturer, PhD; Ji Yuqiang (ORCID: 0000-0003-0478-4706), engineer, master's degree; Zheng Lijuan (ORCID: 0000-0002-8182-6765), lecturer, PhD.

Received date: 2017-07-02

  Revised date: 2017-09-09

  Online published: 2017-11-20

Funding

This paper is one of the research outcomes of two National Natural Science Foundation of China projects: "The Influence of Textual Linguistic Features on the Financing Performance of Crowdfunding Projects: A Text Mining Approach" (grant No. 71601082) and "Online and Offline Service Recovery Based on Text Mining of Online Reviews: The Case of Online Retailing" (grant No. 71701085).

Cite this article

Wang Wei, Ji Yuqiang, Wang Hongwei, Zheng Lijuan. Evaluating Chinese Answers' Quality in the Community QA System: A Case Study of Zhihu[J]. Library and Information Service (图书情报工作), 2017, 61(22): 36-44. DOI: 10.13266/j.issn.0252-3116.2017.22.005

Abstract

[Purpose/significance] Online Q&A communities have become an important way for Internet users to obtain high-quality knowledge, so studying answer quality in Chinese Q&A communities matters for knowledge dissemination. [Method/process] Taking Zhihu, one of the largest Chinese Q&A communities, as the research object, we applied data mining and machine learning methods and selected three classification models (logistic regression, support vector machine, and random forest) for three-level progressive training and testing. We constructed a feature system for answer quality along three dimensions: structured features, text features, and users' social attributes. [Result/conclusion] The experimental results show that the performance of all three classification models improves steadily as the feature system is enriched, and that random forest, an ensemble classifier, achieves excellent classification performance when the full feature set is used. Analysis of feature combinations shows that random forest models that include users' social attributes consistently outperform the other models at the same level, which reflects the value of social networks in evaluating answer quality. We conclude that answer quality can be evaluated from two perspectives, the answer itself and its author, and that the feature system and models we built can predict answer quality fairly comprehensively.
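The three-level progressive evaluation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the three levels are cumulative (structured features, then structured plus text, then all three groups). The column names, the `high_quality` label, the toy rows, and the nearest-centroid stand-in classifier are all hypothetical. The paper itself trains logistic regression, support vector machine, and random forest models, which are not reproduced here.

```python
from statistics import mean

# Hypothetical feature groups; column names are illustrative, not the paper's.
FEATURE_GROUPS = [
    ("structured", ["answer_length", "image_count"]),
    ("text", ["readability"]),
    ("social", ["follower_count"]),
]


def progressive_feature_sets(groups):
    """Yield (label, columns) for cumulative feature levels 1..n."""
    names, columns = [], []
    for name, group_columns in groups:
        names.append(name)
        columns = columns + group_columns
        yield "+".join(names), list(columns)


def nearest_centroid_predict(train_rows, train_labels, test_rows):
    """Stand-in classifier: assign each test row to the closest class mean."""
    centroids = {}
    for label in set(train_labels):
        members = [r for r, y in zip(train_rows, train_labels) if y == label]
        centroids[label] = [mean(col) for col in zip(*members)]

    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(centroids, key=lambda label: distance(row, centroids[label]))
        for row in test_rows
    ]


def evaluate_levels(train, test, groups=FEATURE_GROUPS,
                    predict=nearest_centroid_predict):
    """Return {level label: accuracy} for each cumulative feature level."""
    y_train = [row["high_quality"] for row in train]
    y_test = [row["high_quality"] for row in test]
    results = {}
    for label, columns in progressive_feature_sets(groups):
        x_train = [[row[c] for c in columns] for row in train]
        x_test = [[row[c] for c in columns] for row in test]
        y_pred = predict(x_train, y_train, x_test)
        results[label] = mean(int(p == t) for p, t in zip(y_pred, y_test))
    return results


def make_row(length, images, readability, followers, label):
    """Build one toy answer record (all values are made up)."""
    return {
        "answer_length": length,
        "image_count": images,
        "readability": readability,
        "follower_count": followers,
        "high_quality": label,
    }


TRAIN = [
    make_row(900, 3, 0.8, 5000, 1),
    make_row(700, 2, 0.7, 3000, 1),
    make_row(100, 0, 0.3, 20, 0),
    make_row(150, 0, 0.4, 50, 0),
]
TEST = [
    make_row(800, 1, 0.6, 4000, 1),
    make_row(120, 1, 0.5, 10, 0),
]

for level, acc in evaluate_levels(TRAIN, TEST).items():
    print(f"{level}: accuracy={acc:.2f}")
```

Any classifier with the same `predict(train_rows, train_labels, test_rows)` shape can be passed in, so the three models named in the abstract could be compared level by level with the same loop. The toy data here is unscaled and far too small to say anything about real performance.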
