[Purpose/Significance] Traditional authorship identification methods are poorly suited to the large volume of web text now emerging. This paper reviews typical methods and key problems in text authorship identification from recent years and evaluates them objectively, in order to provide new ideas for further research. [Method/Process] Research in China and abroad is analyzed from four perspectives: application domains, stylistic feature selection, authorship modeling, and performance evaluation metrics. The development and trends of the field are traced along these lines. [Result/Conclusion] Authorship identification must adapt to text that is short, non-standard, large in scale, high-dimensional, sparse, and multilingual. It requires multi-level features with greater expressive and descriptive power, together with corresponding authorship modeling methods, and it should draw on the latest results in information retrieval, machine learning, and natural language processing to improve efficiency and accuracy.