[Purpose/significance] Against the background of the rise of humanities computing, this paper investigates the automatic classification of Pre-Qin philosophers' texts, so that knowledge can be mined from ancient classics more deeply and accurately. [Method/process] The training and test corpus consists of the full texts of nine Pre-Qin philosophical works: the Analects of Confucius, Laozi, Guanzi, Zhuangzi, Sunzi, Han Fei Zi, Mencius, Xunzi and Mozi. Using support vector machines, with features selected by TF-IDF, information gain, the chi-square statistic and mutual information, the paper carries out automatic classification experiments on these texts. [Result/conclusion] A support vector machine classification model is obtained for each of the four feature selection methods. The best model reaches an F-measure of 99.21%, which indicates good performance and value for wider application.
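As a rough illustration of three of the feature selection statistics named in the abstract, the sketch below computes chi-square, pointwise mutual information and information gain from document counts. It is not the authors' implementation: it assumes the common document-frequency formulation of these statistics, and the toy documents, tokens and class labels are invented for the example.

```python
import math
from collections import Counter

def contingency(docs, labels, term, cls):
    """Count documents by (contains term?, belongs to cls?)."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term, in_cls = term in tokens, label == cls
        if has_term and in_cls:
            a += 1      # term present, in class
        elif has_term:
            b += 1      # term present, other class
        elif in_cls:
            c += 1      # term absent, in class
        else:
            d += 1      # term absent, other class
    return a, b, c, d

def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 term/class contingency table."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom

def pointwise_mi(a, b, c, d):
    """log2 of P(term, class) / (P(term) * P(class))."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")
    return math.log2(a * n / ((a + c) * (a + b)))

def entropy(counts):
    """Shannon entropy (bits) of a list of class counts."""
    total = sum(counts)
    return -sum((k / total) * math.log2(k / total) for k in counts if k) if total else 0.0

def information_gain(docs, labels, term):
    """Reduction in class entropy from knowing whether the term occurs."""
    with_t = Counter(l for toks, l in zip(docs, labels) if term in toks)
    without_t = Counter(l for toks, l in zip(docs, labels) if term not in toks)
    n, n_t = len(docs), sum(with_t.values())
    prior = entropy(list(Counter(labels).values()))
    cond = (n_t / n) * entropy(list(with_t.values())) \
        + ((n - n_t) / n) * entropy(list(without_t.values()))
    return prior - cond

# Invented toy corpus: each document is a set of (hypothetical) tokens.
docs = [{"ren", "li"}, {"ren", "xue"}, {"dao", "de"}, {"dao", "wu"}]
labels = ["Lunyu", "Lunyu", "Laozi", "Laozi"]

for term in ("ren", "li"):
    table = contingency(docs, labels, term, "Lunyu")
    print(term, chi_square(*table), pointwise_mi(*table),
          information_gain(docs, labels, term))
```

In a pipeline of the kind the abstract describes, terms would be ranked by one of these scores, the top-scoring terms kept as features, and the resulting document vectors passed to a support vector machine classifier.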