Library and Information Service
Exploring of Word Segmentation for Fore-Qin Literature Based on the Domain Glossary of Sinological Index Series
Received date: 2015-05-12
Revised date: 2015-05-22
Online published: 2015-06-05
[Purpose/significance] With the rise of humanities computing, this paper automatically segments Fore-Qin (pre-Qin) literature into words so that knowledge can be mined from the ancient classics more deeply and accurately. [Method/process] Using the domain glossary of the Zuo Commentary from the Sinological Index Series, the paper segments Fore-Qin literature with conditional random fields, with training and test corpora drawn from the Zuo Commentary and Yanzi's Spring and Autumn Annals. The feature templates used by the conditional random fields are determined by a combination of statistical and rule-based methods. [Result/conclusion] Within this word segmentation framework for Fore-Qin literature, segmentation models based on a simple feature template, an internal feature template, and a combined feature template are obtained. The best segmentation model reaches an F-measure of 97.47% and has great potential for wider adoption and application. During model construction, merging internal and external feature knowledge effectively improves the precision and recall of the segmentation model.
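The abstract does not spell out how segmentation is cast as a labeling task or how the F-measure is computed, so the following is only a minimal sketch under common conventions, not the authors' implementation. It assumes a four-tag character scheme (B, M, E, S), which is typical input for conditional random field segmenters, and word-level precision, recall and F-measure computed over character spans. The function names and the example sentence with its segmentations are hypothetical illustrations and are not taken from the paper's corpus.

```python
from typing import List, Set, Tuple


def words_to_tags(words: List[str]) -> List[Tuple[str, str]]:
    """Convert a segmented sentence into (character, tag) pairs.

    Assumed tag set (not confirmed by the abstract):
    B = word begin, M = word middle, E = word end, S = single-character word.
    """
    pairs: List[Tuple[str, str]] = []
    for word in words:
        if len(word) == 1:
            pairs.append((word, "S"))
        else:
            pairs.append((word[0], "B"))
            pairs.extend((ch, "M") for ch in word[1:-1])
            pairs.append((word[-1], "E"))
    return pairs


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Return the (start, end) character offsets of every word."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def prf(gold: List[str], predicted: List[str]) -> Tuple[float, float, float]:
    """Word-level precision, recall and F-measure for one sentence."""
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)  # words with identical boundaries
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f_measure


# Hypothetical example: a reference segmentation and a model output
# that wrongly splits the two-character word into single characters.
gold = ["鄭伯", "克", "段", "于", "鄢"]
predicted = ["鄭", "伯", "克", "段", "于", "鄢"]

print(words_to_tags(gold))
p, r, f = prf(gold, predicted)
print(f"{p:.4f} {r:.4f} {f:.4f}")  # 0.6667 0.8000 0.7273
```

In this example four of the six predicted words match the five reference words, giving a precision of 4/6 and a recall of 4/5. A reported figure such as the 97.47% F-measure would normally be aggregated over the whole test corpus rather than a single sentence.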
Huang Shuiqing, Wang Dongbo, He Lin. Exploring of Word Segmentation for Fore-Qin Literature Based on the Domain Glossary of Sinological Index Series[J]. Library and Information Service, 2015, 59(11): 127-133. DOI: 10.13266/j.issn.0252-3116.2015.11.018