Received date: 2014-01-14
Revised date: 2014-02-09
Online published: 2014-02-20
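Funding
This paper is one of the research outcomes of the Major Project of the Key Research Bases for Philosophy and Social Sciences in Jiangsu Universities, "Research on Digital Processing and Application of Chu Ci" (Project No. 2010JDXM037), and of the National Social Science Fund of China project "Research on the Semantization of Chu Ci Literature" (Project No. 10BTQ031).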
Research on Automatic Word Segmentation and POS Tagging for Chu Ci Based on HMM
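Qian Zhiyong, Zhou Jianzhong, Tong Guoping, Su Xinning. Research on Automatic Word Segmentation and POS Tagging for Chu Ci Based on HMM[J]. Library and Information Service (图书情报工作), 2014, 58(04): 105-110. DOI: 10.13266/j.issn.0252-3116.2014.04.017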
This paper reviews word segmentation and part-of-speech (POS) tagging techniques for ancient and modern Chinese, and then carries out an automatic word segmentation and POS tagging experiment on Chu Ci using a Hidden Markov Model (HMM). The text is first fully segmented, and the POS tagging probabilities of the candidate segmentations are computed with an add-value smoothing algorithm and compared; the candidate with the maximum probability is taken as the final segmentation and tagging result. By adjusting the modules and parameters of the segmentation and tagging program through experiments, an assistive tool for word segmentation and POS tagging is obtained. In the open test, the F-score of word segmentation is 85% and the F-score of POS tagging is 55%, which is 14 percentage points higher than the baseline F-score.
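The pipeline summarized in the abstract (full segmentation, HMM tagging with add-value smoothing, then selection of the maximum-probability candidate) can be illustrated with a minimal sketch. It is not the authors' program: the two-sentence corpus, the tag labels, the smoothing constant `DELTA`, the `max_len` limit and all function names are assumptions made only for illustration.

```python
import math
from collections import defaultdict

# Hypothetical toy corpus of (word, tag) pairs; the tag labels are illustrative only.
CORPUS = [
    [("帝", "n"), ("高阳", "nr"), ("之", "u"), ("苗裔", "n"), ("兮", "y")],
    [("朕", "r"), ("皇考", "n"), ("曰", "v"), ("伯庸", "nr")],
]
DELTA = 0.5  # assumed add-value smoothing constant


def train(corpus):
    """Count tag transitions and word emissions from a tagged corpus."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for sent in corpus:
        prev = "<s>"  # sentence-start pseudo tag
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, sorted(emit), vocab


def log_p(counts, given, event, n_events):
    """Add-value smoothed log P(event | given) over n_events possible outcomes."""
    row = counts.get(given, {})
    total = sum(row.values())
    return math.log((row.get(event, 0) + DELTA) / (total + DELTA * n_events))


def full_segmentations(text, vocab, max_len=4):
    """Enumerate every split into lexicon words, with single characters as fallback."""
    if not text:
        return [[]]
    results = []
    for k in range(1, min(max_len, len(text)) + 1):
        piece = text[:k]
        if k == 1 or piece in vocab:
            for rest in full_segmentations(text[k:], vocab, max_len):
                results.append([piece] + rest)
    return results


def viterbi(words, model):
    """Return (log probability, tags) of the best tag sequence for one segmentation."""
    trans, emit, tags, vocab = model
    n_words = len(vocab) + 1  # one extra slot reserved for unseen words
    best = {"<s>": (0.0, [])}
    for w in words:
        nxt = {}
        for t in tags:
            e = log_p(emit, t, w, n_words)
            nxt[t] = max(
                ((s + log_p(trans, p, t, len(tags)) + e, path + [t])
                 for p, (s, path) in best.items()),
                key=lambda item: item[0],
            )
        best = nxt
    return max(best.values(), key=lambda item: item[0])


def segment_and_tag(text, model):
    """Tag every candidate segmentation and keep the most probable one."""
    best = None
    for words in full_segmentations(text, model[3]):
        score, tags = viterbi(words, model)
        if best is None or score > best[0]:
            best = (score, words, tags)
    return best


if __name__ == "__main__":
    model = train(CORPUS)
    score, words, tags = segment_and_tag("帝高阳之苗裔兮", model)
    print(list(zip(words, tags)), round(score, 3))
```

Because candidates are compared by joint probability, segmentations with more tokens or with unseen single characters are penalized, so the lexicon words are preferred in this toy setting. Exhaustive enumeration grows exponentially with sentence length; a real system would more likely search a word lattice.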
Key words: HMM; Chu Ci; automatic word segmentation; POS tagging; ancient Chinese word segmentation