Received: 2013-05-02
Revised: 2013-05-13
Published online: 2013-06-05
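Funding
This paper is one of the research outcomes of the National Natural Science Foundation of China General Program project "Research on Knowledge Organization Models and Applications Oriented to Knowledge Services" (Grant No. 71273126) and the National Social Science Fund of China Key Project "Research on Building a Chinese-English Dynamic Terminology Database for the Humanities and Social Sciences" (Grant No. 11AYY002).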
Research of Mining the Category Knowledge Based on Chinese-English Part of Speech Sequence Parallel Corpus in Phrase Level
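Using Bisecting K-means clustering and Lemmatization-based morphological normalization, both selected through dedicated experiments, this paper attempts to mine category knowledge from a phrase-level Chinese-English parallel corpus in the humanities and social sciences. Drawing on the category and title information in the Chinese Social Sciences Citation Index (CSSCI), the Chinese and English corpora are preprocessed and the distributions of nouns, verbs and adjectives are analyzed. Clustering results for different combinations of these parts of speech are then compared to determine the best-performing part-of-speech combination model for category knowledge mining.
Keywords: part-of-speech combination; Bisecting K-means; Chinese-English parallel corpus; category knowledge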
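Wang Dongbo, Han Pu, Shen Gengyu, Shen Si. Research of Mining the Category Knowledge Based on Chinese-English Part of Speech Sequence Parallel Corpus in Phrase Level[J]. Library and Information Service, 2013, 57(11): 106-111, 145. DOI: 10.7536/j.issn.0252-3116.2013.11.020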
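The pipeline summarized in the abstract (keep only selected parts of speech, then cluster with Bisecting K-means) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the toy phrases and their tags are hypothetical and are taken to be already segmented, tagged and lemmatized; the vectors are plain normalized word counts; each split is seeded with the two farthest members; and the largest cluster is the one chosen for the next split.

```python
import math
from itertools import combinations

# Hypothetical toy input: phrases already POS-tagged and lemmatized.
PHRASES = [
    [("model", "NOUN"), ("of", "ADP"), ("economic", "ADJ"), ("growth", "NOUN")],
    [("translate", "VERB"), ("legal", "ADJ"), ("term", "NOUN")],
    [("analyze", "VERB"), ("economic", "ADJ"), ("growth", "NOUN")],
    [("database", "NOUN"), ("of", "ADP"), ("legal", "ADJ"), ("term", "NOUN")],
    [("regional", "ADJ"), ("economic", "ADJ"), ("policy", "NOUN")],
    [("build", "VERB"), ("term", "NOUN"), ("database", "NOUN")],
]
CONTENT_TAGS = {"NOUN", "VERB", "ADJ"}  # one possible part-of-speech combination


def vectorize(phrases, keep_tags):
    """Build L2-normalized bag-of-words vectors from words whose tag is kept."""
    vocab = sorted({w for p in phrases for w, t in p if t in keep_tags})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for phrase in phrases:
        vec = [0.0] * len(vocab)
        for word, tag in phrase:
            if tag in keep_tags:
                vec[index[word]] += 1.0
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(vectors, members):
    """Mean vector of the listed members."""
    dim = len(vectors[0])
    return [sum(vectors[m][d] for m in members) / len(members) for d in range(dim)]


def two_means(vectors, members, iters=10):
    """Split one cluster in two with 2-means, seeded by its two farthest members."""
    s0, s1 = max(combinations(members, 2),
                 key=lambda p: sq_dist(vectors[p[0]], vectors[p[1]]))
    centers = [vectors[s0], vectors[s1]]
    half = len(members) // 2
    parts = (members[:half], members[half:])  # fallback if a side ends up empty
    for _ in range(iters):
        new_parts = ([], [])
        for m in members:
            d0 = sq_dist(vectors[m], centers[0])
            d1 = sq_dist(vectors[m], centers[1])
            new_parts[0 if d0 <= d1 else 1].append(m)
        if not new_parts[0] or not new_parts[1] or new_parts == parts:
            break
        parts = new_parts
        centers = [centroid(vectors, part) for part in parts]
    return parts


def bisecting_kmeans(vectors, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        largest = max(clusters, key=len)
        if len(largest) < 2:
            break  # nothing left to split
        clusters.remove(largest)
        clusters.extend(two_means(vectors, largest))
    return clusters


if __name__ == "__main__":
    vectors = vectorize(PHRASES, CONTENT_TAGS)
    print(bisecting_kmeans(vectors, 2))
```

Changing `CONTENT_TAGS` (for example to `{"NOUN"}` or `{"NOUN", "VERB"}`) and re-running the clustering is one simple way to compare part-of-speech combinations on the same data, which is the kind of comparison the abstract describes.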