To address two problems, the loss of text structure and semantic information in the vector space model and the annotation bottleneck that arises when large numbers of samples are unlabeled, this paper proposes a short text classification method based on semi-supervised learning. The method preserves the relationships between samples and makes full use of the unlabeled samples to improve classifier performance. It is a self-training algorithm that learns from the labeled samples and the large number of unlabeled samples jointly over a graph structure, so that the training set is enlarged and then used to build the final text classifier. Comparative experiments show that the proposed semi-supervised short text classification algorithm achieves better classification results.
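The idea of self-training over a graph can be illustrated with a minimal sketch. It is not the authors' algorithm: the bag-of-words cosine similarity used as the edge weight, the similarity-weighted vote, the one-sample-per-round promotion rule, and the names `cosine`, `self_train`, and `threshold` are all assumptions made for illustration, since the abstract does not specify how the graph is built or how confident samples are selected.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def self_train(texts, labels, threshold=0.0):
    """Hypothetical graph-based self-training loop.

    texts  : list of short strings
    labels : list of the same length; None marks an unlabeled sample
    Returns a new label list. Samples with no edge to any labeled
    sample keep the label None.
    """
    vecs = [Counter(t.lower().split()) for t in texts]
    n = len(texts)
    # Assumed graph: fully connected, edge weight = cosine similarity.
    w = [[cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    labels = list(labels)
    while True:
        best = None  # (confidence, score, index, label)
        for i in range(n):
            if labels[i] is not None:
                continue
            # Similarity-weighted vote from currently labeled neighbours.
            votes = {}
            for j in range(n):
                if labels[j] is not None and w[i][j] > 0:
                    votes[labels[j]] = votes.get(labels[j], 0.0) + w[i][j]
            if not votes:
                continue
            lab, score = max(votes.items(), key=lambda kv: kv[1])
            conf = score / sum(votes.values())
            if conf > threshold and (best is None or (conf, score) > best[:2]):
                best = (conf, score, i, lab)
        if best is None:
            break
        # Promote the single most confident sample into the labeled set.
        labels[best[2]] = best[3]
    return labels


texts = [
    "cheap flights to paris",
    "paris hotel deals",
    "python list sort",
    "sort a python dict",
    "cheap hotel deals",
    "python dict keys",
]
seed = ["travel", None, "code", None, None, None]
result = self_train(texts, seed)
```

With only two seed labels, each round adds the most confidently voted sample to the labeled set, so later samples can be reached through neighbours that were themselves unlabeled at the start. The enlarged labeled set could then be used to train any final classifier.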
Zhang Qian, Liu Huailiang. Research on Short Text Classification Based on Semi-supervised Learning by Graph Structure[J]. Library and Information Service, 2013, 57(21): 126-132.
DOI: 10.7536/j.issn.0252-3116.2013.21.020