A Chinese Text Classification Algorithm Based on Partitioning Community in Semantic Network

  • Yin Liying,
  • Zhao Pengwei
  • 1. School of Economics & Management, Xidian University, Xi'an 710071;
    2. College of Economics and Management, Xi'an University of Post & Telecommunications, Xi'an 710121

Received date: 2014-06-09

Revised date: 2014-08-11

Online published: 2014-10-05

Abstract

To reduce the effects of polysemy and of class skew in the training samples, this paper proposes a Chinese text classification method based on community partitioning of a semantic network. First, word senses are disambiguated using the Wikipedia knowledge base, and a complex network of texts is built to represent the semantic relations among the training texts. Next, to mitigate class skew, the training samples are partitioned into communities by K-means combined with the composite characteristics of the network nodes. Finally, a test text is classified according to its nearest community. Experimental results show that the proposed algorithm is feasible and improves classification performance.
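
The partition-then-nearest-community idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each text has already been reduced to a numeric feature vector (standing in for the network-node characteristics), that K-means is run separately inside each class, and that "nearest" means Euclidean distance to a community centroid. The names `kmeans`, `build_communities`, `classify`, and the toy data are hypothetical.

```python
from math import dist


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; the first k points seed the centroids (assumption)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its closest centroid.
            j = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[j].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(c) / len(members) for c in zip(*members)]
    return centroids


def build_communities(samples, k_per_class=2):
    """Split each class into up to k_per_class communities.

    samples: list of (feature_vector, label) pairs.
    Returns a list of (centroid, label) pairs.
    """
    by_label = {}
    for vec, label in samples:
        by_label.setdefault(label, []).append(vec)
    communities = []
    for label, vecs in by_label.items():
        k = min(k_per_class, len(vecs))
        communities.extend((c, label) for c in kmeans(vecs, k))
    return communities


def classify(vec, communities):
    """Label a test vector with the class of its nearest community centroid."""
    _, label = min(communities, key=lambda c: dist(vec, c[0]))
    return label


# Toy, skewed training set: four "sports" texts, two "finance" texts.
train = [
    ((0.0, 0.0), "sports"), ((0.1, 0.2), "sports"),
    ((5.0, 5.0), "sports"), ((5.1, 4.9), "sports"),
    ((10.0, 0.0), "finance"), ((9.8, 0.3), "finance"),
]
communities = build_communities(train)
print(classify((4.8, 5.2), communities))
print(classify((9.9, 0.1), communities))
```

Because a test text is compared against a handful of community centroids rather than against every training sample, a large class cannot dominate the decision simply by having more samples, which is one plausible reading of how partitioning helps with class skew.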

Cite this article

Yin Liying, Zhao Pengwei. A Chinese Text Classification Algorithm Based on Partitioning Community in Semantic Network[J]. Library and Information Service, 2014, 58(19): 124-128. DOI: 10.13266/j.issn.0252-3116.2014.19.019
