Research on the Text Clustering Method of Science and Technology Reports Based on the Topic Model

Qu Jingye; Chen Zhen; Zheng Yanning

doi:10.13266/j.issn.0252-3116.2018.04.015

Library and Information Service, 2018, Vol. 62, Issue 4: 113-120

1. Information Technology and Media College of Beihua University, Jilin 132013;
2. Institute of Scientific and Technical Information of China, Beijing 100038

Received date: 2017-08-12

Revised date: 2017-11-13

Online published: 2018-02-20

Abstract

[Purpose/significance] This paper explores a topic-model-based method for clustering the text of science and technology reports, extends technology monitoring to a new type of scientific literature, and puts forward a new semantic analysis method for science and technology reports. [Method/process] Using the National Science and Technology Report Service system as the data source, the study first preprocessed the text and mined topics with the LDA model. It then clustered the abstracts, represented as text vectors carrying topic-distribution information, with a combination of K-means and Ward's method. On this basis, a text clustering method suited to mining science and technology reports was proposed. [Result/conclusion] The experimental results show that the LDA model can mine the topics of science and technology reports effectively and accurately, and that the proposed combination of Ward's method and K-means clusters these reports better than other traditional clustering algorithms.

Keywords: science and technology report; topic model; LDA; text clustering
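
The clustering step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how Ward and K-means are combined, so the code below assumes one common scheme: Ward's agglomerative clustering produces k groups, their centroids seed K-means, and K-means then refines the assignment. The function names, the toy `doc_topics` matrix, and the choice of k = 3 are hypothetical; the matrix stands in for the LDA document-topic distributions of report abstracts. This is not the authors' implementation.

```python
from math import fsum


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [fsum(col) / n for col in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return fsum((x - y) ** 2 for x, y in zip(a, b))


def ward_clusters(vectors, k):
    """Agglomerative clustering with Ward's criterion, stopped at k clusters.

    Each step merges the pair of clusters whose union increases the total
    within-cluster sum of squares the least.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = centroid([vectors[i] for i in clusters[a]])
                cb = centroid([vectors[i] for i in clusters[b]])
                na, nb = len(clusters[a]), len(clusters[b])
                cost = na * nb / (na + nb) * sq_dist(ca, cb)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def kmeans(vectors, centers, max_iter=100):
    """Lloyd's K-means started from the given centers."""
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda c: sq_dist(v, centers[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        for c in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = centroid(members)
    return labels, centers


def ward_then_kmeans(vectors, k):
    """Assumed combination: Ward supplies initial centers, K-means refines."""
    seeds = [centroid([vectors[i] for i in c]) for c in ward_clusters(vectors, k)]
    return kmeans(vectors, seeds)


# Hypothetical document-topic distributions (each row sums to 1), standing in
# for the LDA output of six report abstracts over three topics.
doc_topics = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.85, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.10, 0.85],
    [0.10, 0.05, 0.85],
]

labels, centers = ward_then_kmeans(doc_topics, k=3)
print(labels)  # [0, 0, 1, 1, 2, 2]
```

Seeding K-means from a Ward partition removes the dependence on random initial centers, which is one plausible motivation for combining the two methods.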

Cite this article

Qu Jingye, Chen Zhen, Zheng Yanning. Research on the Text Clustering Method of Science and Technology Reports Based on the Topic Model[J]. Library and Information Service, 2018, 62(4): 113-120. DOI: 10.13266/j.issn.0252-3116.2018.04.015

