[Purpose/significance] This paper explores the theory and practice of data cleaning in the identification of research hotspots by keyword co-occurrence, with the aim of helping researchers identify the research hotspots of a field accurately. [Method/process] Drawing on a literature review, it sets out the definition and objects of data cleaning, analyses the causes and effects of dirty data, and on that basis develops the steps and scheme of data cleaning. An empirical study then tests the effect of the cleaning and the feasibility of the scheme. [Result/conclusion] The results show that the data cleaning scheme improves the accuracy of research hotspot identification, which demonstrates its feasibility.