[Purpose/significance] This paper studies a TF-IDF assisted indexing algorithm built on users' natural annotations, approaching the problem from the users' point of view. [Method/process] First, the keywords and classification numbers of articles in Chinese core journals were taken as the data source, and a vocabulary of users' natural annotations was constructed by computing keyword frequencies and applying the TF-IDF algorithm. Second, feature words were extracted from scientific and technological project data using the IK Analyzer word segmentation software and the TF-IDF algorithm. Finally, the keywords and classification numbers of the scientific and technological project data were indexed synchronously. [Result/conclusion] The experiment indicates that the scientific and technological project data account for 68.1% of the total. Within these projects, the similarity between machine-indexed and human-indexed keywords exceeds 60% overall, and the first three characters of the machine-indexed classification number agree with those of the human-indexed classification number in 83.9% of cases. Adopting the TF-IDF algorithm based on users' natural annotation data is therefore feasible.
Chen Baixue, Song Peiyan. Empirical Research on TF-IDF Assisted Indexing Algorithm Based on Users' Natural Annotation[J]. Library and Information Service, 2018, 62(1): 132-139.
DOI: 10.13266/j.issn.0252-3116.2018.01.017
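The TF-IDF weighting and the classification-number comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the exact weighting formula, so the code assumes the common textbook form (relative term frequency multiplied by the natural log of the inverse document frequency). The function names `tf_idf`, `top_terms`, and `prefix_agreement`, the toy documents, and the sample classification numbers are all hypothetical, and word segmentation (done with IK Analyzer in the paper) is assumed to have happened already.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed formula: (count / doc length) * ln(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def top_terms(weight_map, k=3):
    """Pick the k highest-weighted terms as candidate index keywords."""
    # Sort by descending weight, then alphabetically for a stable order.
    ranked = sorted(weight_map.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def prefix_agreement(machine_codes, human_codes, n=3):
    """Share of records whose first n characters of the two codes match."""
    pairs = list(zip(machine_codes, human_codes))
    if not pairs:
        return 0.0
    hits = sum(1 for m, h in pairs if m[:n] == h[:n])
    return hits / len(pairs)


# Hypothetical, already-segmented toy documents (not data from the paper).
docs = [
    ["indexing", "algorithm", "indexing", "keyword"],
    ["keyword", "library", "user"],
    ["algorithm", "user", "annotation", "annotation"],
]
weights = tf_idf(docs)
best = top_terms(weights[0], k=1)

# Hypothetical classification numbers: the first pair agrees on its first
# three characters, the second pair does not.
agreement = prefix_agreement(["TP391.1", "G254.3"], ["TP391.3", "G350.7"])
print(best, agreement)
```

In this toy corpus, "indexing" occurs twice in the first document and nowhere else, so it receives the highest weight there. A term that appears in every document would get a weight of zero under the assumed formula.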