Library and Information Service
A Compound Word Based Algorithm for Hot Event Detection and Description on the Web
Received date: 2016-05-13
Revised date: 2016-11-15
Online published: 2016-12-05
[Purpose/significance] Automatically detecting hot events on the Web (from news and microblogs) and extracting words that describe them is important for monitoring online public opinion. [Method/process] Existing methods for extracting descriptive words rely mainly on association rules or on combinations of multiple n-grams, which often produce noise words with imprecise meaning and can cause meaning drift. This paper proposes a compound word based feature extraction method and uses it to represent news texts. The resulting features are placed in a vector space model, and the news texts are clustered to detect hot events on the Web. [Result/conclusion] Experimental results on Tencent Internet News show that the proposed method achieves higher clustering precision and recall and produces better descriptive words.
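The general idea of clustering news texts in a vector space model can be illustrated with a minimal sketch. It is not the algorithm from the paper: it assumes each news item has already been segmented into compound-word tokens, weights terms by raw frequency, and uses a simple single-pass clustering with a cosine-similarity threshold. The function names (`cosine`, `single_pass_cluster`, `describe`), the threshold value, and the sample data are all hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass_cluster(docs, threshold=0.3):
    """Assign each document to the most similar cluster, or open a new one.

    docs: list of token lists (assumed to be compound words already).
    threshold: assumed similarity cut-off, not a value from the paper.
    """
    clusters = []
    for i, tokens in enumerate(docs):
        vec = Counter(tokens)
        best, best_sim = None, 0.0
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            best["centroid"].update(vec)  # accumulate term counts
            best["members"].append(i)
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return clusters


def describe(cluster, k=3):
    """Return the k most frequent terms of a cluster as descriptive words."""
    return [t for t, _ in cluster["centroid"].most_common(k)]


# Hypothetical, pre-segmented sample documents.
docs = [
    ["earthquake", "rescue team", "sichuan"],
    ["earthquake", "sichuan", "aftershock"],
    ["stock market", "interest rate"],
    ["interest rate", "central bank", "stock market"],
]
clusters = single_pass_cluster(docs)
for c in clusters:
    print(c["members"], describe(c, k=2))
```

In this sketch a cluster with many members would count as a hot event, and its highest-weight terms serve as the description; a real system would likely use a weighting such as TF-IDF rather than raw counts.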
Li Xia, Wang Lianxi, Lu Meixiu, Liu Hanfeng, Liu Junyan. A Compound Word Based Algorithm for Hot Event Detection and Description on the Web[J]. Library and Information Service, 2016, 60(23): 128-134. DOI: 10.13266/j.issn.0252-3116.2016.23.016