Abstract
This paper proposes an RSS-level method and system for extracting the main content of web pages. The method uses the metadata of a small number of entries in an RSS feed to train a main-content template, and then applies that template to extract the main content of all web pages under the feed. It supports extracting the title, body, category, and other fields separately. It also has a self-adaptation mechanism that detects template changes in real time. Experimental results show that the method and system achieve high recall and precision.
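The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the template is represented or trained, so the sketch assumes a template is a tag path (tag names plus `class` attribute) per field, chosen by majority vote over the paths whose text best matches each RSS entry's title and description under word-level Jaccard similarity. All names here (`PathTextCollector`, `learn_template`, `extract`, `template_is_stale`, `demo_page`) are hypothetical, and treating a missing path as a sign of template change is likewise an assumed heuristic.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathTextCollector(HTMLParser):
    """Collect the text found directly under each tag path, e.g. 'html/body/div.post'."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.text_by_path = {}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        css_class = dict(attrs).get("class")
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "/".join(self.stack)
            joined = (self.text_by_path.get(path, "") + " " + text).strip()
            self.text_by_path[path] = joined


def paths_of(html):
    collector = PathTextCollector()
    collector.feed(html)
    return collector.text_by_path


def jaccard(a, b):
    """Word-level Jaccard similarity (an assumed matching criterion)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def best_path(html, reference_text):
    """Return the tag path whose text is most similar to the RSS field."""
    paths = paths_of(html)
    return max(paths, key=lambda p: jaccard(paths[p], reference_text))


def learn_template(samples):
    """samples: list of (page_html, rss_title, rss_description) tuples."""
    title_votes, body_votes = Counter(), Counter()
    for html, title, description in samples:
        title_votes[best_path(html, title)] += 1
        body_votes[best_path(html, description)] += 1
    return {
        "title": title_votes.most_common(1)[0][0],
        "body": body_votes.most_common(1)[0][0],
    }


def extract(html, template):
    """Apply a learned template to any page of the same feed."""
    paths = paths_of(html)
    return {field: paths.get(path) for field, path in template.items()}


def template_is_stale(html, template):
    """Assumed heuristic: a missing path suggests the site template changed."""
    return any(value is None for value in extract(html, template).values())


def demo_page(title, body):
    """Hypothetical page layout shared by all entries of one feed."""
    return (
        "<html><head><title>Site</title></head><body>"
        '<div class="nav">Home About</div>'
        f'<h1 class="headline">{title}</h1>'
        f'<div class="post">{body}</div>'
        '<div class="footer">Copyright</div></body></html>'
    )


samples = [
    (demo_page("Alpha release", "The alpha release ships today with fixes"),
     "Alpha release", "The alpha release ships today"),
    (demo_page("Beta plans", "Our beta plans include a new parser"),
     "Beta plans", "Our beta plans include"),
]
template = learn_template(samples)
```

Here two feed entries suffice to learn the paths for the title and body, after which `extract(page_html, template)` can be applied to any other page in the feed. When `template_is_stale` returns true for a newly fetched page, the template would be retrained from the feed's current entries.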
Key words
web page main content extraction /
RSS /
template /
self-adaptation mechanism
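Funding
Supported by the Research Fund of Nanjing University of Information Science and Technology, project "Research and Implementation of a Digital Library Based on the Semantic Web" (project no. SK20080153).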