Link Filtering Algorithm of Domain Name in View of the Crawler
Received date: 2014-07-21
Revised date: 2014-09-01
Online published: 2014-10-30
Wen Yang, Chen Wenyu, Yuan Ye, Zhu Jian. Link Filtering Algorithm of Domain Name in View of the Crawler[J]. Library and Information Service, 2014, 58(20): 125-130. DOI: 10.13266/j.issn.0252-3116.2014.20.019
Abstract: Topic-based link filtering is widely used in focused crawlers, but it considers only the relevance between the crawl topic and the web page and ignores the structural characteristics of the link itself. This paper proposes a link filtering algorithm based on domain names. The algorithm exploits the structural characteristics of the domain name in a web link and uses topic-based link filtering as an auxiliary step to identify useless junk links. Compared with topic-based filtering alone, domain-based filtering judges links more comprehensively and filters them more effectively, which improves both the crawling efficiency of the web crawler and the efficiency of information retrieval. Simulation experiments demonstrate the validity of the algorithm.
Key words: Web crawler; link filtering; domain filtering; topic filtering
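The abstract does not give the concrete filtering rules, so the following is only a minimal sketch of the general idea: a domain-structure check is applied first, and a topic check is applied as an auxiliary step. The function names (`host_suffix`, `keep_link`), the two-label host comparison, and the anchor-text keyword test are all assumptions made for illustration; they are not the authors' algorithm.

```python
from urllib.parse import urlsplit


def host_suffix(url, labels=2):
    """Return the last `labels` labels of the URL's host, lowercased.

    Naive assumption: ignores multi-part public suffixes such as "co.uk".
    """
    host = urlsplit(url).hostname or ""
    return ".".join(host.lower().split(".")[-labels:])


def keep_link(url, anchor_text, seed_url, topic_terms):
    """Hypothetical filter: domain structure first, topic as an auxiliary check."""
    parts = urlsplit(url)
    # Discard links that are not ordinary web links (mailto:, javascript:, ...).
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    # Domain check (assumed rule): keep only links under the seed's domain.
    if host_suffix(url) != host_suffix(seed_url):
        return False
    # Auxiliary topic check (assumed rule): anchor text shares a topic term.
    words = set(anchor_text.lower().split())
    return bool(words & set(topic_terms))


seed = "http://www.example.com/"
topic = {"crawler"}
print(keep_link("http://news.example.com/crawler.html", "web crawler survey", seed, topic))
print(keep_link("http://ads.other.net/x", "web crawler", seed, topic))
```

In this sketch the domain check runs before the topic check, so off-domain and non-HTTP links are rejected without any text comparison. A real crawler would likely replace the two-label comparison with a public-suffix-aware one.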