Link Filtering Algorithm of Domain Name in View of the Crawler

  • Wen Yang,
  • Chen Wenyu,
  • Yuan Ye,
  • Zhu Jian
  • 1. Library of University of Electronic Science & Technology of China, Chengdu 611731;
    2. School of Computer Science and Engineering, University of Electronic Science & Technology of China, Chengdu 611731

Received date: 2014-07-21

  Revised date: 2014-09-01

  Online published: 2014-10-30

Abstract

Topic-based link filtering is widely used in focused crawlers, but it considers only the relevance between the crawl topic and a website and ignores the structural characteristics of the website's links themselves. This paper proposes a link filtering algorithm based on domain names, which exploits the structural characteristics of the domain name in a web link and uses topic-based link filtering as an auxiliary step to identify useless junk links. Compared with topic-based filtering alone, the domain-name-based algorithm judges links more comprehensively and filters them more effectively, which improves the efficiency of crawler fetching and of information retrieval. Simulation experiments demonstrate the validity of the algorithm.
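
The abstract does not state the concrete filtering rules, so the following is only an illustrative sketch of how a domain-name check could be combined with an auxiliary topic check. Every name in it (BLOCKED_SUBDOMAIN_LABELS, TOPIC_KEYWORDS, passes_domain_filter, passes_topic_filter, filter_links) and both rules are assumptions made for illustration, not the authors' algorithm. Treating the last two labels of a host as its registered domain is a further simplification.

```python
from urllib.parse import urlparse

# Hypothetical rule sets; the abstract does not list the actual rules.
BLOCKED_SUBDOMAIN_LABELS = {"ad", "ads", "login", "passport"}
TOPIC_KEYWORDS = {"library", "retrieval"}


def host_labels(url):
    """Return the lowercase domain-name labels of a URL's host."""
    host = urlparse(url).hostname or ""
    return host.lower().split(".") if host else []


def passes_domain_filter(url, seed_url):
    """Judge a link by the structure of its domain name (assumed rules)."""
    labels = host_labels(url)
    seed = host_labels(seed_url)
    if not labels:
        return False  # no host at all, e.g. mailto: or javascript: links
    # Assumed rule 1: stay on the seed's domain (last two labels).
    if labels[-2:] != seed[-2:]:
        return False
    # Assumed rule 2: reject sub-domain labels that suggest junk hosts.
    return not (set(labels[:-2]) & BLOCKED_SUBDOMAIN_LABELS)


def passes_topic_filter(anchor_text):
    """Auxiliary check: the anchor text shares a word with the topic."""
    return bool(set(anchor_text.lower().split()) & TOPIC_KEYWORDS)


def filter_links(links, seed_url):
    """Keep the URLs of (url, anchor_text) pairs that pass both checks."""
    return [
        url
        for url, text in links
        if passes_domain_filter(url, seed_url) and passes_topic_filter(text)
    ]
```

In this sketch the domain-name check runs first because it needs only the URL string, so links rejected on structure never reach the topic comparison.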

Cite this article

Wen Yang, Chen Wenyu, Yuan Ye, Zhu Jian. Link Filtering Algorithm of Domain Name in View of the Crawler[J]. Library and Information Service, 2014, 58(20): 125-130. DOI: 10.13266/j.issn.0252-3116.2014.20.019

