Classification Strategy for Focus Crawling Based on Multi-classifier Combination and Ranking Approach

  • Qiao Jianzhong
  • Information Management Center of PLA Academy of Arts, Beijing 100081

Received date: 2013-06-18

Revised date: 2013-07-05

Online published: 2013-07-20

Abstract

A single classification algorithm generalizes poorly when a focused crawler must crawl and classify Web pages across multiple topics. To address this limitation, the paper proposes a multi-classifier combination built from several strong classification algorithms. The focused crawler evaluates and ranks the classifiers online according to the current topic, then classifies Web pages using the better-ranked classifiers. Classification experiments on crawling tasks for multiple topics compare the accuracy of each individual classification algorithm with the average accuracy of the multi-classifier combination. A combined analysis of two indicators, classification accuracy and classification efficiency, shows that the proposed method is more broadly applicable and, to a certain extent, overcomes the limitations of a single classifier.
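The rank-then-select idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not state which classifiers, ranking criterion, or combination rule the paper uses, so everything here is an assumption made for illustration: the toy keyword classifiers, the use of accuracy on a small labeled sample as the ranking score, the top-k majority vote, and all function names.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: a classifier maps page text to "relevant to the topic?".
Classifier = Callable[[str], bool]
Sample = Tuple[str, bool]


def accuracy(clf: Classifier, samples: Sequence[Sample]) -> float:
    """Fraction of labeled samples that the classifier labels correctly."""
    if not samples:
        return 0.0
    return sum(clf(text) == label for text, label in samples) / len(samples)


def rank_classifiers(
    classifiers: Dict[str, Classifier], samples: Sequence[Sample]
) -> List[Tuple[str, float]]:
    """Score every classifier on the current topic's samples, best first."""
    scored = [(name, accuracy(clf, samples)) for name, clf in classifiers.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def classify_with_top_k(
    classifiers: Dict[str, Classifier],
    samples: Sequence[Sample],
    page: str,
    k: int = 2,
) -> bool:
    """Classify a page by majority vote among the k best-ranked classifiers."""
    top = [name for name, _ in rank_classifiers(classifiers, samples)[:k]]
    votes = Counter(classifiers[name](page) for name in top)
    if votes[True] == votes[False]:
        # Assumed tie-break: defer to the single best-ranked classifier.
        return classifiers[top[0]](page)
    return votes[True] > votes[False]


# Toy stand-ins for real classification algorithms (assumptions, not the paper's).
toy_classifiers: Dict[str, Classifier] = {
    "keyword_crawler": lambda text: "crawler" in text.lower(),
    "keyword_web": lambda text: "web" in text.lower(),
    "always_relevant": lambda text: True,
}

# Small labeled sample for one hypothetical topic ("Web crawling").
topic_samples: List[Sample] = [
    ("Focused crawler design notes", True),
    ("Web crawler scheduling", True),
    ("Web design color palettes", False),
    ("Cooking pasta at home", False),
]

if __name__ == "__main__":
    print(rank_classifiers(toy_classifiers, topic_samples))
    print(classify_with_top_k(toy_classifiers, topic_samples,
                              "A topical crawler for news sites"))
```

Re-running `rank_classifiers` with a different topic's labeled sample changes which classifiers are selected, which is the per-topic adaptation the abstract describes. A real crawler would replace the toy lambdas with trained models.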

Cite this article

Qiao Jianzhong. Classification Strategy for Focus Crawling Based on Multi-classifier Combination and Ranking Approach[J]. Library and Information Service, 2013, 57(14): 114-120. DOI: 10.7536/j.issn.0252-3116.2013.14.019
