[Purpose/significance] Data source description, or resource representation, is a key issue in Deep Web integrated retrieval, because its quality directly affects the efficiency and effectiveness of an integrated retrieval system. This paper proposes a data source description approach based on domain features and user query-based sampling, to provide a reference for applications and research on resource integration in non-cooperative environments. [Method/process] The approach is an offline sampling method for heterogeneous, non-cooperative data sources. By analyzing each data source and the features of its subject domain, it constructs, in turn, a domain feature word set, an initial feature word set, and a high-frequency feature word set, and then obtains the data source description by query-based sampling with the high-frequency feature words. The paper then analyzes how to calculate the relevance between a query and the data source descriptions using the inference-network-based CORI algorithm, and designs and develops a Deep Web integrated retrieval system based on the Lemur toolkit to test the effectiveness of the approach. [Result/conclusion] The results show that the method achieves high recall and precision. Compared with other approaches, it has a distinct cost advantage and good practical value for automatic data updating and for operation and maintenance management.
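As an illustration of the two steps named in the abstract, the following is a minimal Python sketch, not the authors' implementation and not the Lemur toolkit API. It probes toy in-memory sources with a few feature words to build sampled descriptions, then ranks the sources for a query with the commonly published form of the CORI belief formula (default belief 0.4, constants 50 and 150). The toy documents, the seed words, and the helper names `make_source`, `build_description`, and `cori_score` are all hypothetical; the abstract does not state the paper's own parameter settings.

```python
import math
from collections import Counter


def make_source(docs):
    """Hypothetical stand-in for a Deep Web search interface over `docs`."""
    def search(word):
        return [(i, t) for i, t in enumerate(docs) if word in t.lower().split()]
    return search


def build_description(search, feature_words, docs_per_query=4):
    """Query-based sampling: probe the source with each feature word and
    summarize the distinct documents returned."""
    sampled = {}
    for word in feature_words:
        for doc_id, text in search(word)[:docs_per_query]:
            sampled[doc_id] = text
    df = Counter()  # document frequency of each term within the sample
    for text in sampled.values():
        df.update(set(text.lower().split()))
    cw = sum(len(text.split()) for text in sampled.values())  # sample size in words
    return {"df": df, "cw": cw}


def cori_score(query_terms, name, descriptions, b=0.4):
    """Average CORI belief of the query terms for one source description."""
    C = len(descriptions)
    avg_cw = sum(d["cw"] for d in descriptions.values()) / C
    desc = descriptions[name]
    beliefs = []
    for term in query_terms:
        df = desc["df"].get(term, 0)
        # cf: number of source descriptions that contain the term
        cf = sum(1 for d in descriptions.values() if d["df"].get(term, 0) > 0)
        if cf == 0:
            beliefs.append(b)  # term unseen everywhere: default belief only
            continue
        T = df / (df + 50 + 150 * desc["cw"] / avg_cw)
        I = math.log((C + 0.5) / cf) / math.log(C + 1.0)
        beliefs.append(b + (1 - b) * T * I)
    return sum(beliefs) / len(beliefs)


# Toy data: two sources and assumed high-frequency feature words for each.
sources = {
    "library": (make_source([
        "deep web retrieval of library catalog records",
        "library catalog metadata and retrieval",
        "digital library resource integration",
    ]), ["library", "retrieval"]),
    "medicine": (make_source([
        "clinical trial of a new drug",
        "drug dosage and clinical outcomes",
        "patient records retrieval in clinical systems",
    ]), ["clinical", "drug"]),
}

descriptions = {
    name: build_description(search, words)
    for name, (search, words) in sources.items()
}
query = ["library", "catalog"]
scores = {name: cori_score(query, name, descriptions) for name in descriptions}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

In this sketch a source whose sampled description contains none of the query terms falls back to the default belief `b`, so only sources with matching sampled vocabulary rise above that baseline.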
Yuan Guohua, Kou Jingjing, Li Fang. Data Source Description Approach for Deep Web Based on Domain Features and User Query-based Sampling[J]. Library and Information Service, 2017, 61(15): 138-145.
DOI: 10.13266/j.issn.0252-3116.2017.15.016