A Hybrid Semantic Information Extraction Method for Scientific Research Papers

  • Leng Fuhai,
  • Bai Rujiang,
  • Zhu Qingsong
  • 1. Chinese Academy of Sciences, National Science Library, Beijing 100190;
    2. Shandong University of Technology Library, Zibo 255049

Received date: 2013-04-15

Revised date: 2013-05-10

Online published: 2013-06-05

Abstract

Existing knowledge extraction techniques cannot accurately extract the specific theoretical approaches and performance indicator parameters mentioned in academic literature. This paper proposes a hybrid semantic extraction method to address this problem. The method combines semantic annotation, rule-based extraction, and regular expressions to extract the relevant information from scientific literature. First, semantic annotation is used to identify the relevant academic terms. Then, specific extraction rules are constructed to extract the sentences associated with performance indicators. Finally, regular expressions are used to extract the parameter values of the key performance indicators. An experiment in the field of carbon nanotube research shows that the method can rapidly, efficiently, and accurately extract the innovative research content and the performance indicators reported in scientific literature.
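The three stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the term list, the cue words, the unit pattern, and all function names are hypothetical stand-ins chosen for illustration, and simple substring matching takes the place of a real semantic annotation tool.

```python
import re

# Hypothetical term list standing in for the semantic annotation stage.
DOMAIN_TERMS = {"carbon nanotube", "chemical vapor deposition"}

# Hypothetical cue words standing in for the sentence extraction rules.
INDICATOR_CUES = {"tensile strength", "conductivity", "diameter"}

# Assumed pattern: a decimal number followed by one of a few example units.
VALUE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(GPa|nm|S/cm)")


def annotate_terms(sentence):
    """Stage 1: return the domain terms found in the sentence."""
    lower = sentence.lower()
    return sorted(term for term in DOMAIN_TERMS if term in lower)


def is_indicator_sentence(sentence, terms):
    """Stage 2: keep sentences with a domain term and an indicator cue."""
    lower = sentence.lower()
    return bool(terms) and any(cue in lower for cue in INDICATOR_CUES)


def extract_parameters(sentence):
    """Stage 3: pull (value, unit) pairs out with a regular expression."""
    return [(float(v), u) for v, u in VALUE_PATTERN.findall(sentence)]


def extract(text):
    """Run the three stages over each sentence of the text."""
    results = []
    # Naive sentence split on whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        terms = annotate_terms(sentence)
        if is_indicator_sentence(sentence, terms):
            params = extract_parameters(sentence)
            if params:
                results.append({"terms": terms, "parameters": params})
    return results


if __name__ == "__main__":
    sample = ("The carbon nanotube fibers reached a tensile strength of 1.5 GPa. "
              "We thank our colleagues.")
    print(extract(sample))
```

Each stage narrows the input for the next one, so the regular expression only runs on sentences that already mention a domain term and an indicator cue. This ordering is one plausible reading of why the hybrid design helps precision.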

Cite this article

Leng Fuhai, Bai Rujiang, Zhu Qingsong. A Hybrid Semantic Information Extraction Method for Scientific Research Papers[J]. Library and Information Service, 2013, 57(11): 112-119. DOI: 10.7536/j.issn.0252-3116.2013.11.021

