Special Topic: Natural Language Processing and Text Information Analysis

Research on Domain Adaptation Techniques for Segmentation of Chinese Scientific and Technical Literature

  • Shi Chongde,
  • Qiao Xiaodong,
  • Wang Huilin,
  • Qu Peng
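  • Institute of Scientific and Technical Information of China
Shi Chongde, assistant researcher at the Institute of Scientific and Technical Information of China, E-mail: shicd@istic.ac.cn; Qiao Xiaodong, researcher and chief engineer at the Institute of Scientific and Technical Information of China; Wang Huilin, researcher at the Institute of Scientific and Technical Information of China; Qu Peng, assistant researcher at the Institute of Scientific and Technical Information of China.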

Received: 2014-07-24

Revised: 2014-09-01

Published online: 2014-10-05

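Funding

This paper is one of the research outputs of the Ministry of Science and Technology International Science and Technology Cooperation Program project "Cooperative Research on Practical Japanese-Chinese Bidirectional Machine Translation for Scientific and Technical Literature" (project No. 2014DFA11350) and the National Social Science Fund of China project "Research on Intelligence Analysis Methods and an Integrated Analysis Platform Based on Factual Scientific and Technical Big Data" (project No. 14BTQ038).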

Research on Domain Adaptation Technology of Chinese Science and Technology Literatures Segmentation

  • Shi Chongde ,
  • Qiao Xiaodong ,
  • Wang Huilin ,
  • Qu Peng
  • Institute of Scientific and Technical Information of China, Beijing 100038

Received date: 2014-07-24

  Revised date: 2014-09-01

  Online published: 2014-10-05

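Abstract (translated from Chinese)

Taking biomedical literature as a case study, this paper investigates domain adaptation techniques for segmenting scientific and technical literature. By using dictionary features, domain vocabulary features, substring tagging, and a coarsely segmented corpus produced by dictionary-based segmentation as training data, it adapts a sequence-labeling-based Chinese segmentation method from the news domain to the scientific and technical domain, and achieves good results. The study shows that, in segmenting scientific and technical literature, making full use of domain knowledge to obtain domain-related features plays an important role in improving segmentation accuracy.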
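The idea of using dictionary-segmented text as a coarse training corpus can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which matching strategy or tag set was used, so forward maximum matching and the B/M/E/S labels are assumptions, and `TOY_LEXICON`, `forward_max_match` and `to_bmes` are hypothetical names.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right longest-match segmentation against a lexicon."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def to_bmes(words):
    """Convert a word list into per-character B/M/E/S labels."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


# Hypothetical toy lexicon (assumption); a real run would load a domain dictionary.
TOY_LEXICON = {"胰岛素", "抵抗", "糖尿病", "患者"}

sentence = "糖尿病患者胰岛素抵抗"
coarse_words = forward_max_match(sentence, TOY_LEXICON)
coarse_tags = to_bmes(coarse_words)
```

The resulting character/label pairs are noisy, since greedy matching makes errors on ambiguous strings, but they give a sequence labeler in-domain examples without manual annotation.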

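Cite this article

Shi Chongde, Qiao Xiaodong, Wang Huilin, Qu Peng. Research on Domain Adaptation Techniques for Segmentation of Chinese Scientific and Technical Literature[J]. Library and Information Service (图书情报工作), 2014, 58(19): 13-18. DOI: 10.13266/j.issn.0252-3116.2014.19.002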

Abstract

Segmentation of science and technology (S&T) literature is a basic step in processing S&T documents. Taking biomedical literature as an example, this paper studies domain adaptation techniques for S&T literature segmentation. It uses dictionary features, domain character features, sub-word tagging, and a low-quality in-domain training corpus produced by dictionary-based segmentation to adapt a sequence-labeling-based Chinese segmentation method from the news domain to the S&T domain, and achieves a significant improvement. The results indicate that exploiting domain-specific features derived from domain knowledge plays an important role in improving the segmentation quality of S&T literature.
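One way dictionary features could enter a character-based sequence labeler is sketched below. This is an assumption-laden illustration rather than the paper's feature set: the abstract does not list the actual feature templates, so the match-length features, the context window, and the names `char_features`, `longest_match_start`, `longest_match_end` and `TOY_LEXICON` are all hypothetical.

```python
def longest_match_start(text, i, lexicon, max_len=4):
    """Length of the longest multi-character lexicon word starting at i (0 if none)."""
    for size in range(min(max_len, len(text) - i), 1, -1):
        if text[i:i + size] in lexicon:
            return size
    return 0


def longest_match_end(text, i, lexicon, max_len=4):
    """Length of the longest multi-character lexicon word ending at i (0 if none)."""
    for size in range(min(max_len, i + 1), 1, -1):
        if text[i - size + 1:i + 1] in lexicon:
            return size
    return 0


def char_features(text, i, lexicon):
    """Feature dict for the character at position i."""
    prev_char = text[i - 1] if i > 0 else "<BOS>"
    next_char = text[i + 1] if i + 1 < len(text) else "<EOS>"
    return {
        "char": text[i],
        "prev_char": prev_char,
        "next_char": next_char,
        "bigram_prev": prev_char + text[i],
        "bigram_next": text[i] + next_char,
        # Dictionary features: lengths of lexicon matches anchored at this character.
        "dict_start_len": longest_match_start(text, i, lexicon),
        "dict_end_len": longest_match_end(text, i, lexicon),
    }


# Hypothetical toy lexicon (assumption); a real system would use a domain dictionary.
TOY_LEXICON = {"胰岛素", "抵抗"}

sentence = "胰岛素抵抗"
features = [char_features(sentence, i, TOY_LEXICON) for i in range(len(sentence))]
```

Each feature dict describes one character; a list of such dicts per sentence is the kind of input a conditional random field toolkit typically accepts, so domain knowledge reaches the model without any change to the learning algorithm.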

