Received date: 2014-07-24
Revised date: 2014-09-01
Online published: 2014-10-05
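Funding
This paper is one of the research outcomes of the Ministry of Science and Technology International S&T Cooperation Program project "Cooperative Research on Practical Japanese-Chinese Bidirectional Machine Translation for Scientific and Technical Literature" (Grant No. 2014DFA11350) and the National Social Science Fund of China project "Research on Intelligence Analysis Methods and an Integrated Analysis Platform Based on Factual S&T Big Data" (Grant No. 14BTQ038).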
Research on Domain Adaptation Technology of Chinese Science and Technology Literatures Segmentation
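Shi Chongde, Qiao Xiaodong, Wang Huilin, Qu Peng. Research on Domain Adaptation Technology of Chinese Science and Technology Literatures Segmentation[J]. Library and Information Service (图书情报工作), 2014, 58(19): 13-18. DOI: 10.13266/j.issn.0252-3116.2014.19.002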
Segmentation of science and technology (S&T) literature is a basic step in processing S&T documents. Taking biomedical literature as an example, this paper studies domain adaptation techniques for segmenting S&T literature. It adapts a sequence-labeling-based Chinese segmentation method from the news domain to the S&T domain by using dictionary features, domain character features, sub-word tagging, and a low-quality in-domain training corpus produced by dictionary-based segmentation, and it achieves a significant improvement. The results indicate that exploiting domain-specific features derived from domain knowledge plays an important role in improving the segmentation quality of S&T literature.
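To make the idea of a low-quality in-domain training corpus concrete, the following is a minimal sketch, not the authors' implementation. It assumes forward maximum matching as the dictionary-based segmenter and a B/M/E/S character tag set for sequence labeling; the abstract specifies neither, and the function names, the `max_len` parameter, and the toy lexicon are all hypothetical.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy longest-match segmentation against a word list (assumed method)."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def to_bmes(words):
    """Map each word to per-character position tags (assumed B/M/E/S scheme)."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


# Hypothetical toy lexicon; a real one would come from a domain dictionary.
lexicon = {"蛋白质", "结构", "分析"}
sentence = "蛋白质结构分析"
words = forward_max_match(sentence, lexicon)
tags = to_bmes(words)
for char, tag in zip(sentence, tags):
    print(char, tag)
```

Character and tag pairs produced this way are noisy, because the dictionary segmenter makes mistakes, but they can serve as automatically labeled in-domain examples for a sequence labeler that was originally trained on news text.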