[Purpose/significance] Healthcare big data is an important basic strategic resource in China. Word segmentation and entity recognition in Chinese electronic medical records (EMRs) help extract important information from large volumes of unstructured text. [Method/process] This study first builds a Chinese medical thesaurus from authoritative medical subject headings, official standards, and health website data; it then compares four segmentation methods on a manually segmented and annotated corpus; finally, a conditional random field (CRF) model is used to identify five entity types: disease, symptom, test, drug, and treatment. [Result/conclusion] The results show that (i) the Aho-Corasick (AC) automaton model achieves the best F-measure in EMR word segmentation, at 82%; and (ii) medical entities are harder to identify in traditional Chinese medicine records than in Western medical records. In addition, the "Test" and "Disease" entities achieve higher F-measures, while the F-measure for the "Symptom" entity is comparatively poor.
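As an illustration of the dictionary-based approach named in the abstract, the following is a minimal sketch of AC automaton matching used for word segmentation, together with a span-based F-measure. It is not the authors' implementation: the tiny word list, the longest-match rule for overlapping hits, the single-character fallback for unmatched text, and the function names `build_automaton`, `find_matches`, `segment`, and `f_measure` are all assumptions made for this example.

```python
from collections import deque


def build_automaton(words):
    """Build an Aho-Corasick automaton (trie + failure links) from a word list."""
    goto, fail, out = [{}], [0], [[]]  # per-state transitions, failure link, match lengths
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(len(word))
    # Breadth-first pass: depth-1 states keep failure link 0, deeper ones are derived.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit matches ending at the suffix state
            queue.append(nxt)
    return goto, fail, out


def find_matches(text, automaton):
    """Return (start, end) spans of every dictionary word found in text."""
    goto, fail, out = automaton
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for length in out[state]:
            matches.append((i - length + 1, i + 1))
    return matches


def segment(text, automaton):
    """Assumed policy: take the longest dictionary word at each position,
    otherwise emit a single character."""
    longest = {}
    for start, end in find_matches(text, automaton):
        longest[start] = max(longest.get(start, 0), end)
    tokens, i = [], 0
    while i < len(text):
        end = longest.get(i, i + 1)
        tokens.append(text[i:end])
        i = end
    return tokens


def f_measure(pred_tokens, gold_tokens):
    """Span-based F-measure: a token counts as correct if both boundaries match."""
    def spans(tokens):
        pos, result = 0, set()
        for tok in tokens:
            result.add((pos, pos + len(tok)))
            pos += len(tok)
        return result

    pred, gold = spans(pred_tokens), spans(gold_tokens)
    correct = len(pred & gold)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(pred), correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical thesaurus entries and sentence, for illustration only.
automaton = build_automaton(["高血压", "血压", "糖尿病", "头痛"])
tokens = segment("患者有高血压和糖尿病", automaton)
print(tokens)  # ['患', '者', '有', '高血压', '和', '糖尿病']
```

In this toy run the thesaurus terms are recovered as whole tokens, while the out-of-dictionary word 患者 falls back to single characters. This suggests why thesaurus coverage matters for a dictionary-driven segmenter.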
Wang Ruojia, Cho Sang, Wang Jimin. Healthcare Data Mining: Word Segmentation and Named Entity Recognition in Chinese Electronic Medical Record[J]. Library and Information Service, 2019, 63(2): 34-42.
DOI: 10.13266/j.issn.0252-3116.2019.02.004