A Method for Generating Synthetic Electronic Medical Record Text

Cited by: 20

Authors:

Guan, Jiaqi [1]

Li, Runzhe [2]

Yu, Sheng [3]

Zhang, Xuegong [4,5,6]

Affiliations:

[1] Univ Illinois, Dept Comp Sci, Champaign, IL 61820 USA

[2] Johns Hopkins Univ, Dept Biostat, Baltimore, MD 21218 USA

[3] Tsinghua Univ, Ctr Stat Sci, Inst Data Sci, Dept Ind Engn, Beijing 100084, Peoples R China

[4] Tsinghua Univ, Dept Automat, Beijing 100084, Peoples R China

[5] Beijing Natl Res Ctr Informat Sci & Technol BNRIS, MOE Key Lab Bioinformat & Bioinformat Div, Beijing 100084, Peoples R China

[6] Tsinghua Univ, Ctr Synthet & Syst Biol, Beijing 100084, Peoples R China

Source:

IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS | 2021 / Vol. 18 / No. 01

Funding:

National Key Research and Development Program of China; National Natural Science Foundation of China;

Keywords:

Synthetic electronic medical record text; conditional model; generative adversarial network; reinforcement learning;

DOI:

10.1109/TCBB.2019.2948985

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Discipline classification code:

071010 ; 081704 ;

Abstract:

Machine learning (ML) and natural language processing (NLP) have achieved remarkable success in many fields and have brought new opportunities and high expectations for the analysis of medical data, the most common type of which is massive free-text electronic medical records (EMR). However, free-text EMRs lack consistent standards, are rich in private information, and are limited in availability. It is also often hard to obtain a balanced number of samples for the types of diseases under study. These problems hinder the development of ML and NLP methods for EMR data analysis. To tackle these problems, we developed a model called Medical Text Generative Adversarial Network, or mtGAN, to generate synthetic EMR text. It is based on the GAN framework and is trained with the REINFORCE algorithm. It takes disease tags as inputs and generates synthetic texts as EMRs for the corresponding diseases. We evaluate the model at the micro level, macro level, and application level on a Chinese EMR text dataset. The results show that the method has a good capacity to fit real data and can generate realistic and diverse EMR samples. This provides a novel way to avoid potential leakage of patient privacy while still supplying sufficient well-controlled cohort data for developing downstream ML and NLP methods.
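The abstract names three ingredients: a generator conditioned on disease tags, a discriminator that supplies the training signal, and the REINFORCE algorithm. The following is only a toy sketch of how those pieces can fit together, not the authors' mtGAN implementation. It assumes a four-word vocabulary, two made-up tags, a per-tag table of logits in place of a neural generator, and a fixed scoring function (`stub_discriminator`) in place of a trained discriminator; all of these names and values are hypothetical.

```python
import math
import random

# Hypothetical toy vocabulary and disease tags (not from the paper).
VOCAB = ["fever", "cough", "fracture", "cast"]
TAGS = {"flu": {"fever", "cough"}, "injury": {"fracture", "cast"}}
SEQ_LEN = 4


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sequence(theta, tag, rng):
    """Sample SEQ_LEN token ids from the tag-conditional policy."""
    probs = softmax(theta[tag])
    return rng.choices(range(len(VOCAB)), weights=probs, k=SEQ_LEN)


def stub_discriminator(tag, tokens):
    """Assumed stand-in for a trained discriminator: the fraction of
    tokens that are plausible for the given tag."""
    return sum(VOCAB[t] in TAGS[tag] for t in tokens) / len(tokens)


def train(steps=2000, lr=0.5, seed=0):
    """REINFORCE with a moving-average baseline on the toy policy."""
    rng = random.Random(seed)
    theta = {tag: [0.0] * len(VOCAB) for tag in TAGS}
    baseline = {tag: 0.0 for tag in TAGS}
    for _ in range(steps):
        tag = rng.choice(sorted(TAGS))
        tokens = sample_sequence(theta, tag, rng)
        reward = stub_discriminator(tag, tokens)
        advantage = reward - baseline[tag]
        baseline[tag] = 0.9 * baseline[tag] + 0.1 * reward
        probs = softmax(theta[tag])
        # Gradient of log softmax w.r.t. logit v is 1[v == token] - probs[v].
        for t in tokens:
            for v in range(len(VOCAB)):
                grad = (1.0 if v == t else 0.0) - probs[v]
                theta[tag][v] += lr * advantage * grad
    return theta


if __name__ == "__main__":
    trained = train()
    for tag in sorted(TAGS):
        probs = softmax(trained[tag])
        print(tag, [round(p, 3) for p in probs])
```

The policy-gradient route is used because sampled tokens are discrete, so the discriminator's score cannot be backpropagated through the sampling step directly; REINFORCE instead weights the log-probability gradient of each sampled sequence by its reward. A real system would replace the logit table with a sequence model and train the discriminator alternately with the generator.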

Citation

Pages: 173-182

Page count: 10
