Federated Topic Modeling

Cited by: 34
Authors
Jiang, Di [1]
Song, Yuanfeng [1,4]
Tong, Yongxin [2,3]
Wu, Xueyang [4]
Zhao, Weiwei [1]
Xu, Qian [1]
Yang, Qiang [1,4]
Affiliations
[1] WeBank Co Ltd, AI Grp, Shenzhen, Peoples R China
[2] Beihang Univ, SKLSDE Lab, BDBC, Beijing, Peoples R China
[3] Beihang Univ, IRI, Beijing, Peoples R China
[4] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Source
PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM '19) | 2019
Funding
National Science Foundation (USA)
Keywords
Text Semantics; Topic Model; Bayesian Networks; ALGORITHMS;
DOI
10.1145/3357384.3357909
Chinese Library Classification (CLC)
TP301 [Theory, Methods]
Discipline Code
081202
Abstract
Topic modeling has been widely applied in a variety of industrial applications. Training a high-quality model usually requires a massive amount of in-domain data in order to provide comprehensive co-occurrence information for the model to learn. However, industrial data such as medical or financial records are often proprietary or sensitive, which precludes uploading them to data centers. Hence, training topic models in industrial scenarios with conventional approaches faces a dilemma: a party (i.e., a company or institute) has to either tolerate data scarcity or sacrifice data privacy. In this paper, we propose a novel framework named Federated Topic Modeling (FTM), in which multiple parties collaboratively train a high-quality topic model, simultaneously alleviating data scarcity and remaining immune to privacy adversaries. FTM is inspired by federated learning and consists of novel techniques such as private Metropolis-Hastings, topic-wise normalization, and heterogeneous model integration. We conduct a series of quantitative evaluations to verify the effectiveness of FTM and deploy it in an Automatic Speech Recognition (ASR) system to demonstrate its utility in real-life applications. Experimental results verify FTM's superiority over conventional topic modeling.
Pages: 1071-1080
Number of pages: 10
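
The abstract only names FTM's components, so a minimal Python sketch of one federated training round is given below for intuition. It assumes a plain LDA model trained with collapsed Gibbs sampling, Laplace-perturbed word-topic count deltas as a stand-in for the paper's private Metropolis-Hastings step, and a per-topic renormalization at the coordinator; the names local_gibbs_round and aggregate, the hyperparameters ALPHA and BETA, and the toy corpora are hypothetical and not taken from the paper.

# Hypothetical sketch of one federated topic-modeling round (not the paper's
# exact FTM algorithm): each party runs local collapsed Gibbs sampling for LDA
# and shares only a perturbed word-topic count delta; the coordinator sums the
# deltas and renormalizes each topic's word distribution.
import numpy as np

K, V = 20, 5000          # number of topics and vocabulary size (assumed)
ALPHA, BETA = 0.1, 0.01  # symmetric Dirichlet hyperparameters (assumed)
rng = np.random.default_rng(0)

def local_gibbs_round(docs, global_wt, iters=5):
    """Run a few local Gibbs sweeps; return a noised word-topic count delta."""
    wt = global_wt.astype(float)                   # word-topic counts, shape (V, K)
    dt = np.zeros((len(docs), K))                  # doc-topic counts
    assign = [rng.integers(K, size=len(d)) for d in docs]
    for d, (doc, z) in enumerate(zip(docs, assign)):
        for w, k in zip(doc, z):
            wt[w, k] += 1
            dt[d, k] += 1
    for _ in range(iters):
        for d, (doc, z) in enumerate(zip(docs, assign)):
            for i, w in enumerate(doc):
                k_old = z[i]
                wt[w, k_old] -= 1; dt[d, k_old] -= 1
                # Collapsed Gibbs conditional: p(z=k) proportional to
                # (n_dk + alpha) * (n_wk + beta) / (n_k + V*beta)
                p = (dt[d] + ALPHA) * (wt[w] + BETA) / (wt.sum(axis=0) + V * BETA)
                k_new = rng.choice(K, p=p / p.sum())
                z[i] = k_new
                wt[w, k_new] += 1; dt[d, k_new] += 1
    delta = wt - global_wt
    # Assumed Laplace perturbation for illustration; the paper's private
    # Metropolis-Hastings protects the sampling step itself rather than
    # noising raw counts.
    return delta + rng.laplace(0.0, 1.0, size=delta.shape)

def aggregate(global_wt, deltas):
    """Coordinator: sum parties' deltas, clip, and normalize topic-wise."""
    wt = np.clip(global_wt + np.sum(deltas, axis=0), 0.0, None)
    phi = (wt + BETA) / (wt.sum(axis=0, keepdims=True) + V * BETA)  # topic-word dists
    return wt, phi

# Toy corpora for two parties: each document is a list of word ids.
parties = [
    [[1, 2, 3, 2], [4, 5, 1]],       # party A's local documents
    [[6, 7, 8], [2, 9, 10, 6]],      # party B's local documents
]
global_wt = np.zeros((V, K))
deltas = [local_gibbs_round(docs, global_wt) for docs in parties]
global_wt, phi = aggregate(global_wt, deltas)

The design point mirrored here is that raw documents and topic assignments never leave a party; only aggregated, perturbed count statistics are exchanged in each communication round.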