FedMatch: Federated Learning Over Heterogeneous Question Answering Data

Citations: 8
Authors
Chen, Jiangui [1]
Zhang, Ruqing
Guo, Jiafeng
Fan, Yixing
Cheng, Xueqi
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, CAS Key Lab Network Data Sci & Technol, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021 | 2021
Funding
National Natural Science Foundation of China;
Keywords
Question Answering; Federated Learning; Privacy Protection;
DOI
10.1145/3459637.3482345
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Question Answering (QA), a popular and promising technique for intelligent information access, faces a data dilemma, as do most other AI techniques. On one hand, modern QA methods rely on deep learning models, which are typically data-hungry. It is therefore desirable to collect and fuse all available QA datasets at a common site to develop a powerful QA model. On the other hand, real-world QA datasets are typically distributed as isolated islands belonging to different parties. Due to the increasing awareness of privacy and security, integrating the scattered data is almost impossible, or the cost is prohibitive. A possible solution to this dilemma is federated learning, a privacy-preserving machine learning technique over distributed datasets. In this work, we propose to adopt federated learning for QA, with special concern for the statistical heterogeneity of the QA data. Here, heterogeneity refers to the fact that annotated QA data are typically not independent and identically distributed (non-IID) and are unbalanced in size in practice. Traditional federated learning methods may sacrifice the accuracy of individual models in this heterogeneous setting. To tackle this problem, we propose a novel Federated Matching framework for QA, named FedMatch, with a backbone-patch architecture. The shared backbone distills the common knowledge of all participants, while the private patch is a compact and efficient module that retains the domain information of each participant. To facilitate evaluation, we build a benchmark collection based on several QA datasets from different domains to simulate the heterogeneous situation in practice. Empirical studies demonstrate that our model achieves significant improvements over the baselines on all the datasets.
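
The backbone-patch idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how backbones are aggregated, so the size-weighted averaging below (in the style of FedAvg) is an assumption, and the names `weighted_average`, `federated_round`, and the client dictionary keys are hypothetical. Parameters are modeled as plain lists of floats to keep the example self-contained.

```python
from typing import Dict, List

Vector = List[float]


def weighted_average(vectors: List[Vector], sizes: List[int]) -> Vector:
    """Average parameter vectors, weighting each by its client's dataset size."""
    total = sum(sizes)
    dim = len(vectors[0])
    return [
        sum(v[i] * n for v, n in zip(vectors, sizes)) / total
        for i in range(dim)
    ]


def federated_round(clients: List[Dict]) -> Vector:
    """One aggregation round: only backbones are shared; patches stay local."""
    backbones = [c["backbone"] for c in clients]
    sizes = [c["num_examples"] for c in clients]
    # Assumption: size-weighted averaging, which also accounts for
    # unbalanced dataset sizes across participants.
    shared = weighted_average(backbones, sizes)
    for c in clients:
        c["backbone"] = list(shared)  # every client receives the same backbone
        # c["patch"] is deliberately untouched: it remains private and
        # domain-specific for each participant.
    return shared


# Two hypothetical participants with unbalanced data sizes.
clients = [
    {"backbone": [1.0, 2.0], "patch": [0.5], "num_examples": 100},
    {"backbone": [3.0, 6.0], "patch": [-0.5], "num_examples": 300},
]
shared = federated_round(clients)
```

After the round, both clients hold the same backbone while each keeps its own patch, which mirrors the split between common knowledge and per-participant domain information.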
Pages: 181-190
Page count: 10