TREECSS: An Efficient Framework for Vertical Federated Learning

Cited: 0
Authors
Zhang, Qinbo [1 ]
Yang, Xiao [2 ]
Ding, Yukai [1 ]
Xu, Quanqing [3 ]
Hu, Chuang [1 ]
Zhou, Xiaokai [1 ]
Jiang, Jiawei [1 ]
Affiliations
[1] Wuhan Univ, Sch Comp Sci, Wuhan, Peoples R China
[2] Ctr Perceptual & Interact Intelligence CPII, Hong Kong, Peoples R China
[3] OceanBase, Beijing, Peoples R China
Source
DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, PT I, DASFAA 2024 | 2024 / Vol. 14850
Keywords
vertical federated learning; private set intersection; coreset selection; OCEANBASE; DATABASE;
D O I
10.1007/978-981-97-5552-3_29
CLC Classification Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812 ;
Abstract
Vertical federated learning (VFL) considers the setting where the features of data samples are partitioned across different participants. VFL consists of two main steps: identifying the data samples common to all participants (alignment) and training a model on the aligned samples (training). However, when there are many participants and data samples, both alignment and training become slow. We therefore propose TREECSS, an efficient VFL framework that accelerates both steps. For sample alignment, we design an efficient multi-party private set intersection (MPSI) protocol called Tree-MPSI, which adopts a tree-based structure and a data-volume-aware scheduling strategy to parallelize alignment among the participants. Since model training time scales with the number of data samples, we conduct coreset selection (CSS) to choose representative data samples for training. Our CSS method adopts a clustering-based scheme for security and generality: it first clusters the features locally on each participant and then merges the local clustering results to select representative samples. In addition, we weight the selected samples by their distances to the cluster centroids to reflect their importance to model training. We evaluate the effectiveness and efficiency of TREECSS on various datasets and models. The results show that, compared with vanilla VFL, TREECSS accelerates training by up to 2.93x while achieving comparable model accuracy.
Pages: 425-441
Page Count: 17
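To make the tree-based alignment idea concrete, the following is a minimal toy sketch of a tree-structured multi-party intersection. It substitutes plain set intersection for the cryptographic two-party PSI, and it pairs the smallest set with the largest in each round as one plausible reading of "data-volume-aware scheduling"; the function name `tree_mpsi` and the pairing rule are assumptions for illustration, not the paper's exact protocol.

```python
def tree_mpsi(sets):
    """Toy tree-structured multi-party set intersection.

    Plain set intersection stands in for a secure two-party PSI, and
    smallest-with-largest pairing approximates data-volume-aware
    scheduling (an assumption, not the paper's exact rule).
    """
    current = list(sets)
    while len(current) > 1:
        # Sort by size so each round pairs small sets with large ones;
        # the pairwise intersections within a round could run in parallel.
        current.sort(key=len)
        nxt = []
        while len(current) > 1:
            small = current.pop(0)
            large = current.pop(-1)
            nxt.append(small & large)
        nxt.extend(current)  # odd participant carried into the next round
        current = nxt
    return current[0]
```

Each round halves the number of pending sets, so the number of sequential rounds grows logarithmically in the number of participants rather than linearly as in a naive chain of pairwise intersections.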
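The clustering-based coreset selection described in the abstract can likewise be sketched on a single participant's local features. This is a minimal sketch, assuming plain Lloyd's k-means; the representative is the sample nearest each centroid, and its weight combines cluster size with distance to the centroid, which is one plausible distance-based weighting rather than the paper's exact formula. The cross-participant merging of local clusterings is omitted, and the names `kmeans` and `coreset` are illustrative.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=20, seed=0):
    # Plain Lloyd's k-means on one participant's local feature vectors.
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[j].append(p)
        for i, c in enumerate(clusters):
            if c:  # empty clusters keep their old centroid
                centroids[i] = [sum(xs) / len(c) for xs in zip(*c)]
    return centroids, clusters

def coreset(points, k):
    # Pick the sample nearest each centroid as a representative, weighted by
    # cluster size discounted by its distance to the centroid -- a plausible
    # distance-based weighting, not necessarily the paper's formula.
    centroids, clusters = kmeans(points, k)
    selected = []
    for centroid, cluster in zip(centroids, clusters):
        if not cluster:
            continue
        rep = min(cluster, key=lambda p: dist(p, centroid))
        weight = len(cluster) / (1.0 + dist(rep, centroid))
        selected.append((rep, weight))
    return selected
```

Training then proceeds on the weighted representatives instead of the full aligned dataset, which is what yields the reported speedup over vanilla VFL.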