Optimal large-scale stochastic optimization of NDCG surrogates for deep learning

Times cited: 0
Authors
Qiu, Zi-Hao [1 ]
Hu, Quanqi [2 ]
Zhong, Yongjian [3 ]
Tu, Wei-Wei [4 ]
Zhang, Lijun [1 ]
Yang, Tianbao [2 ]
Affiliations
[1] Nanjing Univ, Natl Key Lab Novel Software Technol, Nanjing, Peoples R China
[2] Texas A&M Univ, Dept Comp Sci & Engn, College Stn, TX USA
[3] Univ Iowa, Dept Comp Sci, Iowa City, IA USA
[4] 4Paradigm Inc, Beijing, Peoples R China
Keywords
Stochastic optimization; NDCG; Non-convex optimization; Information retrieval; Algorithms
DOI
10.1007/s10994-024-06631-x
Chinese Library Classification (CLC) number
TP18 [Theory of artificial intelligence]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we introduce principled stochastic algorithms to efficiently optimize Normalized Discounted Cumulative Gain (NDCG) and its top-K variant for deep models. To this end, we first propose novel compositional and bilevel compositional objectives for optimizing NDCG and top-K NDCG, respectively. We then develop two stochastic algorithms to tackle these non-convex objectives, achieving an iteration complexity of $\mathcal{O}(\epsilon^{-4})$ for reaching an $\epsilon$-stationary point. Our methods employ moving average estimators to track the crucial inner functions for gradient computation, effectively reducing approximation errors. In addition, we introduce practical strategies such as initial warm-up and stop-gradient techniques to enhance performance in deep learning. Despite these advancements, the iteration complexity of the two algorithms does not match the optimal $\mathcal{O}(\epsilon^{-3})$ rate for smooth non-convex optimization. To address this issue, we incorporate variance reduction techniques into our framework to estimate the key functions more finely, design new algorithmic mechanisms for solving multiple lower-level problems with parallel speed-up, and propose two types of algorithms. The first type directly tracks these functions with variance-reduced estimators, while the second treats these functions as solutions to minimization problems and employs variance-reduced estimators to construct gradient estimators for solving those problems. We establish the optimal $\mathcal{O}(\epsilon^{-3})$ complexity for both types of algorithms. It is important to highlight that our algorithmic frameworks are versatile and can optimize a wide spectrum of metrics, including Precision@K/Recall@K, Average Precision (AP), mean Average Precision (mAP), and their top-K variants. We further present efficient stochastic algorithms for optimizing these metrics with convergence guarantees. We conduct comprehensive experiments on multiple ranking tasks to verify the effectiveness of our proposed algorithms, which consistently surpass existing strong baselines.
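To make the compositional structure described above concrete, the following is a minimal sketch, assuming a PyTorch implementation, a squared-hinge pairwise rank surrogate, and hypothetical names (smooth_rank_surrogate, MovingAverageNDCGSurrogate, gamma); it is an illustration in the spirit of the abstract, not the authors' reference implementation. The NDCG surrogate for one query can be viewed as a composition $\frac{1}{Z}\sum_i \psi(g_i(\mathbf{w}))$ with outer function $\psi(g) = -(2^{y_i}-1)/\log_2(1+g)$ and inner functions $g_i(\mathbf{w})$ that smoothly approximate the rank of item $i$; the sketch tracks each $g_i$ with a moving average and evaluates the outer derivative at the tracked value (the stop-gradient strategy).

```python
import math
import torch


def smooth_rank_surrogate(scores, margin=1.0):
    """Smooth surrogate of each item's rank within a list:
    rank_i ~= 1 + sum_{j != i} ell(s_j - s_i), with ell a squared hinge (an assumption here)."""
    diff = scores.unsqueeze(1) - scores.unsqueeze(0)        # diff[j, i] = s_j - s_i
    ell = torch.clamp(margin + diff, min=0.0) ** 2          # pairwise squared-hinge surrogate
    mask = 1.0 - torch.eye(scores.numel())                  # drop the j == i terms
    return 1.0 + (ell * mask).sum(dim=0)                    # approximate rank of each item


class MovingAverageNDCGSurrogate:
    """Keeps one moving-average slot per (query, item) pair to track the inner
    rank function g; the outer derivative is evaluated at the tracked value."""

    def __init__(self, num_pairs, gamma=0.1):
        self.u = torch.ones(num_pairs)                      # tracked estimates of g
        self.gamma = gamma                                  # moving-average step size

    def loss(self, pair_ids, scores, gains, ideal_dcg):
        g_hat = smooth_rank_surrogate(scores)               # mini-batch estimate of g
        with torch.no_grad():                               # update tracked values without autograd
            self.u[pair_ids] = (1 - self.gamma) * self.u[pair_ids] + self.gamma * g_hat
        u = self.u[pair_ids]
        # psi(g) = -gain / log2(1 + g), so psi'(u) = gain / ((1 + u) * ln2 * log2(1 + u)^2).
        # The returned scalar is not NDCG itself, but its gradient w.r.t. the model is
        # psi'(u) * grad g_hat, with u entering only through a detached coefficient
        # (the stop-gradient technique mentioned in the abstract).
        psi_grad = gains / ((1.0 + u) * math.log(2.0) * torch.log2(1.0 + u) ** 2)
        return (psi_grad.detach() * g_hat).sum() / ideal_dcg


# Hypothetical usage for one query with four candidate items:
crit = MovingAverageNDCGSurrogate(num_pairs=1000)
scores = torch.randn(4, requires_grad=True)                 # model scores for the query's items
gains = torch.tensor([3.0, 1.0, 0.0, 7.0])                  # relevance gains 2^y - 1
loss = crit.loss(torch.arange(4), scores, gains, ideal_dcg=8.5)
loss.backward()                                             # gradient flows through g_hat only
```

The variance-reduced variants described in the abstract replace this simple moving average with finer estimators and additionally handle the lower-level problems of the top-K bilevel formulation, but the overall pattern of tracking the inner functions and detaching the outer derivative is the same.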
Pages: 70