Optimal large-scale stochastic optimization of NDCG surrogates for deep learning

Times Cited: 0
Authors
Qiu, Zi-Hao [1 ]
Hu, Quanqi [2 ]
Zhong, Yongjian [3 ]
Tu, Wei-Wei [4 ]
Zhang, Lijun [1 ]
Yang, Tianbao [2 ]
Affiliations
[1] Nanjing University, National Key Laboratory for Novel Software Technology, Nanjing, China
[2] Texas A&M University, Department of Computer Science & Engineering, College Station, TX, USA
[3] University of Iowa, Department of Computer Science, Iowa City, IA, USA
[4] 4Paradigm Inc., Beijing, China
Keywords
Stochastic optimization; NDCG; Non-convex optimization; Information retrieval; Algorithms
DOI
10.1007/s10994-024-06631-x
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
In this paper, we introduce principled stochastic algorithms to efficiently optimize the Normalized Discounted Cumulative Gain (NDCG) and its top-K variant for deep models. To this end, we first propose novel compositional and bilevel compositional objectives for optimizing NDCG and top-K NDCG, respectively. We then develop two stochastic algorithms to tackle these non-convex objectives, achieving an iteration complexity of $\mathcal{O}(\epsilon^{-4})$ for reaching an $\epsilon$-stationary point. Our methods employ moving-average estimators to track the crucial inner functions for gradient computation, effectively reducing approximation errors. Besides, we introduce practical strategies such as initial warm-up and stop-gradient techniques to enhance performance in deep learning. Despite these advancements, the iteration complexity of these two algorithms does not meet the optimal $\mathcal{O}(\epsilon^{-3})$ rate for smooth non-convex optimization. To address this issue, we incorporate variance-reduction techniques into our framework to estimate the key functions more finely, design new algorithmic mechanisms for solving multiple lower-level problems with parallel speed-up, and propose two types of algorithms. The first type directly tracks these functions with variance-reduced estimators, while the second treats these functions as solutions to minimization problems and employs variance-reduced estimators to construct gradient estimators for solving those problems. We establish the optimal $\mathcal{O}(\epsilon^{-3})$ complexity for both types of algorithms. It is important to highlight that our algorithmic frameworks are versatile and can optimize a wide spectrum of metrics, including Precision@K/Recall@K, Average Precision (AP), mean Average Precision (mAP), and their top-K variants. We further present efficient stochastic algorithms for optimizing these metrics with convergence guarantees. We conduct comprehensive experiments on multiple ranking tasks to verify the effectiveness of our proposed algorithms, which consistently surpass existing strong baselines.
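To make the moving-average idea concrete, the following is a minimal PyTorch-style sketch of one stochastic update for a compositional NDCG surrogate. It assumes a sigmoid-based smooth rank surrogate as the inner function and tracks it with a simple exponential moving average (with stop-gradient) before applying the outer DCG gradient. The function names (smooth_rank, ndcg_surrogate_step), the per-query mini-batch simplification, and the specific tracker update shown are illustrative assumptions, not the paper's exact algorithm.

```python
import math
import torch

def smooth_rank(scores, tau=1.0):
    # Smooth surrogate of each item's rank: rank_i ~ 1 + sum_{j != i} sigmoid((s_j - s_i) / tau).
    diff = (scores.unsqueeze(0) - scores.unsqueeze(1)) / tau  # diff[i, j] = (s_j - s_i) / tau
    pairwise = torch.sigmoid(diff)
    return 1.0 + pairwise.sum(dim=1) - pairwise.diagonal()    # drop the self-comparison term

def ndcg_surrogate_step(model, optimizer, items, gains, ideal_dcg, u, gamma=0.1):
    # One stochastic update on a mini-batch of items from a single query.
    # u: moving-average tracker of each batch item's smooth rank (the inner function).
    scores = model(items).squeeze(-1)
    g_hat = smooth_rank(scores)

    # Moving-average tracking of the inner function; stop-gradient keeps the tracker out of the graph.
    u = (1.0 - gamma) * u + gamma * g_hat.detach()

    # Gradient (w.r.t. u) of the negative outer function f(u) = gain / (Z * log2(1 + u)).
    ln2 = math.log(2.0)
    outer_grad = gains / (ideal_dcg * (1.0 + u) * ln2 * torch.log2(1.0 + u) ** 2)

    # Surrogate loss whose gradient w.r.t. the model weights is outer_grad * d g_hat / d w.
    loss = (outer_grad.detach() * g_hat).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return u
```

In the paper's actual algorithms the tracker is maintained for every (query, item) pair across iterations and updated from sampled items, and the variance-reduced variants that attain the $\mathcal{O}(\epsilon^{-3})$ rate replace this plain moving average with recursive (STORM-style) estimators; the sketch above only illustrates the compositional structure and the role of the inner-function tracker.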
Pages: 70