Evaluating Large Language Models on Controlled Generation Tasks

Cited by: 0

Authors
Sun, Jiao [1 ]
Tian, Yufei [2 ]
Zhou, Wangchunshu [3 ]
Xu, Nan [1 ]
Hu, Qian [4 ]
Gupta, Rahul [4 ]
Wieting, John [5 ]
Peng, Nanyun [2 ]
Ma, Xuezhe [1 ]
Affiliations
[1] Univ Southern Calif, Los Angeles, CA 90007 USA
[2] Univ Calif Los Angeles, Los Angeles, CA 90024 USA
[3] Swiss Fed Inst Technol, Zurich, Switzerland
[4] Amazon, Seattle, WA USA
[5] Google DeepMind, London, England
Source
2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023 | 2023
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
While recent studies have looked into the abilities of large language models in various benchmark tasks, few have examined their controllability on generation tasks. We present a systematic and extensive analysis of the controllability of large language models on ten benchmarks, including a new simple yet challenging numerical planning benchmark with different granularities. After comparing large language models against state-of-the-art finetuned smaller models, we present a spectrum showing when large language models fall behind, are comparable to, or exceed the ability of smaller models. We conclude that large language models struggle to meet fine-grained hard constraints.
Pages: 3155-3168 (14 pages)
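
To make the setup concrete, below is a minimal sketch of a hard-constraint checker for the numerical planning task described in the abstract, assuming the task asks a model to produce text with an exact count of units (e.g., words or sentences) at a chosen granularity. The helper names (count_units, satisfies_constraint) and the tokenization rules are illustrative assumptions, not the paper's actual evaluation code.

import re

def count_units(text: str, granularity: str) -> int:
    # Hypothetical counting rules; the paper's exact tokenization
    # is not reproduced here.
    if granularity == "word":
        return len(re.findall(r"[A-Za-z']+", text))
    if granularity == "sentence":
        return len([s for s in re.split(r"[.!?]+", text) if s.strip()])
    raise ValueError(f"unsupported granularity: {granularity}")

def satisfies_constraint(text: str, target: int, granularity: str = "word") -> bool:
    # A hard constraint is met only when the count matches the target exactly.
    return count_units(text, granularity) == target

# Example: check a model output against "generate exactly 5 words".
output = "Large language models often miscount."
print(satisfies_constraint(output, target=5))  # True

The fraction of generations that pass such an exact check is a natural success metric for the "fine-grained hard constraints" the abstract says large language models struggle to meet.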