Evaluating Large Language Models on Controlled Generation Tasks

Cited by: 0

Authors
Sun, Jiao [1 ]
Tian, Yufei [2 ]
Zhou, Wangchunshu [3 ]
Xu, Nan [1 ]
Hu, Qian [4 ]
Gupta, Rahul [4 ]
Wieting, John [5 ]
Peng, Nanyun [2 ]
Ma, Xuezhe [1 ]
Affiliations
[1] Univ Southern Calif, Los Angeles, CA 90007 USA
[2] Univ Calif Los Angeles, Los Angeles, CA 90024 USA
[3] Swiss Fed Inst Technol, Zurich, Switzerland
[4] Amazon, Seattle, WA USA
[5] Google DeepMind, London, England
Source
2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023 | 2023
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
While recent studies have looked into the abilities of large language models in various benchmark tasks, few have examined their controllability on generation tasks. We present a systematic and extensive analysis of the controllability of large language models on ten benchmarks, including a new simple yet challenging numerical planning benchmark with different granularities. After comparing large language models against state-of-the-art finetuned smaller models, we present a spectrum showing when large language models fall behind, are comparable to, or exceed the ability of smaller models. We conclude that large language models struggle to meet fine-grained hard constraints.
Pages: 3155-3168 (14 pages)
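
To make the setup concrete, below is a minimal sketch of a hard-constraint checker for the numerical planning task described in the abstract, assuming the task asks a model to produce text with an exact count of units (e.g., words or sentences) at a chosen granularity. The helper names (count_units, satisfies_constraint) and the tokenization rules are illustrative assumptions, not the paper's actual evaluation code.

import re

def count_units(text: str, granularity: str) -> int:
    # Hypothetical counting rules; the paper's exact tokenization
    # is not reproduced here.
    if granularity == "word":
        return len(re.findall(r"[A-Za-z']+", text))
    if granularity == "sentence":
        return len([s for s in re.split(r"[.!?]+", text) if s.strip()])
    raise ValueError(f"unsupported granularity: {granularity}")

def satisfies_constraint(text: str, target: int, granularity: str = "word") -> bool:
    # A hard constraint is met only when the count matches the target exactly.
    return count_units(text, granularity) == target

# Example: check a model output against "generate exactly 5 words".
output = "Large language models often miscount."
print(satisfies_constraint(output, target=5))  # True

The fraction of generations that pass such an exact check is a natural success metric for the "fine-grained hard constraints" the abstract says large language models struggle to meet.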