Let the LLMs Talk: Simulating Human-to-Human Conversational QA via Zero-Shot LLM-to-LLM Interactions

Cited by: 15
Authors
Abbasiantaeb, Zahra [1 ]
Yuan, Yifei [2 ]
Kanoulas, Evangelos [1 ]
Aliannejadi, Mohammad [1 ]
Affiliations
[1] Univ Amsterdam, Amsterdam, Netherlands
[2] Chinese Univ Hong Kong, Hong Kong, Peoples R China
Source
PROCEEDINGS OF THE 17TH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING, WSDM 2024 | 2024
Keywords
Dialogue simulation; Conversational question answering; Large language models;
DOI
10.1145/3616855.3635856
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Conversational question-answering (CQA) systems aim to create interactive search systems that retrieve information effectively through dialogue with users. To replicate human-to-human conversations, existing work uses human annotators to play the roles of the questioner (student) and the answerer (teacher). Despite its effectiveness, this approach faces challenges: human annotation is time-consuming, inconsistent, and not scalable. To address these issues and investigate the applicability of large language models (LLMs) to CQA simulation, we propose a simulation framework that employs zero-shot learner LLMs to simulate teacher-student interactions. Our framework involves two LLMs interacting on a given topic: the first LLM acts as a student, generating questions that explore the topic, while the second plays the role of a teacher, answering those questions with the help of additional information, including a text on the topic. We implement both the student and the teacher by zero-shot prompting the GPT-4 model. To assess the effectiveness of LLMs in simulating CQA interactions and to understand the disparities between LLM- and human-generated conversations, we evaluate the simulated data from various perspectives. We begin by evaluating the teacher's performance through both automatic and human assessment. Next, we evaluate the student's performance, analyzing and comparing the disparities between questions generated by the LLM and those generated by humans. Furthermore, we conduct extensive analyses, benchmarking state-of-the-art reading comprehension models on both datasets. Our results reveal that the teacher LLM generates lengthier answers that tend to be more accurate and complete, while the student LLM generates more diverse questions, covering more aspects of a given topic.
Pages: 8-17
Page count: 10
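
To make the framework described in the abstract concrete, below is a minimal sketch of the simulated student-teacher loop, assuming the OpenAI chat-completions API as the GPT-4 backend. The prompt wording, turn count, and helper names (`ask`, `simulate`) are illustrative assumptions, not the authors' released prompts or code.

```python
# A hypothetical sketch of the zero-shot teacher-student CQA simulation.
# Prompts and turn structure are illustrative placeholders, not the paper's.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def ask(system_prompt: str, history: list[str]) -> str:
    """Query GPT-4 under a role prompt, given the dialogue so far."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "\n".join(history) or "(begin)"},
        ],
    )
    return resp.choices[0].message.content


def simulate(topic: str, passage: str, turns: int = 5) -> list[str]:
    """Alternate zero-shot student questions and grounded teacher answers."""
    student_sys = (
        f"You are a student exploring the topic '{topic}'. "
        "Ask one short question about an aspect not yet covered."
    )
    teacher_sys = (
        "You are a teacher. Answer the student's last question using "
        f"only this passage:\n{passage}"
    )
    history: list[str] = []
    for _ in range(turns):
        # The student sees the transcript so far and asks the next question;
        # the teacher then answers it, grounded in the given passage.
        history.append("Student: " + ask(student_sys, history))
        history.append("Teacher: " + ask(teacher_sys, history))
    return history
```

Keeping the two roles as separate system prompts, with the teacher alone receiving the grounding text, mirrors the information asymmetry the paper describes between questioner and answerer.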