Let the LLMs Talk: Simulating Human-to-Human Conversational QA via Zero-Shot LLM-to-LLM Interactions

Cited by: 15
Authors
Abbasiantaeb, Zahra [1 ]
Yuan, Yifei [2 ]
Kanoulas, Evangelos [1 ]
Aliannejadi, Mohammad [1 ]
Affiliations
[1] Univ Amsterdam, Amsterdam, Netherlands
[2] Chinese Univ Hong Kong, Hong Kong, Peoples R China
Source
PROCEEDINGS OF THE 17TH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING, WSDM 2024 | 2024
Keywords
Dialogue simulation; Conversational question answering; Large language models
DOI
10.1145/3616855.3635856
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Conversational question-answering (CQA) systems aim to create interactive search systems that effectively retrieve information by interacting with users. To replicate human-to-human conversations, existing work uses human annotators to play the roles of the questioner (student) and the answerer (teacher). Despite its effectiveness, human annotation is time-consuming, inconsistent, and not scalable. To address this issue and investigate the applicability of large language models (LLMs) in CQA simulation, we propose a simulation framework that employs zero-shot learner LLMs to simulate teacher-student interactions. Our framework involves two LLMs interacting on a specific topic: the first LLM acts as a student, generating questions to explore a given search topic, while the second LLM plays the role of a teacher, answering questions with the help of additional information, including a text on the given topic. We implement both the student and the teacher by zero-shot prompting the GPT-4 model. To assess the effectiveness of LLMs in simulating CQA interactions and to understand the disparities between LLM- and human-generated conversations, we evaluate the simulated data from various perspectives. We begin by evaluating the teacher's performance through both automatic and human assessment. Next, we evaluate the performance of the student, analyzing and comparing the disparities between questions generated by the LLM and those generated by humans. Furthermore, we conduct extensive analyses to examine LLM performance thoroughly by benchmarking state-of-the-art reading comprehension models on both datasets. Our results reveal that the teacher LLM generates lengthier answers that tend to be more accurate and complete. The student LLM generates more diverse questions, covering more aspects of a given topic.
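The framework described in the abstract, two zero-shot-prompted GPT-4 instances alternating as student and teacher, can be pictured as a simple turn-taking loop. The Python sketch below illustrates this under stated assumptions: it uses the OpenAI Python client, and the prompt wording, model identifier, turn count, and helper names (chat, simulate) are illustrative placeholders rather than the authors' actual implementation.

    # Minimal sketch of the zero-shot teacher-student CQA simulation loop.
    # Assumes the OpenAI Python client; prompts, model name, and turn count
    # are placeholders, not the paper's exact setup.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    MODEL = "gpt-4"    # assumed model identifier

    def chat(system_prompt: str, history: list[dict]) -> str:
        """Send a zero-shot system prompt plus the running dialogue to the model."""
        response = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "system", "content": system_prompt}] + history,
        )
        return response.choices[0].message.content

    def simulate(topic: str, topic_text: str, num_turns: int = 5) -> list[tuple[str, str]]:
        student_prompt = (
            f"You are a curious student exploring the topic '{topic}'. "
            "Ask one concise question at a time."
        )
        teacher_prompt = (
            f"You are a teacher. Answer questions about '{topic}' "
            f"using only the following text:\n{topic_text}"
        )
        dialogue: list[tuple[str, str]] = []
        student_history: list[dict] = []
        teacher_history: list[dict] = []
        for _ in range(num_turns):
            # Student LLM generates the next question given the conversation so far.
            question = chat(
                student_prompt,
                student_history + [{"role": "user", "content": "Ask your next question."}],
            )
            # Teacher LLM answers, grounded in the provided topic text.
            teacher_history.append({"role": "user", "content": question})
            answer = chat(teacher_prompt, teacher_history)
            teacher_history.append({"role": "assistant", "content": answer})
            # From the student's perspective, its questions are assistant turns
            # and the teacher's answers are user turns.
            student_history += [
                {"role": "assistant", "content": question},
                {"role": "user", "content": answer},
            ]
            dialogue.append((question, answer))
        return dialogue

A call such as simulate("Roman aqueducts", passage_text) would yield a list of (question, answer) pairs, i.e. one simulated CQA conversation grounded in the supplied passage.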
Pages: 8-17
Page count: 10