Automatic Online Evaluation of Intelligent Assistants

Cited by: 88
Authors
Jiang, Jiepu [1 ]
Awadallah, Ahmed Hassan [2 ]
Jones, Rosie [2 ]
Ozertem, Umut [2 ]
Zitouni, Imed [2 ]
Kulkarni, Ranjitha Gurunath [2 ]
Khan, Omar Zia [2 ]
Affiliations
[1] Univ Massachusetts, Ctr Intelligent Informat Retrieval, Amherst, MA 01003 USA
[2] Microsoft, Redmond, WA USA
Source
PROCEEDINGS OF THE 24TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB (WWW 2015) | 2015
Keywords
Voice-activated intelligent assistant; evaluation; user experience; mobile search; spoken dialog system;
DOI
10.1145/2736277.2741669
Chinese Library Classification (CLC)
TP [automation technology, computer technology];
Subject classification code
0812;
Abstract
Voice-activated intelligent assistants, such as Siri, Google Now, and Cortana, are prevalent on mobile devices. However, it is challenging to evaluate them due to the varied and evolving number of tasks supported, e.g., voice command, web search, and chat. Since each task may have its own procedure and a unique form of correct answers, it is expensive to evaluate each task individually. This paper is the first attempt to solve this challenge. We develop consistent and automatic approaches that can evaluate different tasks in voice-activated intelligent assistants. We use implicit feedback from users to predict whether users are satisfied with the intelligent assistant as well as its components, i.e., speech recognition and intent classification. Using this approach, we can potentially evaluate and compare different tasks within and across intelligent assistants according to the predicted user satisfaction rates. Our approach is characterized by an automatic scheme of categorizing user-system interaction into task-independent dialog actions, e.g., the user is commanding, selecting, or confirming an action. We use the action sequence in a session to predict user satisfaction and the quality of speech recognition and intent classification. We also incorporate other features to further improve our approach, including features derived from previous work on web search satisfaction prediction, and those utilizing acoustic characteristics of voice requests. We evaluate our approach using data collected from a user study. Results show our approach can accurately identify satisfactory and unsatisfactory sessions.
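The abstract describes the approach only at a high level. The sketch below illustrates the general idea of predicting session-level satisfaction from task-independent dialog-action sequences; it is a minimal illustration, not the authors' implementation, and the action labels, example sessions, n-gram featurization, and choice of gradient-boosted classifier are all assumptions made for the example.

```python
# Minimal sketch (not the paper's implementation): predict whether a session
# is satisfactory from its sequence of task-independent dialog actions.
# Action labels and sessions below are hypothetical placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline

# Each session is a sequence of dialog actions, independent of the task
# (voice command, web search, chat, ...).
sessions = [
    "command confirm",                   # user commands, system confirms
    "search select",                     # user searches, selects a result
    "search repeat repeat abandon",      # repeated reformulation, then abandonment
    "command repeat command abandon",
]
satisfied = [1, 1, 0, 0]  # SAT = 1, DSAT = 0 (labels from a user study)

# Encode action unigrams/bigrams as features and fit a boosted-tree classifier.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), token_pattern=r"\S+"),
    GradientBoostingClassifier(n_estimators=50, random_state=0),
)
model.fit(sessions, satisfied)

# Predicted satisfaction probability for an unseen session's action sequence.
print(model.predict_proba(["search select confirm"])[0, 1])
```

Encoding short action subsequences as features reflects the intuition stated in the abstract: patterns such as repeated reformulation followed by abandonment can signal dissatisfaction regardless of the underlying task, which is what makes the evaluation scheme task-independent.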
Pages: 506-516
Number of pages: 11