BiasFinder: Metamorphic Test Generation to Uncover Bias for Sentiment Analysis Systems

Cited: 27
Authors
Asyrofi, Muhammad Hilmi [1 ]
Yang, Zhou [1 ]
Yusuf, Imam Nur Bani [1 ]
Kang, Hong Jin [1 ]
Thung, Ferdian [1 ]
Lo, David [1 ]
Affiliation
[1] Singapore Management Univ, Sch Comp & Informat Syst, Singapore 188065, Singapore
Keywords
Sentiment analysis; test case generation; metamorphic testing; bias; fairness bug
DOI
10.1109/TSE.2021.3136169
Chinese Library Classification
TP31 [Computer Software]
Discipline codes
081202; 0835
Abstract
Artificial intelligence systems, such as Sentiment Analysis (SA) systems, typically learn from large amounts of data that may reflect human bias. Consequently, such systems may exhibit unintended demographic bias against specific characteristics (e.g., gender, occupation, or country of origin). Such bias manifests in an SA system when it predicts different sentiments for similar texts that differ only in the characteristic of the individuals described. To automatically uncover bias in SA systems, this paper presents BiasFinder, an approach that can discover biased predictions in SA systems via metamorphic testing. A key feature of BiasFinder is the automatic curation of suitable templates from any given text inputs, using various Natural Language Processing (NLP) techniques to identify words that describe demographic characteristics. Next, BiasFinder generates new texts from these templates by mutating words associated with a class of a characteristic (e.g., gender-specific words such as female names, "she", "her"). These texts are then used to tease out bias in an SA system. BiasFinder identifies a bias-uncovering test case (BTC) when an SA system predicts different sentiments for texts that differ only in words associated with a different class (e.g., male vs. female) of a target characteristic (e.g., gender). We evaluate BiasFinder on 10 SA systems and two large-scale datasets, and the results show that BiasFinder can create more BTCs than two popular baselines. We also conduct an annotation study and find that human annotators consistently judge the test cases generated by BiasFinder to be more fluent than those of the two baselines.
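The metamorphic relation described in the abstract can be illustrated with a minimal sketch. Everything here is hypothetical: `predict_sentiment` is a deliberately biased toy stand-in for a real SA system, and `GENDER_MUTATIONS` is a tiny subset of the word lists BiasFinder would curate with NLP techniques.

```python
# Toy subset of gender-associated word pairs (hypothetical; BiasFinder
# curates such lists automatically using NLP techniques).
GENDER_MUTATIONS = {"she": "he", "her": "his", "emma": "liam"}


def mutate_gender(text: str) -> str:
    """Swap gender-associated words to produce a mutant text that
    differs from the original only in the target characteristic."""
    tokens = text.lower().split()
    return " ".join(GENDER_MUTATIONS.get(tok, tok) for tok in tokens)


def predict_sentiment(text: str) -> str:
    """Hypothetical, deliberately biased SA system used for the demo:
    it flips to 'negative' whenever a female-associated word appears."""
    sentiment = "positive" if "loved" in text else "negative"
    if any(w in text.split() for w in ("she", "her", "emma")):
        sentiment = "negative"
    return sentiment


def is_btc(original: str, mutant: str) -> bool:
    """A bias-uncovering test case (BTC) is found when predictions differ
    for texts that differ only in the gender-associated words."""
    return predict_sentiment(original) != predict_sentiment(mutant)


original = "emma said she loved the film"
mutant = mutate_gender(original)  # "liam said he loved the film"
print(is_btc(original, mutant))   # True: the toy system is biased
```

The sketch shows only the core metamorphic check; the paper's contribution lies in automatically curating fluent templates from arbitrary input texts before applying such mutations.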
Pages: 5087 - 5101
Page count: 15