Robust-EQA: Robust Learning for Embodied Question Answering With Noisy Labels

Cited by: 10

Authors:

Luo, Haonan ^{[1]}

Lin, Guosheng ^{[2]}

Shen, Fumin

Huang, Xingguo ^{[3]}

Yao, Yazhou ^{[4]}

Shen, Hengtao

Affiliations:

[1] Southwest Jiaotong Univ, Sch Comp & Artificial Intelligence, Chengdu 611756, Peoples R China

[2] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore 639798, Singapore

[3] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 610054, Peoples R China

[4] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing 210094, Peoples R China

Source:

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS | 2024 / Vol. 35 / No. 09

Funding:

China Postdoctoral Science Foundation; National Research Foundation, Singapore;

Keywords:

Task analysis; Noise measurement; Visualization; Navigation; Trajectory; Question answering (information retrieval); Noise robustness; Embodied question answering (EQA); label noise; navigation; reinforcement learning; visual question answering (VQA);

DOI:

10.1109/TNNLS.2023.3251984

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Embodied question answering (EQA) is a recently emerged research field in which an agent is asked to answer the user's questions by exploring the environment and collecting visual information. Many researchers have turned their attention to EQA because of its broad potential applications, such as in-home robots, self-driven mobile, and personal assistants. High-level visual tasks such as EQA are susceptible to noisy inputs because they involve complex reasoning processes. Before advances in EQA can be carried over to practical applications, models must be made robust against label noise. To tackle this problem, we propose a novel label noise-robust learning algorithm for the EQA task. First, a joint training co-regularization noise-robust learning method is proposed for noise filtering in the visual question answering (VQA) module, which trains two parallel network branches with one loss function. Then, a two-stage hierarchical robust learning algorithm is proposed to filter out noisy navigation labels at both the trajectory level and the action level. Finally, taking the purified labels as inputs, a joint robust learning mechanism is given to coordinate the work of the whole EQA system. Empirical results demonstrate that, under both extremely noisy environments (45% noisy labels) and low-level noisy environments (20% noisy labels), deep learning models trained by our algorithm are more robust than existing EQA models.


Pages: 12083-12094

Page count: 12
