3D Question Answering

Cited by: 2
Authors
Ye, Shuquan [1 ]
Chen, Dongdong [2 ]
Han, Songfang [3 ]
Liao, Jing [1 ]
Affiliations
[1] City Univ Hong Kong, Kowloon Tong, Hong Kong, Peoples R China
[2] Microsoft Cloud AI, Redmond, WA 98052 USA
[3] Univ Calif San Diego, La Jolla, CA 92093 USA
Keywords
Point cloud; scene understanding; LANGUAGE; VISION;
DOI
10.1109/TVCG.2022.3225327
CLC Classification Number
TP31 [Computer Software];
Discipline Classification Codes
081202 ; 0835 ;
Abstract
Visual question answering (VQA) has experienced tremendous progress in recent years. However, most efforts have only focused on 2D image question-answering tasks. In this article, we extend VQA to its 3D counterpart, 3D question answering (3DQA), which can facilitate a machine's perception of 3D real-world scenarios. Unlike 2D image VQA, 3DQA takes the color point cloud as input and requires both appearance and 3D geometrical comprehension to answer the 3D-related questions. To this end, we propose a novel transformer-based 3DQA framework "3DQA-TR", which consists of two encoders to exploit the appearance and geometry information, respectively. Finally, the multi-modal information about the appearance, geometry, and linguistic question can attend to each other via a 3D-Linguistic BERT to predict the target answers. To verify the effectiveness of our proposed 3DQA framework, we further develop the first 3DQA dataset "ScanQA", which builds on the ScanNet dataset and contains over 10K question-answer pairs for 806 scenes. To the best of our knowledge, ScanQA is the first large-scale dataset with natural-language questions and free-form answers in 3D environments that is fully human-annotated. We also use several visualizations and experiments to investigate the astonishing diversity of the collected questions and the significant differences between this task and both 2D VQA and 3D captioning. Extensive experiments on this dataset demonstrate the obvious superiority of our proposed 3DQA framework over state-of-the-art VQA frameworks and the effectiveness of our major designs. Our code and dataset will be made publicly available to facilitate research in this direction. The code and data are available at http://shuquanye.com/3DQA_website/.
Pages: 1772 - 1786
Page count: 15
Related Papers
50 records in total
  • [31] Flexible Sentence Analysis Model for Visual Question Answering Network
    Deng, Wei
    Wang, Jianming
    Wang, Shengbei
    Jin, Guanghao
    2018 2ND INTERNATIONAL CONFERENCE ON BIOMEDICAL ENGINEERING AND BIOINFORMATICS (ICBEB 2018), 2018, : 89 - 95
  • [32] Indexing UMLS Semantic Types for Medical Question-Answering
    Delbecque, Thierry
    Jacquemart, Pierre
    Zweigenbaum, Pierre
    CONNECTING MEDICAL INFORMATICS AND BIO-INFORMATICS, 2005, 116 : 805 - 810
  • [33] COIN: Counterfactual Image Generation for Visual Question Answering Interpretation
    Boukhers, Zeyd
    Hartmann, Timo
    Juerjens, Jan
    SENSORS, 2022, 22 (06)
  • [34] Exploring and exploiting model uncertainty for robust visual question answering
    Zhang, Xuesong
    He, Jun
    Zhao, Jia
    Hu, Zhenzhen
    Yang, Xun
    Li, Jia
    Hong, Richang
    MULTIMEDIA SYSTEMS, 2024, 30 (06)
  • [35] Medical visual question answering based on question-type reasoning and semantic space constraint
    Wang, Meiling
    He, Xiaohai
    Liu, Luping
    Qing, Linbo
    Chen, Honggang
    Liu, Yan
    Ren, Chao
    ARTIFICIAL INTELLIGENCE IN MEDICINE, 2022, 131
  • [36] Rapid and accurate phase matching for 3D measurement
    Huang, Haiqing
    Fang, Xiangzhong
    Li, Xiangyang
    Lu, Qingchun
    Ren, Hang
    2013 10TH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY (FSKD), 2013, : 1096 - 1100
  • [37] QUARE: towards a question-answering model for requirements elicitation
    Gallego, Johnathan Mauricio Calle
    Jaramillo, Carlos Mario Zapata
    AUTOMATED SOFTWARE ENGINEERING, 2023, 30 (02)
  • [38] A Survey of Multi-modal Question Answering Systems for Robotics
    Liu, Xiaomeng
    Long, Fei
    2017 2ND INTERNATIONAL CONFERENCE ON ADVANCED ROBOTICS AND MECHATRONICS (ICARM), 2017, : 189 - 194
  • [39] Contrastive Video Question Answering via Video Graph Transformer
    Xiao, Junbin
    Zhou, Pan
    Yao, Angela
    Li, Yicong
    Hong, Richang
    Yan, Shuicheng
    Chua, Tat-Seng
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (11) : 13265 - 13280
  • [40] Dual-Key Multimodal Backdoors for Visual Question Answering
    Walmer, Matthew
    Sikka, Karan
    Sur, Indranil
    Shrivastava, Abhinav
    Jha, Susmit
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 15354 - 15364