Probing the link between vision and language in material perception using psychophysics and unsupervised learning

Cited: 0
Authors
Liao, Chenxi [1 ]
Sawayama, Masataka [2 ]
Xiao, Bei [3 ]
Affiliations
[1] Amer Univ, Dept Neurosci, Washington, DC 20016 USA
[2] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo, Japan
[3] Amer Univ, Dept Comp Sci, Washington, DC USA
Funding
U.S. National Institutes of Health
Keywords
REPRESENTATION; COLOR; GLOSS;
DOI
10.1371/journal.pcbi.1012481
Chinese Library Classification (CLC)
Q5 [Biochemistry]
Subject Classification Codes
071010; 081704
Abstract
We can visually discriminate and recognize a wide range of materials. Meanwhile, we use language to describe what we see and communicate relevant information about the materials. Here, we investigate the relationship between visual judgment and language expression to understand how visual features relate to semantic representations in human cognition. We use deep generative models to generate images of realistic materials. Interpolating between the generative models enables us to systematically create material appearances in both well-defined and ambiguous categories. Using these stimuli, we compared the representations of materials from two behavioral tasks: visual material similarity judgments and free-form verbal descriptions. Our findings reveal a moderate but significant correlation between vision and language on a categorical level. However, analyzing the representations with an unsupervised alignment method, we discover structural differences that arise at the image-to-image level, especially among ambiguous materials morphed between known categories. Moreover, visual judgments exhibit more individual differences compared to verbal descriptions. Our results show that while verbal descriptions capture material qualities on the coarse level, they may not fully convey the visual nuances of material appearances. Analyzing the image representation of materials obtained from various pre-trained deep neural networks, we find that similarity structures in human visual judgments align more closely with those of the vision-language models than purely vision-based models. Our work illustrates the need to consider the vision-language relationship in building a comprehensive model for material perception. Moreover, we propose a novel framework for evaluating the alignment and misalignment between representations from different modalities, leveraging information from human behaviors and computational models.
Author summary
Materials are building blocks of our environment, granting access to a wide array of visual experiences. The immense diversity, complexity, and versatility of materials present challenges in verbal articulation. To what extent can words convey the richness of visual material perception? What are the salient attributes for communicating about materials? We address these questions by measuring both visual material similarity judgments and free-form verbal descriptions. We use AI models to create a diverse array of plausible visual appearances of familiar and unfamiliar materials. Our findings reveal a moderate vision-language correlation within individual participants, yet a notable discrepancy persists between the two modalities. While verbal descriptions capture material qualities at a coarse categorical level, precise alignment between vision and language at the individual stimulus level is still lacking. These results highlight that visual representations of materials are richer than verbalized semantic features, underscoring the differential roles of language and vision in perception. Lastly, we discover that deep neural networks pre-trained on large-scale datasets can predict human visual similarities at a coarse level, suggesting the general visual representations learned by these networks carry perceptually relevant information for material-relevant tasks.
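The abstract describes two quantitative steps that can be sketched in code: correlating the similarity structures obtained from visual judgments and verbal descriptions (a representational-similarity-style comparison), and aligning the two representations without using stimulus labels. The sketch below is a minimal illustration under stated assumptions: the vision_emb and language_emb arrays are synthetic placeholders for per-stimulus behavioral embeddings, and Gromov-Wasserstein optimal transport (via the POT library) stands in for the unspecified unsupervised alignment method; the authors' actual pipeline may differ.

# Minimal sketch (not the authors' code): compare the similarity structures from a
# "vision" task and a "language" task, then attempt an unsupervised alignment.
# All data are synthetic placeholders; Gromov-Wasserstein optimal transport is
# used here as one typical choice of unsupervised alignment method.
import numpy as np
import ot  # POT: pip install pot
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_stimuli = 36  # e.g., images morphed within and between material categories

# Hypothetical per-stimulus embeddings derived from the two behavioral tasks.
vision_emb = rng.normal(size=(n_stimuli, 8))
language_emb = vision_emb + 0.8 * rng.normal(size=(n_stimuli, 8))  # partly shared structure

# Dissimilarity matrices, one per modality.
d_vision = squareform(pdist(vision_emb))
d_language = squareform(pdist(language_emb))

# Coarse, RSA-style comparison: rank-correlate the off-diagonal dissimilarities.
iu = np.triu_indices(n_stimuli, k=1)
rho, pval = spearmanr(d_vision[iu], d_language[iu])
print(f"vision-language RDM correlation: rho = {rho:.2f} (p = {pval:.3g})")

# Unsupervised alignment: find a stimulus-to-stimulus coupling that matches the
# two dissimilarity structures without using any labels (Gromov-Wasserstein OT).
uniform = np.full(n_stimuli, 1.0 / n_stimuli)
coupling = ot.gromov.gromov_wasserstein(d_vision, d_language, uniform, uniform,
                                        loss_fun="square_loss")

# Fraction of stimuli mapped back to themselves: high values indicate agreement
# at the image-to-image level, not just at the categorical level.
matching_rate = np.mean(coupling.argmax(axis=1) == np.arange(n_stimuli))
print(f"one-to-one matching rate: {matching_rate:.2f}")

In the paper's framing, a sizeable RDM correlation combined with a low one-to-one matching rate would correspond to the reported pattern: vision and language agree at the categorical level but are misaligned at the level of individual, especially morphed or ambiguous, stimuli.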
Pages: 23
Related papers (50 in total)
  • [21] Narrowing the Gap between Supervised and Unsupervised Sentence Representation Learning with Large Language Model
    Li, Mingxin
    Zhang, Richong
    Nie, Zhijie
    Mao, Yongyi
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 12, 2024, : 13590 - 13599
  • [22] Detecting Logos for Indoor Environmental Perception Using Unsupervised and Few-Shot Learning
    Yin, Changjiang
    Ye, Qin
    Zhang, Shaoming
    Yang, Zexin
    ELECTRONICS, 2024, 13 (12)
  • [23] Learning From Expert: Vision-Language Knowledge Distillation for Unsupervised Cross-Modal Hashing Retrieval
    Sun, Lina
    Li, Yewen
    Dong, Yumin
    PROCEEDINGS OF THE 2023 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2023, 2023, : 499 - 507
  • [24] UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with Vision-Language Models
    Li, Xin
    Behpour, Sima
    Doan, Thang
    He, Wenbin
    Gou, Liang
    Ren, Liu
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [25] American Sign Language Recognition using Deep Learning and Computer Vision
    Bantupalli, Kshitij
    Xie, Ying
    2018 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2018, : 4896 - 4899
  • [26] Learning the Visualness of Text Using Large Vision-Language Models
    Verma, Gaurav
    Rossi, Ryan A.
    Tensmeyer, Christopher
    Gu, Jiuxiang
    Nenkova, Ani
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 2394 - 2408
  • [27] Cross-modal learning for material perception using deep extreme learning machine
    Zheng, Wendong
    Liu, Huaping
    Wang, Bowen
    Sun, Fuchun
    INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2020, 11 (04) : 813 - 823
  • [29] BridgeTower: Building Bridges between Encoders in Vision-Language Representation Learning
    Xu, Xiao
    Wu, Chenfei
    Rosenman, Shachar
    Lal, Vasudev
    Che, Wanxiang
    Duan, Nan
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 9, 2023, : 10637 - 10647
  • [30] Active Clothing Material Perception using Tactile Sensing and Deep Learning
    Yuan, Wenzhen
    Mo, Yuchen
    Wang, Shaoxiong
    Adelson, Edward H.
    2018 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), 2018, : 4842 - 4849