Audio Representation Learning by Distilling Video as Privileged Information

Cited by: 2
Authors
Hajavi A. [1]
Etemad A. [1]
Affiliation
[1] Queen's University at Kingston, Department of Electrical and Computer Engineering, Kingston, K7L 3N6, ON
Keywords
Audiovisual representation learning; deep learning; knowledge distillation; learning using privileged information (LUPI); multimodal data
DOI
10.1109/TAI.2023.3243596
Abstract
Deep audio representation learning using multimodal audiovisual data often leads to better performance than unimodal approaches. However, in real-world scenarios, the two modalities are not always both available at inference time, which degrades the performance of models trained for multimodal inference. In this article, we propose a novel approach for deep audio representation learning using audiovisual data when the video modality is absent at inference. For this purpose, we adopt teacher-student knowledge distillation under the framework of learning using privileged information (LUPI). Whereas previous LUPI methods use soft labels generated by the teacher, our method uses embeddings learned by the teacher to train the student network. We integrate our method in two settings: sequential data, where the features are divided into multiple segments over time, and nonsequential data, where the features are treated as a single segment. In the nonsequential setting, both the teacher and student networks consist of an encoder component and a task header. We use the embeddings produced by the teacher's encoder to train the student's encoder, while the student's task header is trained using ground-truth labels. In the sequential setting, the networks have an additional aggregation component placed between the encoder and the task header. We use two sets of embeddings, produced by the teacher's encoder and aggregation component, to train the student. As in the nonsequential setting, the student's task header is trained using ground-truth labels. We test our framework on two audiovisual tasks, namely, speaker recognition and speech emotion recognition. Through these experiments, we show that by treating the video modality as privileged information for the main goal of audio representation learning, our method yields considerable improvements over audio-only recognition as well as over prior works that use LUPI. © 2020 IEEE.
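The training signal described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which distance is used between teacher and student embeddings, so mean squared error is an assumption here, as are the weight `alpha`, the choice to sum both terms into one objective, and all function names. In the nonsequential setting one embedding pair (encoder) is passed; in the sequential setting two pairs (encoder and aggregation component) are passed.

```python
import math


def mse(student, teacher):
    """Mean squared error between two equal-length embedding vectors."""
    if len(student) != len(teacher):
        raise ValueError("embeddings must have the same dimension")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def cross_entropy(logits, label):
    """Softmax cross-entropy of one example against an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def distillation_loss(student_embs, teacher_embs):
    """Average MSE over matching (student, teacher) embedding pairs.

    One pair per distilled component: [encoder] in the nonsequential
    setting, [encoder, aggregation] in the sequential setting.
    """
    if len(student_embs) != len(teacher_embs) or not student_embs:
        raise ValueError("need one teacher embedding per student embedding")
    pairs = list(zip(student_embs, teacher_embs))
    return sum(mse(s, t) for s, t in pairs) / len(pairs)


def student_loss(student_embs, teacher_embs, logits, label, alpha=1.0):
    """Embedding distillation plus supervised loss on the task header.

    ASSUMPTION: the two terms are combined as a weighted sum; the source
    only says the encoder is trained from teacher embeddings and the task
    header from ground-truth labels.
    """
    distill = distillation_loss(student_embs, teacher_embs)
    supervised = cross_entropy(logits, label)
    return alpha * distill + supervised


# Nonsequential setting: a single encoder embedding is distilled.
loss_nonseq = student_loss(
    student_embs=[[1.0, 2.0]],
    teacher_embs=[[1.0, 4.0]],  # teacher was trained with audio and video
    logits=[0.0, 0.0],
    label=0,
    alpha=0.5,
)

# Sequential setting: encoder and aggregation embeddings are both distilled.
loss_seq = student_loss(
    student_embs=[[1.0, 2.0], [0.0, 0.0]],
    teacher_embs=[[1.0, 4.0], [0.0, 0.0]],
    logits=[0.0, 0.0],
    label=0,
)
```

At inference only the student's audio branch would be evaluated, so the teacher embeddings (and hence the video modality) are needed during training only.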
Pages: 446-456
Number of pages: 10