Speaker-Adaptive Lip Reading with User-Dependent Padding

Times cited: 10
Authors
Kim, Minsu [1]
Kim, Hyunjun [1]
Ro, Yong Man [1]
Affiliation
[1] Korea Adv Inst Sci & Technol, Sch Elect Engn, Image & Video Syst Lab, Daejeon, South Korea
Source
COMPUTER VISION, ECCV 2022, PT XXXVI | 2022 / Vol. 13696
Keywords
Visual speech recognition; Lip reading; Speaker-adaptive training; Speaker adaptation; User-dependent padding; LRW-ID; ADAPTATION;
DOI
10.1007/978-3-031-20059-5_33
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Lip reading aims to predict speech from lip movements alone. Because it relies solely on visual information to model speech, its performance is inherently sensitive to individual lip appearances and movements. As a result, lip reading models show degraded performance when applied to unseen speakers, owing to the mismatch between training and testing conditions. Speaker adaptation techniques aim to reduce this mismatch between training and test speakers, guiding a trained model to focus on modeling the speech content without interference from speaker variations. In contrast to decades of effort in audio-based speech recognition, speaker adaptation methods have not been well studied in lip reading. In this paper, to remedy the performance degradation of lip reading models on unseen speakers, we propose a speaker-adaptive lip reading method, namely user-dependent padding. The user-dependent padding is a speaker-specific input that participates in the visual feature extraction stage of a pre-trained lip reading model. Therefore, the lip appearance and movement information of different speakers can be taken into account during visual feature encoding, adaptively for each speaker. Moreover, the proposed method requires neither 1) any additional layers, 2) modification of the learned weights of the pre-trained model, nor 3) speaker labels for the training data used during pre-training. It can directly adapt to unseen speakers by learning only the user-dependent padding, in either a supervised or an unsupervised manner. Finally, to alleviate the insufficiency of speaker information in public lip reading databases, we label the speakers of a well-known audio-visual database, LRW, and design an unseen-speaker lip reading scenario named LRW-ID. The effectiveness of the proposed method is verified on sentence- and word-level lip reading, and we show that it can further improve the performance of a well-trained model with large speaker variations.
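The abstract does not spell out exactly where the padding enters the network. Assuming, as the name suggests, that the learned values take the place of the zero padding around a convolution's input, the toy sketch below illustrates the idea in one dimension with pure Python. Only the border values are speaker-specific and trainable, while the kernel (standing in for the pre-trained weights) is never updated. All function names (`pad_with`, `conv1d`, `udp_conv1d`, `adapt_padding`), the speaker dictionary, and the supervised squared-error objective are hypothetical illustrations, not the authors' implementation.

```python
from typing import Dict, List, Sequence, Tuple


def pad_with(x: Sequence[float], left: Sequence[float], right: Sequence[float]) -> List[float]:
    """Surround signal x with (possibly learned) padding values instead of zeros."""
    return list(left) + list(x) + list(right)


def conv1d(x_padded: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """'Valid' 1-D cross-correlation, i.e. what a deep-learning conv layer computes."""
    k = len(kernel)
    return [
        sum(kernel[j] * x_padded[i + j] for j in range(k))
        for i in range(len(x_padded) - k + 1)
    ]


def udp_conv1d(x, kernel, left_pad, right_pad) -> List[float]:
    """Same-length convolution whose border inputs come from speaker-specific padding.

    Hypothetical toy: zero padding is the special case left_pad = right_pad = [0.0].
    """
    assert len(left_pad) + len(right_pad) == len(kernel) - 1
    return conv1d(pad_with(x, left_pad, right_pad), kernel)


def adapt_padding(x, kernel, target, left_pad, right_pad,
                  lr: float = 0.5, steps: int = 200) -> Tuple[List[float], List[float]]:
    """Fit ONLY the padding values by gradient descent on a squared error.

    The kernel (the frozen 'pre-trained' weights) is read but never modified.
    """
    left, right = list(left_pad), list(right_pad)
    n_left = len(left)
    for _ in range(steps):
        x_padded = pad_with(x, left, right)
        err = [y - t for y, t in zip(conv1d(x_padded, kernel), target)]
        # d(sum of err^2) / d x_padded[m]: accumulate over every output that reads slot m.
        grad = [0.0] * len(x_padded)
        for i, e in enumerate(err):
            for j, k_j in enumerate(kernel):
                grad[i + j] += 2.0 * e * k_j
        # Gradient step on the padding slots only; input and kernel stay fixed.
        left = [v - lr * g for v, g in zip(left, grad[:n_left])]
        right = [v - lr * g for v, g in zip(right, grad[n_left + len(x):])]
    return left, right


# Usage: adapt padding for one (hypothetical) unseen speaker.
kernel = [0.5, 1.0, 0.5]                           # stands in for frozen pre-trained weights
x = [1.0, 2.0, 3.0, 4.0]                           # one feature row from the new speaker
target = udp_conv1d(x, kernel, [2.0], [-1.0])      # pretend supervision: [3.0, 4.0, 6.0, 5.0]
speaker_pads: Dict[str, Tuple[List[float], List[float]]] = {}
speaker_pads["speaker_A"] = adapt_padding(x, kernel, target, [0.0], [0.0])
print(speaker_pads["speaker_A"])                   # approximately ([2.0], [-1.0])
```

In this toy setting, adapting to a speaker stores just `len(kernel) - 1` numbers and leaves the kernel untouched, which mirrors points 1) and 2) of the abstract: no extra layers and no change to the learned weights. Interior outputs are identical to those of the zero-padded model; only positions that read the border change.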
Pages: 576-593
Number of pages: 18