Multimodal Representations for Synchronized Speech and Real-Time MRI Video Processing

Cited by: 6
Authors
Kose, Oyku Deniz [1]
Saraclar, Murat [1]
Affiliation
[1] Bogazici Univ, Dept Elect & Elect Engn, Istanbul 34342, Turkey
Keywords
Task analysis; Data integration; Speech processing; Magnetic resonance imaging; Phonetics; Speech recognition; Neural networks; Machine learning; deep learning; multi-modal information; rtMRI-TIMIT; cross-modality; TISSUE BOUNDARY SEGMENTATION; TRACKING; RECOGNITION; DYNAMICS; DATABASE; FUSION; SHAPE
DOI
10.1109/TASLP.2021.3084099
Chinese Library Classification number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
Representations of data subunits can help cope with the recent accumulation of data by enabling efficient storage and retrieval systems. In this paper, we investigate the problem of representation generation for phone classification and cross-modal same-different word discrimination tasks. The benefits of utilizing multimodal data on these tasks are examined together with different data fusion schemes. The paper considers two data modalities, upper airway mid-sagittal plane real-time magnetic resonance imaging (rtMRI) videos and the corresponding speech waveforms, and experiments are conducted on the USC-TIMIT rtMRI dataset. For the phone classification task, two unimodal neural networks are designed, and these separate systems are merged in two different ways that provide data fusion between the two modalities. The proposed networks differ in the stage at which they perform data fusion. As hypothesized, our results show that data fusion improves performance over both unimodal approaches, and that performing fusion in earlier stages with cross-connections yields better results than fusing the data in later stages. In addition to the proposed phone classification schemes, different unimodal and multimodal systems are designed to obtain phone recognition results on the USC-TIMIT rtMRI dataset. Phone representations generated for the phone classification task are also utilized in the phone recognition task, and their representative power is illustrated. Finally, we define a cross-view same-different word discrimination task on USC-TIMIT. We propose two different schemes to tackle this task, and find that for cross-view same-different discrimination, generating representations with the help of cross-modality yields better accuracy than a system employing independently created representations.
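The abstract does not say how a same-different decision is made from the representations, so the following is only an illustrative sketch under stated assumptions: each word segment is assumed to be already embedded as a fixed-length vector in each view (speech and rtMRI), pairs are compared with cosine distance, and a fixed threshold decides "same word". The function names, the threshold value, and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def same_word(speech_emb, mri_emb, threshold=0.5):
    """Decide 'same word' when the cross-view distance is below the threshold.

    The threshold of 0.5 is an arbitrary illustrative value (assumption).
    """
    return cosine_distance(speech_emb, mri_emb) < threshold


def pair_accuracy(pairs, threshold=0.5):
    """Fraction of (speech_emb, mri_emb, is_same) triples decided correctly."""
    correct = sum(same_word(s, m, threshold) == label for s, m, label in pairs)
    return correct / len(pairs)


# Toy, made-up embeddings for illustration only (not data from the paper).
toy_pairs = [
    ([1.0, 0.0], [0.9, 0.1], True),
    ([1.0, 0.0], [0.0, 1.0], False),
    ([0.0, 1.0], [0.1, 0.9], True),
    ([0.0, 1.0], [1.0, 0.2], False),
]
accuracy = pair_accuracy(toy_pairs)
```

In this framing, the comparison the abstract reports would amount to how the two embedding functions are obtained: trained jointly with cross-modal information, or created independently per view. The decision rule itself stays the same in both cases.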
Pages: 1912-1924
Number of pages: 13