Multimodal Representations for Synchronized Speech and Real-Time MRI Video Processing

Cited by: 6
Authors
Kose, Oyku Deniz [1]
Saraclar, Murat [1]
Affiliations
[1] Bogazici Univ, Dept Elect & Elect Engn, Istanbul 34342, Turkey
Keywords
Task analysis; Data integration; Speech processing; Magnetic resonance imaging; Phonetics; Speech recognition; Neural networks; Machine learning; deep learning; multi-modal information; rtMRI-TIMIT; cross-modality; TISSUE BOUNDARY SEGMENTATION; TRACKING; RECOGNITION; DYNAMICS; DATABASE; FUSION; SHAPE
DOI
10.1109/TASLP.2021.3084099
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Representations of data subunits can help manage the recent accumulation of data by enabling efficient storage and retrieval systems. In this paper, we investigate the problem of representation generation for phone classification and cross-modal same-different word discrimination tasks. The benefits of utilizing multimodal data on these tasks are examined together with different data fusion schemes. The paper considers two data modalities, real-time magnetic resonance imaging (rtMRI) videos of the upper airway in the mid-sagittal plane and the corresponding speech waveforms, and reports experiments on the USC-TIMIT rtMRI dataset. For the phone classification task, two unimodal neural networks are designed, and these separate systems are merged in two different ways that provide data fusion between the two modalities. The proposed networks differ in the stage at which they perform the data fusion. As hypothesized, our results show that data fusion brings a performance improvement over both unimodal approaches, and that performing fusion in earlier stages with cross-connections yields better results than fusing the data in later stages. In addition to the proposed phone classification schemes, different unimodal and multimodal systems are designed to obtain phone recognition results on the USC-TIMIT rtMRI dataset. Phone representations generated for the phone classification task are also utilized in the phone recognition task, and their representative power is illustrated. Finally, we define a cross-view same-different word discrimination task on USC-TIMIT. We propose two different schemes to tackle this task, and find that for cross-view same-different discrimination, generating representations with the help of cross-modality yields better accuracy than a system employing independently created representations.
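The distinction the abstract draws between fusing the two modalities at an earlier or a later stage can be illustrated with a minimal sketch. This is not the authors' architecture: the toy linear classifiers, the feature sizes, the weight values, and the function names (`late_fusion`, `early_fusion`) are all assumptions made for illustration, and the cross-connections mentioned in the abstract are not modeled here.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, features):
    """Apply a weight matrix (a list of rows) to a feature vector."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def late_fusion(audio_feat, video_feat, w_audio, w_video):
    """Classify each modality separately, then average the posteriors."""
    p_audio = softmax(linear(w_audio, audio_feat))
    p_video = softmax(linear(w_video, video_feat))
    return [(a + v) / 2 for a, v in zip(p_audio, p_video)]

def early_fusion(audio_feat, video_feat, w_joint):
    """Concatenate the features first, then apply one joint classifier."""
    return softmax(linear(w_joint, audio_feat + video_feat))

# Hypothetical toy inputs: 2 audio features, 3 video features, 3 phone classes.
audio = [0.2, 1.0]
video = [0.5, -0.3, 0.8]
w_audio = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w_video = [[0.3, 0.1, 0.0], [0.0, 0.4, 0.2], [0.1, 0.0, 0.6]]
w_joint = [
    [1.0, 0.0, 0.3, 0.1, 0.0],
    [0.0, 1.0, 0.0, 0.4, 0.2],
    [0.5, 0.5, 0.1, 0.0, 0.6],
]

late_probs = late_fusion(audio, video, w_audio, w_video)
early_probs = early_fusion(audio, video, w_joint)
print("late fusion posteriors: ", late_probs)
print("early fusion posteriors:", early_probs)
```

In the late variant the two modalities interact only at the decision level, whereas in the early variant a single classifier sees both feature sets at once; the abstract reports that the earlier-stage scheme (with cross-connections) performed better in the authors' experiments.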
Pages: 1912-1924
Number of pages: 13