Multimodal Representations for Synchronized Speech and Real-Time MRI Video Processing

Cited: 6
Authors
Kose, Oyku Deniz [1]
Saraclar, Murat [1]
Affiliations
[1] Bogazici Univ, Dept Elect & Elect Engn, Istanbul 34342, Turkey
Keywords
Task analysis; Data integration; Speech processing; Magnetic resonance imaging; Phonetics; Speech recognition; Neural networks; Machine learning; deep learning; multi-modal information; rtMRI-TIMIT; cross-modality; TISSUE BOUNDARY SEGMENTATION; TRACKING; RECOGNITION; DYNAMICS; DATABASE; FUSION; SHAPE
DOI
10.1109/TASLP.2021.3084099
CLC Number
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Representations for data subunits can help manage the recent accumulation of data by enabling efficient storage and retrieval systems. In this paper, we investigate the problem of representation generation for phone classification and cross-modal same-different word discrimination tasks. The benefits of utilizing multimodal data on these tasks are examined together with different data fusion schemes. The paper mainly considers two data modalities, upper airway mid-sagittal plane real-time magnetic resonance imaging (rtMRI) videos and the corresponding speech waveforms, and experiments are conducted on the USC-TIMIT rtMRI dataset. For the phone classification task, two unimodal neural networks are designed, and these separate systems are merged in two different ways that provide data fusion between the two modalities. The proposed networks differ in the stage at which they perform data fusion. As hypothesized, our results show that data fusion indeed brings a performance improvement over both unimodal approaches, and that performing fusion in earlier stages with cross-connections yields better results than fusing the data in later stages. In addition to the proposed phone classification schemes, different unimodal and multimodal systems are designed to obtain phone recognition results on the USC-TIMIT rtMRI dataset. Phone representations generated for the phone classification task are also utilized in the phone recognition task, and their representative power is illustrated. Finally, we define a cross-view same-different word discrimination task on USC-TIMIT. We propose two different schemes to tackle this task, and find that for cross-view same-different discrimination, generating representations with the help of cross-modality yields better accuracy than a system employing independently created representations.
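The early- versus late-fusion contrast described in the abstract can be sketched as follows. This is a minimal illustrative toy, not the paper's actual architecture: all layer sizes, the single-layer linear encoders, and the concatenation/averaging choices are assumptions for exposition (the paper's networks are deeper and use cross-connections for early fusion). Early fusion combines the modalities' hidden features before joint classification; late fusion averages per-modality phone posteriors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: rtMRI feature, audio feature, hidden size, phone classes.
D_MRI, D_AUDIO, D_HID, N_PHONES = 16, 12, 8, 5

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# One unimodal encoder (linear + ReLU) per modality.
W_mri = rng.normal(size=(D_MRI, D_HID))
W_aud = rng.normal(size=(D_AUDIO, D_HID))

def encode(x, W):
    return relu(x @ W)

# Early fusion: concatenate hidden features, then classify jointly.
W_early = rng.normal(size=(2 * D_HID, N_PHONES))

def early_fusion(x_mri, x_aud):
    h = np.concatenate([encode(x_mri, W_mri), encode(x_aud, W_aud)], axis=-1)
    return softmax(h @ W_early)

# Late fusion: classify each modality separately, then average the posteriors.
W_mri_out = rng.normal(size=(D_HID, N_PHONES))
W_aud_out = rng.normal(size=(D_HID, N_PHONES))

def late_fusion(x_mri, x_aud):
    p_mri = softmax(encode(x_mri, W_mri) @ W_mri_out)
    p_aud = softmax(encode(x_aud, W_aud) @ W_aud_out)
    return 0.5 * (p_mri + p_aud)

# Stand-ins for one rtMRI video frame feature and one acoustic frame feature.
x_mri = rng.normal(size=(1, D_MRI))
x_aud = rng.normal(size=(1, D_AUDIO))
p_e = early_fusion(x_mri, x_aud)  # one phone posterior per class
p_l = late_fusion(x_mri, x_aud)
```

In both schemes the output is a distribution over phone classes; the difference is only how early in the pipeline the two information streams interact, which is the design axis the paper's results compare.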
Pages: 1912-1924
Page count: 13
Related Papers
50 records total
  • [1] RECURRENT NEURAL AUDIOVISUAL WORD EMBEDDINGS FOR SYNCHRONIZED SPEECH AND REAL-TIME MRI
    Kose, Oyku Deniz
    Saraclar, Murat
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6424 - 6428
  • [2] Real-time MRI and articulatory coordination in speech
    Demolin, D
    Hassid, S
    Metens, T
    Soquet, A
    COMPTES RENDUS BIOLOGIES, 2002, 325 (04) : 547 - 556
  • [3] Speech Synthesis from Articulatory Movements Recorded by Real-time MRI
    Otani, Yuto
    Sawada, Shun
    Ohmura, Hidefumi
    Katsurada, Kouichi
    INTERSPEECH 2023, 2023, : 127 - 131
  • [4] Multimodal Deep Learning Approach for Real-Time Sentiment Analysis in Video Streaming
    Tejashwini, S. G.
    Aradhana, D.
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2023, 14 (08) : 730 - 736
  • [5] MultiQT: Multimodal Learning for Real-Time Question Tracking in Speech
    Havtorn, Jakob D.
    Latko, Jan
    Edin, Joakim
    Borgholt, Lasse
    Maaloe, Lars
    Belgrano, Lorenzo
    Jacobsen, Nicolai F.
    Sdun, Regitze
    Agic, Zeljko
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 2370 - 2380
  • [6] Speech ReaLLM - Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time
    Seide, Frank
    Doulaty, Morrie
    Shi, Yangyang
    Gaur, Yashesh
    Jia, Junteng
    Wu, Chunyang
    INTERSPEECH 2024, 2024, : 1900 - 1904
  • [7] Reconstructing Speech from Real-Time Articulatory MRI Using Neural Vocoders
    Yu, Yide
    Shandiz, Amin Honarmandi
    Toth, Laszlo
    29TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2021), 2021, : 945 - 949
  • [8] Real-time speech MRI datasets with corresponding articulator ground-truth segmentations
    Ruthven, Matthieu
    Peplinski, Agnieszka M.
    Adams, David M.
    King, Andrew P.
    Miquel, Marc Eric
    SCIENTIFIC DATA, 2023, 10 (01)
  • [9] Database of volumetric and real-time vocal tract MRI for speech science
    Sorensen, Tanner
    Skordilis, Zisis
    Toutios, Asterios
    Kim, Yoon-Chul
    Zhu, Yinghua
    Kim, Jangwon
    Lammert, Adam
    Ramanarayanan, Vikram
    Goldstein, Louis
    Byrd, Dani
    Nayak, Krishna
    Narayanan, Shrikanth
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 645 - 649
  • [10] Salient Object Detection by Spatiotemporal and Semantic Features in Real-Time Video Processing Systems
    Fang, Yuming
    Ding, Guanqun
    Wen, Wenying
    Yuan, Feiniu
    Yang, Yong
    Fang, Zhijun
    Lin, Weisi
    IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, 2020, 67 (11) : 9893 - 9903