Multimodal Representations for Synchronized Speech and Real-Time MRI Video Processing

Cited by: 6
Authors
Kose, Oyku Deniz [1 ]
Saraclar, Murat [1 ]
Affiliations
[1] Bogazici Univ, Dept Elect & Elect Engn, Istanbul 34342, Turkey
Keywords
Task analysis; Data integration; Speech processing; Magnetic resonance imaging; Phonetics; Speech recognition; Neural networks; Machine learning; deep learning; multi-modal information; rtMRI-TIMIT; cross-modality; TISSUE BOUNDARY SEGMENTATION; TRACKING; RECOGNITION; DYNAMICS; DATABASE; FUSION; SHAPE;
DOI
10.1109/TASLP.2021.3084099
Chinese Library Classification
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
Representations for data subunits can help cope with the recent accumulation of data by enabling efficient storage and retrieval systems. In this paper, we investigate the problem of representation generation for phone classification and cross-modal same-different word discrimination tasks. The benefits of utilizing multimodal data for these tasks are examined together with different data fusion schemes. The paper considers two data modalities, upper-airway mid-sagittal-plane real-time magnetic resonance imaging (rtMRI) videos and the corresponding speech waveforms, and conducts experiments on the USC-TIMIT rtMRI dataset. For the phone classification task, two unimodal neural networks are designed, and these separate systems are merged in two different ways that provide data fusion between the two modalities. The proposed networks differ in the stage at which they perform the data fusion. As hypothesized, our results show that data fusion indeed brings a performance improvement over both unimodal approaches, and that performing fusion in earlier stages with cross-connections yields better results than fusing the data in later stages. In addition to the proposed phone classification schemes, different unimodal and multimodal systems are designed to obtain phone recognition results on the USC-TIMIT rtMRI dataset. The phone representations generated for the phone classification task are also utilized in the phone recognition task, and their representative power is illustrated. Finally, we define a cross-view same-different word discrimination task on USC-TIMIT. We propose two different schemes to tackle this task and find that, for cross-view same-different discrimination, generating representations with the help of cross-modality yields better accuracy than a system employing independently created representations.
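The abstract contrasts fusing the two modalities early, with cross-connections between the unimodal branches, against merging them only at the final stage. A minimal sketch of the two fusion points follows; this is not the authors' code, and the toy `layer` function and feature sizes are illustrative stand-ins for the paper's learned neural networks:

```python
def layer(x, shift):
    # Stand-in for a learned neural layer: a fixed affine map (illustrative).
    return [v * 0.5 + shift for v in x]

def late_fusion(mri_feats, audio_feats):
    # Each modality is processed independently; the branch outputs are
    # merged (concatenated) only at the final stage.
    h_mri = layer(layer(mri_feats, 0.1), 0.1)
    h_aud = layer(layer(audio_feats, 0.2), 0.2)
    return h_mri + h_aud

def early_fusion_cross(mri_feats, audio_feats):
    # Cross-connections: after the first stage, each branch also receives
    # the other modality's intermediate features (element-wise sum here).
    h_mri = layer(mri_feats, 0.1)
    h_aud = layer(audio_feats, 0.2)
    mixed = [m + a for m, a in zip(h_mri, h_aud)]
    h_mri2 = layer(mixed, 0.1)
    h_aud2 = layer(mixed, 0.2)
    return h_mri2 + h_aud2

if __name__ == "__main__":
    mri = [1.0, 2.0]   # toy rtMRI-video feature vector
    aud = [3.0, 4.0]   # toy acoustic feature vector
    print(late_fusion(mri, aud))
    print(early_fusion_cross(mri, aud))
```

In the late-fusion variant the branches never interact before the merge, whereas the cross-connected variant lets each branch condition on the other modality from the first stage onward, which is the configuration the abstract reports as performing best.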
Pages: 1912-1924
Number of pages: 13