Multimodal Representations for Synchronized Speech and Real-Time MRI Video Processing

Cited by: 6

Authors:

Kose, Oyku Deniz [1]

Saraclar, Murat [1]

Affiliations:

[1] Bogazici Univ, Dept Elect & Elect Engn, Istanbul 34342, Turkey

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2021 / Vol. 29

Keywords:

Task analysis; Data integration; Speech processing; Magnetic resonance imaging; Phonetics; Speech recognition; Neural networks; Machine learning; deep learning; multi-modal information; rtMRI-TIMIT; cross-modality; TISSUE BOUNDARY SEGMENTATION; TRACKING; RECOGNITION; DYNAMICS; DATABASE; FUSION; SHAPE;

DOI:

10.1109/TASLP.2021.3084099

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Representations for data subunits can help manage the recent accumulation of data by enabling efficient storage and retrieval systems. In this paper, we investigate the problem of representation generation for phone classification and cross-modal same-different word discrimination tasks. The benefits of utilizing multimodal data on these tasks are examined together with different data fusion schemes. The paper mainly considers two data modalities, upper airway mid-sagittal plane real-time magnetic resonance imaging (rtMRI) videos and the corresponding speech waveforms, and runs experiments on the USC-TIMIT rtMRI dataset. For the phone classification task, two unimodal neural networks are designed, and these separate systems are merged in two different ways that provide data fusion between the two modalities. The proposed networks differ in the stage at which they perform the data fusion. As hypothesized, our results show that data fusion improves performance over both unimodal approaches, and that performing fusion in earlier stages with cross-connections yields better results than fusing the data in later stages. In addition to the proposed phone classification schemes, different unimodal and multimodal systems are designed to obtain phone recognition results on the USC-TIMIT rtMRI dataset. Phone representations generated for the phone classification task are also utilized in the phone recognition task, and their representative power is illustrated. Finally, we define a cross-view same-different word discrimination task on USC-TIMIT. We propose two different schemes to tackle this task, and find that for cross-view same-different discrimination, generating representations with the help of cross-modality yields better accuracy than a system employing independently created representations.

Pages: 1912-1924

Number of pages: 13
