Multimodal Representations for Synchronized Speech and Real-Time MRI Video Processing

Cited by: 6

Authors:

Kose, Oyku Deniz [1]

Saraclar, Murat [1]

Affiliations:

[1] Bogazici Univ, Dept Elect & Elect Engn, Istanbul 34342, Turkey

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2021 / Vol. 29

Keywords:

Task analysis; Data integration; Speech processing; Magnetic resonance imaging; Phonetics; Speech recognition; Neural networks; Machine learning; deep learning; multi-modal information; rtMRI-TIMIT; cross-modality; TISSUE BOUNDARY SEGMENTATION; TRACKING; RECOGNITION; DYNAMICS; DATABASE; FUSION; SHAPE;

DOI:

10.1109/TASLP.2021.3084099

Chinese Library Classification number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Representations for data subunits can help cope with the recent accumulation of data by enabling efficient storage and retrieval systems. In this paper, we investigate the problem of representation generation for phone classification and cross-modal same-different word discrimination tasks. The benefits of utilizing multimodal data on these tasks are examined together with different data fusion schemes. The paper mainly considers two data modalities, upper airway mid-sagittal plane real-time magnetic resonance imaging (rtMRI) videos and the corresponding speech waveforms, and conducts experiments on the USC-TIMIT rtMRI dataset. For the phone classification task, two unimodal neural networks are designed, and these separate systems are merged in two different ways that provide data fusion between the two modalities. The proposed networks differ in the stage at which they perform the data fusion. As hypothesized, our results show that data fusion brings a performance improvement over both unimodal approaches, and that performing fusion in earlier stages with cross-connections yields better results than fusing the data in later stages. In addition to the proposed phone classification schemes, different unimodal and multimodal systems are designed to obtain phone recognition results on the USC-TIMIT rtMRI dataset. Phone representations generated for the phone classification task are also utilized in the phone recognition task, and their representative power is illustrated. Finally, we define a cross-view same-different word discrimination task on USC-TIMIT. We propose two different schemes to tackle this task, and find that for cross-view same-different discrimination, generating representations with the help of cross-modality yields better accuracy than a system employing independently created representations.
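
The two ideas the abstract contrasts, fusing modalities at the feature level versus at the decision level, and deciding same-different from a distance between representations, can be illustrated with a toy sketch. This is not the paper's architecture: plain feature concatenation stands in for the cross-connected networks, averaging class posteriors stands in for later-stage fusion, and cosine distance with a fixed threshold stands in for the discrimination scheme. All function names, the `audio_weight` parameter, and the `threshold` value are assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def early_fusion(audio_features, video_features):
    """Feature-level fusion (assumed form): concatenate both feature vectors
    so that a single downstream classifier sees both modalities."""
    return list(audio_features) + list(video_features)


def late_fusion(audio_scores, video_scores, audio_weight=0.5):
    """Decision-level fusion (assumed form): weighted average of the class
    posteriors produced by two separate unimodal classifiers."""
    p_audio = softmax(audio_scores)
    p_video = softmax(video_scores)
    return [audio_weight * a + (1.0 - audio_weight) * v
            for a, v in zip(p_audio, p_video)]


def cosine_distance(u, v):
    """1 - cosine similarity between two representation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def same_word(speech_repr, video_repr, threshold=0.5):
    """Hypothetical same-different decision: declare 'same' when the two
    representations are closer than the (arbitrary) threshold."""
    return cosine_distance(speech_repr, video_repr) < threshold
```

For a cross-view comparison such as `same_word(speech_repr, video_repr)` to be meaningful, both representations must live in a shared space, which is the motivation the abstract gives for generating them with the help of cross-modality rather than independently.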

Pages: 1912-1924

Number of pages: 13
