Multimodal Representations for Synchronized Speech and Real-Time MRI Video Processing

Cited by: 6
Authors
Kose, Oyku Deniz [1 ]
Saraclar, Murat [1 ]
Affiliations
[1] Bogazici Univ, Dept Elect & Elect Engn, Istanbul 34342, Turkey
Keywords
Task analysis; Data integration; Speech processing; Magnetic resonance imaging; Phonetics; Speech recognition; Neural networks; Machine learning; deep learning; multi-modal information; rtMRI-TIMIT; cross-modality; TISSUE BOUNDARY SEGMENTATION; TRACKING; RECOGNITION; DYNAMICS; DATABASE; FUSION; SHAPE;
DOI
10.1109/TASLP.2021.3084099
Chinese Library Classification
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
Representations for data subunits can help cope with the recent accumulation of data by enabling efficient storage and retrieval systems. In this paper, we investigate the problem of representation generation for phone classification and cross-modal same-different word discrimination tasks. The benefits of utilizing multimodal data for these tasks are examined together with different data fusion schemes. The paper considers two data modalities, upper-airway mid-sagittal-plane real-time magnetic resonance imaging (rtMRI) videos and the corresponding speech waveforms, and conducts experiments on the USC-TIMIT rtMRI dataset. For the phone classification task, two unimodal neural networks are designed, and these separate systems are merged in two different ways that provide data fusion between the two modalities. The proposed networks differ in the stage at which they perform the data fusion. As hypothesized, our results show that data fusion indeed brings a performance improvement over both unimodal approaches, and that performing fusion in earlier stages with cross-connections yields better results than fusing the data in later stages. In addition to the proposed phone classification schemes, different unimodal and multimodal systems are designed to obtain phone recognition results on the USC-TIMIT rtMRI dataset. The phone representations generated for the phone classification task are also utilized in the phone recognition task, and their representative power is illustrated. Finally, we define a cross-view same-different word discrimination task on USC-TIMIT. We propose two different schemes to tackle this task and find that, for cross-view same-different discrimination, generating representations with the help of cross-modality yields better accuracy than a system employing independently created representations.
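The abstract contrasts fusing the two modalities early, with cross-connections between the unimodal branches, against merging them only at the final stage. A minimal sketch of the two fusion points follows; this is not the authors' code, and the toy `layer` function and feature sizes are illustrative stand-ins for the paper's learned neural networks:

```python
def layer(x, shift):
    # Stand-in for a learned neural layer: a fixed affine map (illustrative).
    return [v * 0.5 + shift for v in x]

def late_fusion(mri_feats, audio_feats):
    # Each modality is processed independently; the branch outputs are
    # merged (concatenated) only at the final stage.
    h_mri = layer(layer(mri_feats, 0.1), 0.1)
    h_aud = layer(layer(audio_feats, 0.2), 0.2)
    return h_mri + h_aud

def early_fusion_cross(mri_feats, audio_feats):
    # Cross-connections: after the first stage, each branch also receives
    # the other modality's intermediate features (element-wise sum here).
    h_mri = layer(mri_feats, 0.1)
    h_aud = layer(audio_feats, 0.2)
    mixed = [m + a for m, a in zip(h_mri, h_aud)]
    h_mri2 = layer(mixed, 0.1)
    h_aud2 = layer(mixed, 0.2)
    return h_mri2 + h_aud2

if __name__ == "__main__":
    mri = [1.0, 2.0]   # toy rtMRI-video feature vector
    aud = [3.0, 4.0]   # toy acoustic feature vector
    print(late_fusion(mri, aud))
    print(early_fusion_cross(mri, aud))
```

In the late-fusion variant the branches never interact before the merge, whereas the cross-connected variant lets each branch condition on the other modality from the first stage onward, which is the configuration the abstract reports as performing best.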
Pages: 1912-1924
Number of pages: 13