Multimodal Representations for Synchronized Speech and Real-Time MRI Video Processing

Cited: 6
Authors
Kose, Oyku Deniz [1]
Saraclar, Murat [1]
Affiliations
[1] Bogazici Univ, Dept Elect & Elect Engn, Istanbul 34342, Turkey
Keywords
Task analysis; Data integration; Speech processing; Magnetic resonance imaging; Phonetics; Speech recognition; Neural networks; Machine learning; deep learning; multi-modal information; rtMRI-TIMIT; cross-modality; TISSUE BOUNDARY SEGMENTATION; TRACKING; RECOGNITION; DYNAMICS; DATABASE; FUSION; SHAPE
DOI
10.1109/TASLP.2021.3084099
CLC Number
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Representations for data subunits can help manage the recent accumulation of data by enabling efficient storage and retrieval systems. In this paper, we investigate the problem of representation generation for phone classification and cross-modal same-different word discrimination tasks. The benefits of utilizing multimodal data on these tasks are examined together with different data fusion schemes. The paper mainly considers two data modalities, upper airway mid-sagittal plane real-time magnetic resonance imaging (rtMRI) videos and the corresponding speech waveforms, and experiments are conducted on the USC-TIMIT rtMRI dataset. For the phone classification task, two unimodal neural networks are designed, and these separate systems are merged in two different ways that provide data fusion between the two modalities. The proposed networks differ in the stage at which they perform data fusion. As hypothesized, our results show that data fusion indeed brings a performance improvement over both unimodal approaches, and that performing fusion in earlier stages with cross-connections yields better results than fusing the data in later stages. In addition to the proposed phone classification schemes, different unimodal and multimodal systems are designed to obtain phone recognition results on the USC-TIMIT rtMRI dataset. Phone representations generated for the phone classification task are also utilized in the phone recognition task, and their representative power is illustrated. Finally, we define a cross-view same-different word discrimination task on USC-TIMIT. We propose two different schemes to tackle this task, and find that for cross-view same-different discrimination, generating representations with the help of cross-modality yields better accuracy than a system employing independently created representations.
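The early- versus late-fusion contrast described in the abstract can be sketched as follows. This is a minimal illustrative toy, not the paper's actual architecture: all layer sizes, the single-layer linear encoders, and the concatenation/averaging choices are assumptions for exposition (the paper's networks are deeper and use cross-connections for early fusion). Early fusion combines the modalities' hidden features before joint classification; late fusion averages per-modality phone posteriors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: rtMRI feature, audio feature, hidden size, phone classes.
D_MRI, D_AUDIO, D_HID, N_PHONES = 16, 12, 8, 5

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# One unimodal encoder (linear + ReLU) per modality.
W_mri = rng.normal(size=(D_MRI, D_HID))
W_aud = rng.normal(size=(D_AUDIO, D_HID))

def encode(x, W):
    return relu(x @ W)

# Early fusion: concatenate hidden features, then classify jointly.
W_early = rng.normal(size=(2 * D_HID, N_PHONES))

def early_fusion(x_mri, x_aud):
    h = np.concatenate([encode(x_mri, W_mri), encode(x_aud, W_aud)], axis=-1)
    return softmax(h @ W_early)

# Late fusion: classify each modality separately, then average the posteriors.
W_mri_out = rng.normal(size=(D_HID, N_PHONES))
W_aud_out = rng.normal(size=(D_HID, N_PHONES))

def late_fusion(x_mri, x_aud):
    p_mri = softmax(encode(x_mri, W_mri) @ W_mri_out)
    p_aud = softmax(encode(x_aud, W_aud) @ W_aud_out)
    return 0.5 * (p_mri + p_aud)

# Stand-ins for one rtMRI video frame feature and one acoustic frame feature.
x_mri = rng.normal(size=(1, D_MRI))
x_aud = rng.normal(size=(1, D_AUDIO))
p_e = early_fusion(x_mri, x_aud)  # one phone posterior per class
p_l = late_fusion(x_mri, x_aud)
```

In both schemes the output is a distribution over phone classes; the difference is only how early in the pipeline the two information streams interact, which is the design axis the paper's results compare.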
Pages: 1912-1924
Page count: 13
Related Papers
50 records total
  • [1] RECURRENT NEURAL AUDIOVISUAL WORD EMBEDDINGS FOR SYNCHRONIZED SPEECH AND REAL-TIME MRI
    Kose, Oyku Deniz
    Saraclar, Murat
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6424 - 6428
  • [2] Real-time MRI and articulatory coordination in speech
    Demolin, D
    Hassid, S
    Metens, T
    Soquet, A
    COMPTES RENDUS BIOLOGIES, 2002, 325 (04) : 547 - 556
  • [3] Speech Synthesis from Articulatory Movements Recorded by Real-time MRI
    Otani, Yuto
    Sawada, Shun
    Ohmura, Hidefumi
    Katsurada, Kouichi
    INTERSPEECH 2023, 2023, : 127 - 131
  • [4] Multimodal Deep Learning Approach for Real-Time Sentiment Analysis in Video Streaming
    Tejashwini, S. G.
    Aradhana, D.
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2023, 14 (08) : 730 - 736
  • [5] MultiQT: Multimodal Learning for Real-Time Question Tracking in Speech
    Havtorn, Jakob D.
    Latko, Jan
    Edin, Joakim
    Borgholt, Lasse
    Maaloe, Lars
    Belgrano, Lorenzo
    Jacobsen, Nicolai F.
    Sdun, Regitze
    Agic, Zeljko
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 2370 - 2380
  • [6] Speech ReaLLM - Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time
    Seide, Frank
    Doulaty, Morrie
    Shi, Yangyang
    Gaur, Yashesh
    Jia, Junteng
    Wu, Chunyang
    INTERSPEECH 2024, 2024, : 1900 - 1904
  • [7] Reconstructing Speech from Real-Time Articulatory MRI Using Neural Vocoders
    Yu, Yide
    Shandiz, Amin Honarmandi
    Toth, Laszlo
    29TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2021), 2021, : 945 - 949
  • [8] Real-time speech MRI datasets with corresponding articulator ground-truth segmentations
    Ruthven, Matthieu
    Peplinski, Agnieszka M.
    Adams, David M.
    King, Andrew P.
    Miquel, Marc Eric
    SCIENTIFIC DATA, 2023, 10 (01)
  • [9] Database of volumetric and real-time vocal tract MRI for speech science
    Sorensen, Tanner
    Skordilis, Zisis
    Toutios, Asterios
    Kim, Yoon-Chul
    Zhu, Yinghua
    Kim, Jangwon
    Lammert, Adam
    Ramanarayanan, Vikram
    Goldstein, Louis
    Byrd, Dani
    Nayak, Krishna
    Narayanan, Shrikanth
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 645 - 649
  • [10] Salient Object Detection by Spatiotemporal and Semantic Features in Real-Time Video Processing Systems
    Fang, Yuming
    Ding, Guanqun
    Wen, Wenying
    Yuan, Feiniu
    Yang, Yong
    Fang, Zhijun
    Lin, Weisi
    IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, 2020, 67 (11) : 9893 - 9903