Unveiling the Fundamental Obstacle in Speech-to-Text Modeling: Understanding and Mitigating the Granularity Challenge

Cited: 0
Authors
Xu, Chen [1 ]
Liu, Xiaoqian [2 ]
Zhang, Yuhao [2 ]
Ma, Anxiang [2 ]
Xiao, Tong [2 ,3 ]
Zhu, Jingbo [2 ,3 ]
Man, Dapeng [1 ]
Yang, Wu [1 ]
Affiliations
[1] Harbin Engn Univ, Coll Comp Sci & Technol, Harbin 150001, Peoples R China
[2] Northeastern Univ, Sch Comp Sci & Engn, Shenyang 110819, Peoples R China
[3] NiuTrans, Shenyang 110819, Peoples R China
Source
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2025, Vol. 33
Funding
National Natural Science Foundation of China; Natural Science Foundation of Heilongjiang Province;
Keywords
Training; Translation; Encoding; Data models; Speech processing; Speech to text; Convergence; Conformer; information aggregation; modeling granularity; speech-to-text generation;
DOI
10.1109/TASLPRO.2025.3555070
CLC Classification Number
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
Speech-to-text (S2T) generation tasks often struggle to achieve satisfactory convergence without relying on auxiliary data or models. We identify the core issue as modeling granularity: the fine-grained and lengthy nature of audio features hinders the effective allocation of attention weights, particularly during encoder self-attention learning. In this paper, we investigate two well-established methods, Conformer and information aggregation, to reduce the learning burden of the encoder from the perspectives of intra-layer and inter-layer encoding. Conformer directly enhances modeling capability through architectural improvement, while aggregation generates coarser-grained representations, thus shaping text-like structures that fundamentally simplify attention learning. Extensive results demonstrate superior convergence and notable improvement on two representative S2T generation tasks, speech recognition and translation. In particular, we achieve an average BLEU score of 26.9 on the MuST-C speech translation datasets without auxiliary resources or approaches. Furthermore, our findings suggest that increased model capacity and sufficient training can effectively mitigate the granularity challenge.
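As a rough illustration of the information-aggregation idea described in the abstract (a hypothetical sketch, not the authors' implementation, with the function name and average-pooling choice assumed for the example): consecutive fine-grained acoustic frames can be pooled into coarser, text-like units, shrinking the sequence over which encoder self-attention operates.

```python
import numpy as np

def aggregate_frames(features: np.ndarray, window: int) -> np.ndarray:
    """Average-pool consecutive audio frames into coarser units.

    features: (num_frames, dim) fine-grained acoustic features.
    window:   number of frames merged into one coarse representation.
    """
    num_frames, dim = features.shape
    # Drop trailing frames that do not fill a full window.
    usable = (num_frames // window) * window
    pooled = features[:usable].reshape(-1, window, dim).mean(axis=1)
    return pooled

# A 1000-frame utterance pooled with window 8 yields 125 coarse units;
# since self-attention cost grows quadratically with length, an 8x
# shorter sequence cuts that cost by roughly 64x.
coarse = aggregate_frames(np.random.randn(1000, 80), window=8)
print(coarse.shape)  # (125, 80)
```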
Pages: 1719-1729
Number of pages: 11