Unveiling the Fundamental Obstacle in Speech-to-Text Modeling: Understanding and Mitigating the Granularity Challenge

Cited: 0
Authors
Xu, Chen [1 ]
Liu, Xiaoqian [2 ]
Zhang, Yuhao [2 ]
Ma, Anxiang [2 ]
Xiao, Tong [2 ,3 ]
Zhu, Jingbo [2 ,3 ]
Man, Dapeng [1 ]
Yang, Wu [1 ]
Affiliations
[1] Harbin Engn Univ, Coll Comp Sci & Technol, Harbin 150001, Peoples R China
[2] Northeastern Univ, Sch Comp Sci & Engn, Shenyang 110819, Peoples R China
[3] NiuTrans, Shenyang 110819, Peoples R China
Source
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2025, Vol. 33
Funding
National Natural Science Foundation of China; Natural Science Foundation of Heilongjiang Province;
Keywords
Training; Translation; Encoding; Data models; Speech processing; Speech to text; Convergence; Conformer; information aggregation; modeling granularity; speech-to-text generation;
DOI
10.1109/TASLPRO.2025.3555070
CLC Classification Number
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
Speech-to-text (S2T) generation tasks often struggle to achieve satisfactory convergence without relying on auxiliary data or models. We identify the core issue as modeling granularity: the fine-grained and lengthy nature of audio features hinders the effective allocation of attention weights, particularly during encoder self-attention learning. In this paper, we investigate two well-established methods, Conformer and information aggregation, to reduce the learning burden of the encoder from the perspectives of intra-layer and inter-layer encoding. Conformer directly enhances modeling capability through architectural improvement, while aggregation generates coarser-grained representations, thus shaping text-like structures that fundamentally simplify attention learning. Extensive results demonstrate superior convergence and notable improvement on two representative S2T generation tasks, speech recognition and translation. In particular, we achieve an average BLEU score of 26.9 on the MuST-C speech translation datasets without auxiliary resources or approaches. Furthermore, our findings suggest that increased model capacity and sufficient training can effectively mitigate the granularity challenge.
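As a rough illustration of the information-aggregation idea described in the abstract (a hypothetical sketch, not the authors' implementation, with the function name and average-pooling choice assumed for the example): consecutive fine-grained acoustic frames can be pooled into coarser, text-like units, shrinking the sequence over which encoder self-attention operates.

```python
import numpy as np

def aggregate_frames(features: np.ndarray, window: int) -> np.ndarray:
    """Average-pool consecutive audio frames into coarser units.

    features: (num_frames, dim) fine-grained acoustic features.
    window:   number of frames merged into one coarse representation.
    """
    num_frames, dim = features.shape
    # Drop trailing frames that do not fill a full window.
    usable = (num_frames // window) * window
    pooled = features[:usable].reshape(-1, window, dim).mean(axis=1)
    return pooled

# A 1000-frame utterance pooled with window 8 yields 125 coarse units;
# since self-attention cost grows quadratically with length, an 8x
# shorter sequence cuts that cost by roughly 64x.
coarse = aggregate_frames(np.random.randn(1000, 80), window=8)
print(coarse.shape)  # (125, 80)
```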
Pages: 1719-1729
Number of pages: 11