End-to-End Neural Segmental Models for Speech Recognition

Cited by: 11
Authors
Tang, Hao [1 ]
Lu, Liang [1 ]
Kong, Lingpeng [2 ]
Gimpel, Kevin [1 ]
Livescu, Karen [1 ]
Dyer, Chris [2 ,3 ]
Smith, Noah A. [4 ]
Renals, Steve [5 ]
Affiliations
[1] Toyota Technol Inst Chicago, Chicago, IL 60637 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[3] Google DeepMind, London N1C 4AG, England
[4] Univ Washington, Seattle, WA 98195 USA
[5] Univ Edinburgh, Edinburgh EH8 9AB, Midlothian, Scotland
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Connectionist temporal classification; end-to-end training; multitask training; segmental models; LANGUAGE;
DOI
10.1109/JSTSP.2017.2752462
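Chinese Library Classification
TM [Electrical engineering]; TN [Electronics and communication technology];
Discipline classification codes
0808 ; 0809 ;
Abstract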
Segmental models are an alternative to frame-based models for sequence prediction, where hypothesized path weights are based on entire segment scores rather than a single frame at a time. Neural segmental models are segmental models that use neural network-based weight functions. Neural segmental models have achieved competitive results for speech recognition, and their end-to-end training has been explored in several studies. In this work, we review neural segmental models, which can be viewed as consisting of a neural network-based acoustic encoder and a finite-state transducer decoder. We study end-to-end segmental models with different weight functions, including ones based on frame-level neural classifiers and on segmental recurrent neural networks. We study how reducing the search space size impacts performance under different weight functions. We also compare several loss functions for end-to-end training. Finally, we explore training approaches, including multistage versus end-to-end training and multitask training that combines segmental and frame-level losses.
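As an illustration of the idea in the abstract (path weights computed from whole segments, with the search space limited by a maximum segment length), here is a minimal Python sketch. It is not the authors' implementation. The function name `segmental_viterbi`, the `max_len` parameter, and the choice to score a segment as the sum of per-frame label scores (one simple weight function built on a frame-level classifier) are all assumptions made for this example.

```python
import numpy as np

def segmental_viterbi(frame_scores, max_len):
    """Find the highest-scoring segmentation of T frames.

    frame_scores: (T, L) array of per-frame label scores (e.g. log-probabilities).
    max_len: maximum segment length in frames (hypothetical search-space limit).
    Returns (best_score, segments), where each segment is (start, end, label)
    with `end` exclusive.
    """
    T, L = frame_scores.shape
    # Prefix sums: the score of label y on segment [s, t) is cum[t, y] - cum[s, y].
    cum = np.vstack([np.zeros((1, L)), np.cumsum(frame_scores, axis=0)])
    best = np.full(T + 1, -np.inf)   # best[t] = best score of any path ending at frame t
    best[0] = 0.0
    back = [None] * (T + 1)          # back[t] = (segment start, label) of the last segment
    for t in range(1, T + 1):
        for s in range(max(0, t - max_len), t):
            seg = cum[t] - cum[s]    # assumed weight function: sum of frame scores
            label = int(np.argmax(seg))
            cand = best[s] + seg[label]
            if cand > best[t]:
                best[t] = cand
                back[t] = (s, label)
    # Follow back-pointers from the final frame to recover the segmentation.
    segments = []
    t = T
    while t > 0:
        s, label = back[t]
        segments.append((s, t, label))
        t = s
    segments.reverse()
    return float(best[T]), segments

# Toy input: frames 0-1 favour label 0, frames 2-3 favour label 1.
scores = np.array([[0.0, -1.0], [0.0, -1.0], [-1.0, 0.0], [-1.0, 0.0]])
best_score, segments = segmental_viterbi(scores, max_len=3)
```

Lowering `max_len` shrinks the set of candidate segments that the search considers, which is the kind of search-space reduction the abstract refers to. A segmental recurrent neural network would replace the summed frame scores with a learned score for each whole segment.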
Pages: 1254-1264
Page count: 11
Related papers
50 in total
  • [21] END-TO-END AUDIOVISUAL SPEECH RECOGNITION
    Petridis, Stavros
    Stafylakis, Themos
    Ma, Pingchuan
    Cai, Feipeng
    Tzimiropoulos, Georgios
    Pantic, Maja
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 6548 - 6552
  • [22] END-TO-END ANCHORED SPEECH RECOGNITION
    Wang, Yiming
    Fan, Xing
    Chen, I-Fan
    Liu, Yuzong
    Chen, Tongfei
    Hoffmeister, Bjorn
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 7090 - 7094
  • [23] Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks
    Zhang, Ying
    Pezeshki, Mohammad
    Brakel, Philemon
    Zhang, Saizheng
    Laurent, Cesar
    Bengio, Yoshua
    Courville, Aaron
    17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 410 - 414
  • [24] Unified Architecture for Multichannel End-to-End Speech Recognition With Neural Beamforming
    Ochiai, Tsubasa
    Watanabe, Shinji
    Hori, Takaaki
    Hershey, John R.
    Xiao, Xiong
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2017, 11 (08) : 1274 - 1288
  • [25] Quaternion Convolutional Neural Networks for End-to-End Automatic Speech Recognition
    Parcollet, Titouan
    Zhang, Ying
    Morchid, Mohamed
    Trabelsi, Chiheb
    Linares, Georges
    De Mori, Renato
    Bengio, Yoshua
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 22 - 26
  • [26] Unidirectional Neural Network Architectures for End-to-End Automatic Speech Recognition
    Moritz, Niko
    Hori, Takaaki
    Le Roux, Jonathan
    INTERSPEECH 2019, 2019, : 76 - 80
  • [27] A Neural Time Alignment Module for End-to-End Automatic Speech Recognition
    Jiang, Dongcheng
    Zhang, Chao
    Woodland, Philip C.
    INTERSPEECH 2023, 2023, : 1374 - 1378
  • [28] END-TO-END SPEECH EMOTION RECOGNITION USING DEEP NEURAL NETWORKS
    Tzirakis, Panagiotis
    Zhang, Jiehao
    Schuller, Bjoern W.
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5089 - 5093
  • [29] A COMPARISON OF END-TO-END MODELS FOR LONG-FORM SPEECH RECOGNITION
    Chiu, Chung-Cheng
    Han, Wei
    Zhang, Yu
    Pang, Ruoming
    Kishchenko, Sergey
    Nguyen, Patrick
    Narayanan, Arun
    Liao, Hank
    Zhang, Shuyuan
    Kannan, Anjuli
    Prabhavalkar, Rohit
    Chen, Zhifeng
    Sainath, Tara
    Wu, Yonghui
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 889 - 896
  • [30] Confidence-based Ensembles of End-to-End Speech Recognition Models
    Gitman, Igor
    Lavrukhin, Vitaly
    Laptev, Aleksandr
    Ginsburg, Boris
    INTERSPEECH 2023, 2023, : 1414 - 1418