Training a language model using webdata for large vocabulary Japanese spontaneous speech recognition

Authors
Masumura, Ryo [1]
Hahm, Seongjun [1]
Ito, Akinori [1]
Affiliations
[1] Tohoku Univ, Grad Sch Engn, Sendai, Miyagi 980, Japan
Source
12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5 | 2011
Keywords
Spontaneous speech recognition; language model; World Wide Web; large vocabulary continuous speech recognition; Corpus of Spontaneous Japanese;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper describes a language modeling method for spontaneous speech recognition that uses large-scale spoken language data retrieved from the Web. We downloaded 15 million Web pages covering a comprehensive range of topics. Next, spoken language-like texts were selected from the downloaded Web data using a naive Bayes classifier, and typical linguistic phenomena such as fillers and pauses were added using simulation models. A language model trained on the generated data performed as well as one trained on a large-scale spontaneous speech corpus (the Corpus of Spontaneous Japanese, CSJ). By combining the generated data with CSJ, we improved word accuracy.
Pages: 1476-1479
Number of pages: 4
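Illustrative sketch
The abstract mentions selecting spoken language-like texts with a naive Bayes classifier. The following is a minimal sketch of that general idea only, not the authors' implementation: the abstract does not state the features, training data, or smoothing used, so the bag-of-words features, add-one smoothing, toy English tokens, the "spoken"/"written" labels, and all function names here are assumptions made for illustration.

```python
import math
from collections import Counter


def train_nb(docs_by_label):
    """Train a multinomial naive Bayes model with add-one smoothing.

    docs_by_label maps a label to a list of token lists (hypothetical toy data).
    """
    vocab = {tok for docs in docs_by_label.values() for doc in docs for tok in doc}
    total_docs = sum(len(docs) for docs in docs_by_label.values())
    model = {}
    for label, docs in docs_by_label.items():
        counts = Counter(tok for doc in docs for tok in doc)
        model[label] = {
            "log_prior": math.log(len(docs) / total_docs),
            "counts": counts,
            # Denominator of the smoothed estimate: token count + vocabulary size.
            "denom": sum(counts.values()) + len(vocab),
        }
    return model, vocab


def log_score(model, vocab, label, tokens):
    """Log prior plus summed log likelihoods; tokens outside the vocabulary are skipped."""
    entry = model[label]
    score = entry["log_prior"]
    for tok in tokens:
        if tok in vocab:
            score += math.log((entry["counts"][tok] + 1) / entry["denom"])
    return score


def classify(model, vocab, tokens):
    """Return the label with the highest log score."""
    return max(model, key=lambda label: log_score(model, vocab, label, tokens))


def select_spoken(model, vocab, sentences):
    """Keep only the sentences classified as spoken-like."""
    return [s for s in sentences if classify(model, vocab, s) == "spoken"]


# Toy training data (assumed, English for readability).
training = {
    "spoken": [
        ["well", "i", "think", "so"],
        ["you", "know", "it", "is", "fun"],
        ["um", "i", "guess", "so"],
    ],
    "written": [
        ["the", "results", "are", "shown"],
        ["this", "paper", "describes", "the", "method"],
        ["the", "method", "is", "evaluated"],
    ],
}
model, vocab = train_nb(training)

# Stand-in for downloaded Web sentences.
web_sentences = [
    ["i", "think", "it", "is", "fun"],
    ["the", "method", "is", "shown"],
]
selected = select_spoken(model, vocab, web_sentences)
```

In this sketch, `selected` holds the sentences that would be passed on to later processing. For Japanese text, the token lists would presumably come from a morphological analyzer, which is outside the scope of this example.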