A method for constructing Korean spontaneous spoken language corpus based on an imitation of abbreviated and transformed particles

Cited by: 0
Authors
Ri, Hyok-Chol [1]
Kim, Chol [1]
Jo, Mok-Ran [1]
Affiliations
[1] Kim Il Sung Univ, Fac Informat Sci, Pyongyang, North Korea
Keywords
Automatic speech recognition (ASR); Language model (LM); Spontaneous speech; Language corpus; Transcription
DOI
10.1007/s10772-021-09937-6
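Chinese Library Classification (CLC) codes
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification codes
0808; 0809
Abstract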
In this paper, we propose a method for constructing a language corpus by imitating the abbreviated and transformed particles that are a distinctive feature of Korean spontaneous spoken language. Because training a spoken-style model on a large number of spoken transcripts is impractical, the proposed approach generates spoken-style text from written-style text such as newspapers, based on how the pronunciation of typical particles varies with speaking style. The method rests on a statistical analysis of particles that serve the same function in both written and spoken language. We analyze the grammatical functions and pronunciation features of the particles that distinguish written from spoken language, and we generate spoken-style text from written-style text by imitating typical abbreviated and transformed particles that serve the same function. The particles chosen for imitation exhibit pronunciation features that are characteristic of spoken language. We replace the particles in written-style text with their abbreviated and transformed counterparts according to the correspondence between written and spoken particles, which yields spoken-style text. A language model trained on the resulting spoken-style text significantly improved the word error rate (WER) on spontaneous speech.
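The replacement step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the correspondence table, its probabilities, and the function names are hypothetical, and the input is assumed to be already segmented into (stem, particle) pairs by an upstream morphological analyzer. Each written-style particle is swapped for a spoken-style variant sampled from the table, and particles not in the table are left unchanged.

```python
import random

# Hypothetical correspondence table: written particle -> [(spoken variant, probability)].
# The entries and weights are illustrative assumptions, not values taken from the paper.
PARTICLE_MAP = {
    "에게": [("한테", 0.7), ("에게", 0.3)],
    "와": [("하고", 0.5), ("랑", 0.3), ("와", 0.2)],
    "과": [("하고", 0.6), ("이랑", 0.2), ("과", 0.2)],
}


def sample_variant(particle, rng, table=PARTICLE_MAP):
    """Sample a spoken-style variant for a written particle; unknown particles are kept."""
    options = table.get(particle)
    if not options:
        return particle
    variants = [variant for variant, _ in options]
    weights = [weight for _, weight in options]
    return rng.choices(variants, weights=weights, k=1)[0]


def to_spoken_style(tokens, rng=None, table=PARTICLE_MAP):
    """Convert (stem, particle) pairs into spoken-style word forms.

    `tokens` is assumed to come from a morphological analyzer that is not shown here.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    return [stem + sample_variant(particle, rng, table) for stem, particle in tokens]


if __name__ == "__main__":
    sentence = [("친구", "에게"), ("책", "을"), ("주었다", "")]
    print(" ".join(to_spoken_style(sentence)))
```

Sampling from weighted variants, rather than always substituting one form, is one way to reflect a statistical analysis of particle usage; the weights would have to be estimated from real spoken transcripts.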
Pages: 205 - 210
Page count: 6
Related papers
44 in total
  • [1] A method for constructing Korean spontaneous spoken language corpus based on an imitation of abbreviated and transformed particles
    Hyok-Chol Ri
    Chol Kim
    Mok-Ran Jo
    International Journal of Speech Technology, 2022, 25 : 205 - 210
  • [2] A Corpus of Spontaneous Speech in Lectures : The KIT Lecture Corpus for Spoken Language Processing and Translation
    Cho, Eunah
    Fuenfer, Sarah
    Stueker, Sebastian
    Waibel, Alex
    LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2014, : 1554 - 1559
  • [3] AN ACTIVITY BASED SPOKEN LANGUAGE CORPUS OF NEPALI
    Allwood, Jens
    Regmi, Bhim Narayan
    Dhakhwa, Sagun
    Uranw, Ram Kisun
    2012 INTERNATIONAL CONFERENCE ON SPEECH DATABASE AND ASSESSMENTS, 2012, : 24 - 29
  • [4] AN ACTIVITY BASED SPOKEN LANGUAGE CORPUS OF LOHORUNG
    Allwood, Jens
    Regmi, Bhim Narayan
    Dhakhwa, Sagun
    2013 INTERNATIONAL CONFERENCE ORIENTAL COCOSDA HELD JOINTLY WITH 2013 CONFERENCE ON ASIAN SPOKEN LANGUAGE RESEARCH AND EVALUATION (O-COCOSDA/CASLRE), 2013,
  • [5] Text Implicates Prosodic Ambiguity: A Corpus for Intention Identification of the Korean Spoken Language
    Cho, Won Ik
    Kim, Nam Soo
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2023, 22 (01)
  • [6] Annotations and tools for an activity based Spoken Language Corpus
    Allwood, J
    Grönqvist, L
    Ahlsén, E
    Gunnarsson, M
    CURRENT AND NEW DIRECTIONS IN DISCOURSE AND DIALOGUE, 2003, 22 : 1 - 18
  • [7] DESIGNING A KOREAN FRENCH-LEARNERS' SPEECH CORPUS (KFLSC) FOR SPOKEN LANGUAGE ASSESSMENT
    Park, Soeun
    Chun, Jihye
    Kim, Mi Hyun
    Lee, Hyunjoo
    Lee, Seong Heon
    Kim, Sunhee
    2022 25TH CONFERENCE OF THE ORIENTAL COCOSDA INTERNATIONAL COMMITTEE FOR THE CO-ORDINATION AND STANDARDISATION OF SPEECH DATABASES AND ASSESSMENT TECHNIQUES (O-COCOSDA 2022), 2022,
  • [8] The Operation of computer technology in corpus-based spoken language
    Jie, Chen
    Jing, Chen
    Open Automation and Control Systems Journal, 2013, 5 (01): : 45 - 50
  • [9] The Operation of Cool Edit Pro in Corpus-based Spoken Language
    Jie, Chen
    Jing, Chen
    MANUFACTURING PROCESS AND EQUIPMENT, PTS 1-4, 2013, 694-697 : 2383 - +
  • [10] University Language: A Corpus-based Study of Spoken and Written Registers
    Fitzsimmons-Doolan, Shannon
    CORPORA, 2009, 4 (02) : 213 - 216