Analyzing and identifying multiword expressions in spoken language

Cited by: 3
Authors
Strik, Helmer [1]
Hulsbosch, Micha [1]
Cucchiarini, Catia [1]
Affiliation
[1] Radboud Univ Nijmegen, Dept Linguist, Sect Language & Speech, NL-6500 HD Nijmegen, Netherlands
Keywords
Multiword expressions; Spoken language; Transcription; Pronunciation reduction; Identification; WORD; LEARNERS; FLUENCY;
DOI
10.1007/s10579-009-9095-y
Chinese Library Classification
TP39 [Computer applications];
Subject classification code
081203 ; 0835 ;
Abstract
The present paper investigates multiword expressions (MWEs) in spoken language and possible ways of identifying MWEs automatically in speech corpora. Two MWEs that emerged from previous studies and that occur frequently in Dutch are analyzed to study their pronunciation characteristics and compare them to those of other utterances in a large speech corpus. The analyses reveal that these MWEs display extreme pronunciation variation and reduction, i.e., many phonemes and even syllables are deleted. Several measures of pronunciation reduction are calculated for these two MWEs and for all other utterances in the corpus. Five of these measures are more than twice as high for the MWEs, thus indicating considerable reduction. One overall measure of pronunciation deviation is then calculated and used to automatically identify MWEs in a large speech corpus. The results show that neither this overall measure nor frequency of co-occurrence alone is suitable for identifying MWEs. The best results are obtained by using a metric that combines overall pronunciation reduction with weighted frequency. In this way, recurring "islands of pronunciation reduction" that contain (potential) MWEs can be identified in a large speech corpus.
Pages: 41 - 58
Page count: 18
Related papers
50 in total
  • [21] Annotation of multiword expressions in the Prague dependency treebank
    Bejcek, Eduard
    Stranak, Pavel
    LANGUAGE RESOURCES AND EVALUATION, 2010, 44 (1-2) : 7 - 21
  • [22] A ROMANIAN CORPUS ANNOTATED WITH VERBAL MULTIWORD EXPRESSIONS
    Mititelu, Verginica Barbu
    Rizea, Monica-Mihaela
    Ionescu, Mihaela
    Onofrei, Mihaela
    Irimia, Elena
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE 'LINQUISTIC RESOURCES AND TOOLS FOR PROCESSING THE ROMANIAN LANGUAGE', 2016, : 193 - 195
  • [23] A QUANTITATIVE STUDY OF THE MORPHOLOGY OF ITALIAN MULTIWORD EXPRESSIONS
    Nissim, Malvina
    Zaninello, Andrea
    LINGUE E LINGUAGGIO, 2011, 10 (02) : 283 - 299
  • [24] Modeling Semantic Compositionality of Croatian Multiword Expressions
    Snajder, Jan
    Almic, Petra
    INFORMATICA-JOURNAL OF COMPUTING AND INFORMATICS, 2015, 39 (03) : 301 - 309
  • [25] Dictionary of Bulgarian Multiword Expressions - Advances and Prospects
    Stoyanova, Ivelina
    Todorova, Maria
    Leseva, Svetlozara
    PROCEEDINGS OF THE INTERNATIONAL JUBILEE CONFERENCE OF THE INSTITUTE FOR BULGARIAN LANGUAGE, VOL 1, 2017, : 311 - 320
  • [26] DuELME: a Dutch electronic lexicon of multiword expressions
    Gregoire, Nicole
    LANGUAGE RESOURCES AND EVALUATION, 2010, 44 (1-2) : 23 - 39
  • [27] Annotation of multiword expressions in the Prague dependency treebank
    Eduard Bejček
    Pavel Straňák
    Language Resources and Evaluation, 2010, 44 : 7 - 21
  • [28] A Romanian Treebank Annotated with Verbal Multiword Expressions
    Mititelu, Verginica Barbu
    Cristescu, Mihaela
    Mitrofan, Maria
    Zgreaban, Bianca-Madalina
    Barbulescu, Elena-Andreea
    PROCEEDINGS OF THE FIFTH INTERNATIONAL CONFERENCE COMPUTATIONAL LINGUISTICS IN BULGARIA, CLIB 2022, 2022, : 137 - 145
  • [29] DuELME: a Dutch electronic lexicon of multiword expressions
    Nicole Grégoire
    Language Resources and Evaluation, 2010, 44 : 23 - 39
  • [30] Alignment-based extraction of multiword expressions
    Caseli, Helena de Medeiros
    Ramisch, Carlos
    Volpe Nunes, Maria das Gracas
    Villavicencio, Aline
    LANGUAGE RESOURCES AND EVALUATION, 2010, 44 (1-2) : 59 - 77