WordSeg: Standardizing unsupervised word form segmentation from text

Cited by: 16
Authors
Bernard, Mathieu [1, 2]
Thiolliere, Roland [1]
Saksida, Amanda [3]
Loukatou, Georgia R. [1]
Larsen, Elin [1, 2]
Johnson, Mark [4]
Fibla, Laia [1, 5]
Dupoux, Emmanuel [1, 2]
Daland, Robert [6]
Cao, Xuan Nga [1, 2]
Cristia, Alejandrina [1]
Affiliations
[1] PSL Res Univ, LSCP, Dept Etud Cognit, ENS,EHESS,CNRS, 29 Rue Ulm, F-75005 Paris, France
[2] INRIA, Villers Les Nancy, France
[3] Inst Maternal & Child Hlth IRCCS Burlo Garofolo T, Trieste, Italy
[4] Macquarie Univ, Sydney, NSW, Australia
[5] Univ East Anglia, Norwich, Norfolk, England
[6] Univ Calif Los Angeles, Los Angeles, CA USA
Funding
European Research Council
Keywords
Unsupervised word discovery; First language acquisition; Natural language processing; Cumulative science; ACQUISITION; SPEECH; CORPUS
DOI
10.3758/s13428-019-01223-3
Chinese Library Classification number
B841 [Research methods in psychology]
Subject classification code
040201
Abstract
A basic task in first language acquisition likely involves discovering the boundaries between words or morphemes in input where these basic units are not overtly segmented. A number of unsupervised learning algorithms have been proposed in the last 20 years for these purposes, some of which have been implemented computationally, but whose results remain difficult to compare across papers. We created a tool that is open source, enables reproducible results, and encourages cumulative science in this domain. WordSeg has a modular architecture: It combines a set of corpora description routines, multiple algorithms varying in complexity and cognitive assumptions (including several that were not publicly available, or insufficiently documented), and a rich evaluation package. In the paper, we illustrate the use of this package by analyzing a corpus of child-directed speech in various ways, which further allows us to make recommendations for experimental design of follow-up work. Supplementary materials allow readers to reproduce every result in this paper, and detailed online instructions further enable them to go beyond what we have done. Moreover, the system can be installed within container software that ensures a stable and reliable environment. Finally, by virtue of its modular architecture and transparency, WordSeg can work as an open-source platform, to which other researchers can add their own segmentation algorithms.
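To make the evaluation step mentioned in the abstract concrete, here is a minimal sketch of boundary-level precision, recall, and F-score for a segmented corpus. It is an illustration only, not WordSeg's own API: the function names `boundaries` and `boundary_scores` are hypothetical, the toy utterances are invented for the example, and words are treated as strings of characters rather than phonemes.

```python
from typing import List, Set, Tuple

Utterance = List[str]  # one utterance as a list of word forms


def boundaries(words: Utterance) -> Set[int]:
    """Return utterance-internal boundary offsets, counted in symbols."""
    positions: Set[int] = set()
    offset = 0
    for word in words[:-1]:  # the utterance edge is not a learned boundary
        offset += len(word)
        positions.add(offset)
    return positions


def boundary_scores(
    gold: List[Utterance], predicted: List[Utterance]
) -> Tuple[float, float, float]:
    """Compute boundary precision, recall and F-score over a corpus."""
    hits = n_predicted = n_gold = 0
    for gold_utt, pred_utt in zip(gold, predicted):
        gold_b, pred_b = boundaries(gold_utt), boundaries(pred_utt)
        hits += len(gold_b & pred_b)
        n_predicted += len(pred_b)
        n_gold += len(gold_b)
    precision = hits / n_predicted if n_predicted else 0.0
    recall = hits / n_gold if n_gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy data (invented): gold segmentation vs. a hypothetical algorithm's output.
gold = [["the", "dog"], ["a", "dog", "ran"]]
predicted = [["thedog"], ["a", "do", "gran"]]

precision, recall, f_score = boundary_scores(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f-score={f_score:.2f}")
```

Scoring boundaries rather than whole words gives partial credit to a segmentation that places some cuts correctly; a stricter token-level score would require both edges of a word to match.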
Pages: 264-278
Page count: 15