Rethinking The Corpus: Moving towards Dynamic Linguistic Resources

Cited by: 0
Author
Rosenberg, Andrew [1]
Affiliation
[1] CUNY Queens Coll, Dept Comp Sci, Flushing, NY 11367 USA
Source
13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3 | 2012
Keywords
Linguistic Resources; Opinion paper;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The corpus is an invaluable resource in Spoken and Natural Language Processing. Consistent data sets have allowed for empirical evaluation of competing algorithms. The sharing of high-quality annotated linguistic data has enabled participation and experimentation by a wide range of researchers. However, despite dubbing these annotations as "gold-standard", many corpora contain labeling errors and idiosyncrasies. The current view of the corpus as a static resource makes correction of errors and other modifications prohibitively difficult. In this paper, a perspective of the corpus as dynamically changing is advanced. We highlight the problems of the static view of the corpus through case studies of the Penn Treebank, Switchboard, Hub-4 and Boston University Radio News Corpus. We propose the use of version control software as a mechanism to facilitate this dynamic view.
Pages: 1390-1393
Page count: 4