Speech corpora subset selection based on time-continuous utterances features

Cited by: 38
Authors
Dong, Luobing [1]
Guo, Qiumin [2]
Wu, Weili [3]
Affiliations
[1] Xidian Univ, Sch Comp Sci & Technol, 2 South Taibai Rd, Xian 710071, Shaanxi, Peoples R China
[2] Beijing Univ Chem Technol, Sch Sci, Beijing, Peoples R China
[3] Univ Texas Dallas, Dept Comp Sci, Dallas, TX USA
Funding
National Science Foundation (USA);
Keywords
Speech corpora; Subset selection; Time-continuous utterances;
DOI
10.1007/s10878-018-0350-2
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification codes
081203; 0835;
Abstract
An extremely large corpus with rich acoustic properties is very useful for training new speech recognition and semantic analysis models. However, it also poses a problem, because the complexity of acoustic model training usually depends on the size of the corpora. In this paper, we propose a corpora subset selection method that considers data contributions from time-continuous utterances and multi-label constraints that are not limited to single-scale metrics. Our goal is to extract a sufficiently rich subset from large corpora under certain meaningful constraints. In addition, taking into account the uniform coverage of the target subset and its internal properties, we design a constrained subset selection algorithm. Specifically, a fast subset selection algorithm is designed by introducing n-gram models. Experiments on a very large database of real speech corpora validate the effectiveness of our method.
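The abstract does not spell out the algorithm, so the following is only an illustrative sketch of one common way to select a corpus subset by n-gram coverage: a greedy loop that repeatedly picks the utterance contributing the most n-grams not yet covered, up to a size budget. It is not the authors' method. The function names (`ngrams`, `greedy_select`), the token representation, the `budget` constraint, and the bigram default are all assumptions made for illustration, and the time-continuous and multi-label constraints mentioned in the abstract are not modeled.

```python
from typing import Dict, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> Set[NGram]:
    """Return the set of n-grams occurring in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def greedy_select(utterances: Dict[str, List[str]], budget: int, n: int = 2) -> List[str]:
    """Pick at most `budget` utterance ids by greedy n-gram coverage.

    Hypothetical helper: at each step it takes the utterance that adds the
    largest number of n-grams not covered by the utterances chosen so far.
    """
    grams = {uid: ngrams(tokens, n) for uid, tokens in utterances.items()}
    covered: Set[NGram] = set()
    selected: List[str] = []
    remaining = set(grams)
    while remaining and len(selected) < budget:
        # Iterate in sorted order so ties go to the smallest id.
        best = max(sorted(remaining), key=lambda uid: len(grams[uid] - covered))
        if not grams[best] - covered:
            break  # no remaining utterance adds a new n-gram
        selected.append(best)
        covered |= grams[best]
        remaining.remove(best)
    return selected


# Toy corpus: tokens could be words or phone labels (an assumption here).
toy = {
    "u1": ["a", "b", "c"],
    "u2": ["a", "b"],
    "u3": ["c", "d", "e"],
}
chosen = greedy_select(toy, budget=3)
```

On the toy corpus, `u2` is never chosen because its only bigram is already covered by `u1`, which is the kind of redundancy a coverage-based selection is meant to remove.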
Pages: 1237-1248
Number of pages: 12