A fast and flexible architecture for very large word n-gram datasets

Times cited: 1
Authors
Flor, Michael [1 ]
Affiliation
[1] Educational Testing Service, NLP & Speech Group, Princeton, NJ 08541, USA
Keywords
Compendex
DOI
10.1017/S1351324911000349
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper presents TrendStream, a versatile architecture for very large word n-gram datasets. Designed for speed, flexibility, and portability, TrendStream uses a novel trie-based architecture, features lossless compression, and provides optimizations for both speed and memory use. In addition to literal queries, it also supports fast pattern-matching searches (with wildcards or regular expressions) on the same data structure, without any additional indexing. Language models are updateable directly in the compiled binary format, allowing rapid encoding of existing tabulated collections, incremental generation of n-gram models from streaming text, and merging of encoded compiled files. This architecture offers flexible choices for loading and memory utilization: fast memory-mapping of a multi-gigabyte model, or on-demand partial data loading with very modest memory requirements. The implemented system runs successfully on several platforms and operating systems, even when the n-gram model file is much larger than available memory. Experimental evaluation results are presented for the Google Web1T collection and the Gigaword corpus.
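The abstract does not include code, but the core idea it describes, answering both literal and wildcard n-gram queries from a single word-level trie, can be illustrated with a minimal, hypothetical Python sketch. All names below (NGramTrie, add, lookup, match) are invented for illustration and are not TrendStream's API; the sketch is a plain in-memory trie and omits the paper's compressed binary encoding, memory-mapping, incremental updating, and file merging.

# Hypothetical sketch (not TrendStream's actual implementation): a small
# in-memory trie keyed on words, storing a count at each node, with a
# wildcard ('*') query that matches any single word at that position.

class NGramTrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}   # word -> NGramTrieNode
        self.count = 0       # frequency of the n-gram ending at this node


class NGramTrie:
    def __init__(self):
        self.root = NGramTrieNode()

    def add(self, ngram, count=1):
        """Insert or update an n-gram (a tuple of words) with its count."""
        node = self.root
        for word in ngram:
            node = node.children.setdefault(word, NGramTrieNode())
        node.count += count

    def lookup(self, ngram):
        """Literal query: return the stored count, or 0 if absent."""
        node = self.root
        for word in ngram:
            node = node.children.get(word)
            if node is None:
                return 0
        return node.count

    def match(self, pattern):
        """Wildcard query: '*' matches any single word at that position.
        Yields (ngram, count) pairs, walking only matching trie branches."""
        def walk(node, i, prefix):
            if i == len(pattern):
                if node.count:
                    yield tuple(prefix), node.count
                return
            if pattern[i] == "*":
                for word, child in node.children.items():
                    yield from walk(child, i + 1, prefix + [word])
            else:
                child = node.children.get(pattern[i])
                if child is not None:
                    yield from walk(child, i + 1, prefix + [pattern[i]])
        yield from walk(self.root, 0, [])


# Example usage:
trie = NGramTrie()
trie.add(("new", "york", "city"), 42)
trie.add(("new", "york", "times"), 17)
print(trie.lookup(("new", "york", "city")))     # 42
print(list(trie.match(("new", "york", "*"))))   # both trigrams

Because a wildcard only expands the branches at its own position, pattern matching traverses just the portion of the trie that can still match, which is in keeping with the abstract's claim that wildcard searches run on the same data structure without any additional indexing.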
Pages: 61-93
Number of pages: 33