Balanced corpus of contemporary written Japanese

Cited by: 96
Authors
Maekawa, Kikuo [1]
Yamazaki, Makoto [1]
Ogiso, Toshinobu [1]
Maruyama, Takehiko [1]
Ogura, Hideki [2]
Kashino, Wakako [1]
Koiso, Hanae [3]
Yamaguchi, Masaya [1]
Tanaka, Makiro [1]
Den, Yasuharu [4]
Affiliations
[1] Natl Inst Japanese Language & Linguist NINJAL, Dept Corpus Studies, Tokyo, Japan
[2] Ritsumeikan Univ, Coll Letters, Kusatsu, Japan
[3] NINJAL, Dept Linguist Theory & Struct, Tokyo, Japan
[4] Chiba Univ, Fac Letters, Chiba, Japan
Keywords
BCCWJ; Japanese; Balanced corpus; Design; Annotation; Dual POS analysis; Evaluation; Shonagon; Chunagon
DOI
10.1007/s10579-013-9261-0
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
The Balanced Corpus of Contemporary Written Japanese (BCCWJ) is Japan's first 100-million-word balanced corpus. It consists of three subcorpora (publication subcorpus, library subcorpus, and special-purpose subcorpus) and covers a wide range of text registers, including books in general, magazines, newspapers, governmental white papers, best-selling books, an internet bulletin board, a blog, school textbooks, minutes of the National Diet, publicity newsletters of local governments, laws, and poetry verses. Random sampling is used wherever possible to maximize the representativeness of the corpus. The corpus is annotated for dual POS analysis, document structure, and bibliographical information. The BCCWJ is currently accessible in three different ways, including Chunagon, a web-based interface to the dual POS analysis data. Lastly, results of a pilot evaluation of the corpus with respect to textual diversity are reported. The analyses cover POS distribution, word-class distribution, entropy of orthography, sentence length, and variation of the adjective predicate. High textual diversity is observed in all of these analyses.
Pages: 345-371
Number of pages: 27