Turkish Labeled Text Corpus

Cited by: 0
Authors
Ozturk, Secil [1]
Sankur, Bulent [1]
Gungor, Tunga [2]
Yilmaz, Mustafa Berkay [3]
Koroglu, Bilge [4]
Agin, Onur [4]
Isbilen, Mustafa [4]
Ulas, Cagdas [4]
Ahat, Mehmet [4]
Affiliations
[1] Bogazici Univ, Elekt Elekt Muhendisligi Bolumleri, TR-80815 Bebek, Turkey
[2] Bogazici Univ, Bilgisayar Muhendisligi Bolumleri, TR-80815 Bebek, Turkey
[3] Sabanci Univ, Bilgisayar Bilimi & Muhendisligi Bolumu, Istanbul, Turkey
[4] Yapi & Kredi Bankasi AS, Ar Ge Bolumu, Istanbul, Istanbul Provin, Turkey
Source
2014 22ND SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE (SIU) | 2014
Keywords
Corpus; Turkish; Paper; Abstract; Natural Language Processing; NLP; Classification; Latent Dirichlet Allocation; Term Frequency; Inverse Document Frequency; TF-IDF
DOI
Not available
Chinese Library Classification (CLC) code
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline classification code
0808 ; 0809 ;
Abstract
A labeled text corpus consisting of the titles, abstracts and keywords of Turkish papers has been collected. The corpus covers 35 disciplines, with 200 documents per discipline. This study describes how the corpus was collected and what it contains. Classification performance on the corpus is compared for two feature types: Term Frequency - Inverse Document Frequency (TF-IDF) weights and the topic probabilities of Latent Dirichlet Allocation (LDA). The corpus is shared as open source so that it can be used in natural language processing applications for academic purposes.
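As a rough illustration of the TF-IDF features named in the abstract, the sketch below computes one common TF-IDF variant (relative term frequency times the natural log of inverse document frequency) in plain Python. The function name `tf_idf` and the toy documents are hypothetical and are not taken from the corpus; the record does not say which TF-IDF weighting or tokenization the authors used.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length,
    idf = ln(N / document frequency). Other weightings exist.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy, made-up documents for illustration only (not from the corpus).
docs = [
    "sinyal isleme sinyal".split(),
    "dogal dil isleme".split(),
    "dogal dil siniflandirma".split(),
]
weights = tf_idf(docs)
```

In this variant a term that is frequent in one document and rare across the collection receives a high weight, while a term that occurs in every document receives a weight of zero.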
Pages: 1395-1398
Number of pages: 4