A paper-text perspective: Studies on the influence of feature granularity for Chinese short-text-classification in the Big Data era

Cited by: 8
Authors
Wang, Hao [1]
Deng, Sanhong [1]
Affiliations
[1] Nanjing Univ, Sch Informat Management, Nanjing, Jiangsu, Peoples R China
Keywords
Categories discriminative capacity; Chinese character features; Chinese short-Text-Classification; Feature granularity; Feature optimization; FEATURE-SELECTION METHOD; CATEGORIZATION; REPRESENTATION; INFORMATION; REGRESSION; ALGORITHM; MODEL;
DOI
10.1108/EL-09-2016-0192
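Chinese Library Classification (CLC) number
G25 [Library science, librarianship]; G35 [Information science, information work];
Discipline classification code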
1205 ; 120501 ;
Abstract
Purpose - In the era of Big Data, network digital resources are growing rapidly, and short-text resources such as tweets, comments and messages are especially active. This study aims to compare the categories discriminative capacity (CDC) of Chinese language fragments of different granularities and to explore and verify the feasibility, rationality and effectiveness of low-granularity features, such as Chinese characters, in Chinese short-text classification (CSTC).
Design/methodology/approach - This study uses discipline classification of journal articles from CSSCI as a simulation environment. After sorting out the distribution rules of classification features at various granularities (keywords, terms and characters), the classification results obtained with the SVM algorithm are compared and evaluated from three angles: using the same experiment samples, testing before and after feature optimization, and introducing external data.
Findings - The granularity of a classification feature has an important impact on CSTC. In general, the larger the granularity, the better the classification result, and vice versa. However, a low-granularity feature is also feasible: its CDC can be improved by reasonable weight setting, and it can even outperform a high-granularity feature when classification precision, computational complexity and text coverage are considered together.
Originality/value - This is the first study to propose that Chinese characters are more suitable as descriptive features in CSTC than terms and keywords, and to demonstrate that the CDC of Chinese character features can be strengthened by using a mix of frequency and position as the weight.
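The idea of weighting Chinese character features by frequency and position can be illustrated with a minimal sketch. The abstract does not give the actual weighting formula, so everything here is an assumption made for illustration: the function name `char_features`, the field names, and the values in `POSITION_WEIGHT` are hypothetical, and the weight of a character is simply its count in each field multiplied by that field's position weight.

```python
from collections import Counter

# Hypothetical position weights. The abstract only says that frequency and
# position are mixed; these values are illustrative assumptions.
POSITION_WEIGHT = {"title": 2.0, "abstract": 1.0}


def char_features(fields):
    """Return a {character: weight} map for one document.

    `fields` maps a field name (e.g. "title") to its text. Each Chinese
    character adds the field's position weight once per occurrence, so the
    final weight combines frequency and position.
    """
    weights = Counter()
    for name, text in fields.items():
        w = POSITION_WEIGHT.get(name, 1.0)
        for ch in text:
            # Keep only characters in the CJK Unified Ideographs block.
            if "\u4e00" <= ch <= "\u9fff":
                weights[ch] += w
    return dict(weights)


doc = {"title": "文本分类", "abstract": "短文本分类的特征粒度"}
features = char_features(doc)
```

Characters that occur in both the title and the abstract end up with a higher weight than characters that occur only in the abstract. A map like this could then be turned into a vector and passed to a classifier such as the SVM named in the abstract.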
Pages: 689-708
Number of pages: 20