Weighted Document Frequency for Feature Selection in Text Classification

Cited by: 0
Authors
Li, Baoli [1]
Yan, Qiuling [1]
Xu, Zhenqiang [1]
Wang, Guicai [1]
Affiliation
[1] Henan University of Technology, College of Information Science and Engineering, Zhengzhou, People's Republic of China
Source
Proceedings of the 2015 International Conference on Asian Language Processing | 2015
Keywords
Document Frequency; Weighted Document Frequency; Feature Selection; Text Classification; Text Categorization;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Prior research has validated Document Frequency (DF) as a simple yet quite effective measure for feature selection in text classification. DF counts how many documents in a collection contain a feature, which can be a word, a phrase, an n-gram, or a specially derived attribute. The counting is binary: each document in which a feature appears increases that feature's DF by one. This traditional DF metric therefore considers only whether a feature appears in a document, not how important the feature is in that document, so a document frequency counted this way is likely to introduce considerable noise. To reduce such noise, a weighted document frequency (WDF) is proposed. Extensive experiments on two text classification data sets demonstrate the effectiveness of the proposed measure.
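The contrast between binary DF and a weighted variant can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not say how WDF weights a feature within a document, so the sketch assumes the weight is the feature's share of the document's tokens (relative term frequency). The function names and the toy documents are hypothetical and do not come from the paper.

```python
from collections import Counter, defaultdict


def document_frequency(docs):
    """Binary DF: every document that contains a feature adds exactly 1."""
    df = defaultdict(float)
    for tokens in docs:
        for feature in set(tokens):  # set() ignores repeats within a document
            df[feature] += 1.0
    return dict(df)


def weighted_document_frequency(docs):
    """Illustrative WDF: every document adds a weight in (0, 1].

    ASSUMPTION: the weight is the feature's relative term frequency in the
    document. The paper's actual weighting scheme is not given in the abstract.
    """
    wdf = defaultdict(float)
    for tokens in docs:
        if not tokens:
            continue  # an empty document contributes nothing
        counts = Counter(tokens)
        for feature, count in counts.items():
            wdf[feature] += count / len(tokens)
    return dict(wdf)


# Toy tokenized documents (hypothetical data).
docs = [
    ["wheat", "price", "wheat", "export"],
    ["price", "stock", "market", "stock"],
    ["wheat", "the", "market", "the"],
]

df = document_frequency(docs)
wdf = weighted_document_frequency(docs)

# Rank features by each score, highest first.
for name, scores in (("DF", df), ("WDF", wdf)):
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    print(name, ranked)
```

Under this assumed weighting, "wheat" and "price" both have a binary DF of 2, but "wheat" receives a higher weighted score because it dominates one of the documents it appears in. That is the kind of distinction a purely binary count cannot make.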
Pages: 132-135
Number of pages: 4