Feature selection based on feature interactions with application to text categorization

Cited by: 63
Authors
Tang, Xiaochuan [1, 2]
Dai, Yuanshun [2]
Xiang, Yanping [2]
Affiliations
[1] Chengdu Univ Technol, Sch Cyber Secur, Chengdu 610059, Sichuan, Peoples R China
[2] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 611731, Sichuan, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Feature selection; Feature interaction; Mutual information; Joint mutual information; Text categorization; MUTUAL INFORMATION; FRAMEWORK;
DOI
10.1016/j.eswa.2018.11.018
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Feature selection is an important preprocessing approach for machine learning and text mining. It reduces the dimensions of high-dimensional data. A popular approach is based on information theoretic measures. Most of the existing methods use two- and three-dimensional mutual information terms, which are ineffective in detecting higher-order feature interactions. To fill this gap, we employ two- through five-way interactions for feature selection. We first identify a relaxed assumption to decompose the mutual information-based feature selection problem into a sum of low-order interactions. A direct calculation of the decomposed interaction terms is computationally expensive. We employ five-dimensional joint mutual information, a computationally efficient measure, to estimate the interaction terms. We use the 'maximum of the minimum' nonlinear approach to avoid overestimating feature significance. We also apply the proposed method to text categorization. To evaluate the performance of the proposed method, we compare it with eleven popular feature selection methods on eighteen benchmark datasets and seven text categorization datasets. Experimental results with four different types of classifiers provide concrete evidence that higher-order interactions are effective in improving feature selection methods. (C) 2018 Elsevier Ltd. All rights reserved.
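Illustration (not part of the record): the abstract names two ingredients, joint mutual information between a group of features and the class, and a 'maximum of the minimum' selection rule. The sketch below is a minimal, hedged illustration of how those two ideas can be combined in a greedy selector for discrete features. It is not the authors' algorithm: the function names, the plug-in estimator, and the `max_group` parameter are assumptions made for this example, and the abstract does not say how the paper forms its feature groups.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits for two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    # sum over observed pairs of p(x, y) * log2(p(x, y) / (p(x) * p(y)))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def joint_variable(columns):
    """Merge several discrete columns into one column of tuples."""
    return list(zip(*columns))


def select_features(features, labels, k, max_group=1):
    """Greedy 'maximum of the minimum' selection (illustrative sketch).

    features:  dict mapping feature name -> list of discrete values
    labels:    list of class labels, same length as each feature column
    max_group: hypothetical knob; how many already-selected features are
               joined with the candidate when computing joint mutual
               information (1 gives a pairwise, three-variable score).
    """
    remaining = set(features)
    # Start with the single most relevant feature (ties broken by name).
    first = max(sorted(remaining),
                key=lambda f: mutual_information(features[f], labels))
    selected = [first]
    remaining.remove(first)

    while remaining and len(selected) < k:
        size = min(max_group, len(selected))

        def score(f):
            # Worst-case joint relevance of the candidate over all groups of
            # already-selected features: the "minimum" part of the rule.
            return min(
                mutual_information(
                    joint_variable([features[f]] + [features[s] for s in group]),
                    labels,
                )
                for group in combinations(selected, size)
            )

        # The "maximum" part: keep the candidate with the best worst case.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Under these assumptions, setting `max_group=3` would score a candidate jointly with three selected features and the class, which involves five variables in total. The number of groups grows combinatorially with the size of the selected set, so this brute-force form is only practical for small examples.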
Pages: 207-216
Number of pages: 10
Related papers
50 in total
  • [21] A novel feature selection algorithm for text categorization
    Shang, Wenqian
    Huang, Houkuan
    Zhu, Haibin
    Lin, Yongmin
    Qu, Youli
    Wang, Zhihai
    EXPERT SYSTEMS WITH APPLICATIONS, 2007, 33 (01) : 1 - 5
  • [22] A Method of Feature Selection Based on Word2Vec in Text Categorization
    Tian, Wenfeng
    Li, Jun
    Li, Hongguang
    2018 37TH CHINESE CONTROL CONFERENCE (CCC), 2018, : 9452 - 9455
  • [23] A two-stage feature selection method for text categorization
    Meng, Jiana
    Lin, Hongfei
    Yu, Yuhai
    COMPUTERS & MATHEMATICS WITH APPLICATIONS, 2011, 62 (07) : 2793 - 2800
  • [24] Trigonometric comparison measure: A feature selection method for text categorization
    Kim, Kyoungok
    Zzang, See Young
    DATA & KNOWLEDGE ENGINEERING, 2019, 119 : 1 - 21
  • [25] GU metric - A new feature selection algorithm for text categorization
    Uchyigit, Gulden
    Clark, Keith
    ICEIS 2007: PROCEEDINGS OF THE NINTH INTERNATIONAL CONFERENCE ON ENTERPRISE INFORMATION SYSTEMS: ARTIFICIAL INTELLIGENCE AND DECISION SUPPORT SYSTEMS, 2007, : 399 - 402
  • [26] An extended document frequency metric for feature selection in text categorization
    Xu, Yan
    Wang, Bin
    Li, JinTao
    Jing, Hongfang
    INFORMATION RETRIEVAL TECHNOLOGY, 2008, 4993 : 71 - +
  • [27] COMPARATIVE STUDY OF FEATURE SELECTION APPROACHES FOR URDU TEXT CATEGORIZATION
    Zia, Tehseen
    Akhter, Muhammad Pervez
    Abbas, Qaiser
    MALAYSIAN JOURNAL OF COMPUTER SCIENCE, 2015, 28 (02) : 93 - 109
  • [28] Study and Analyze on Feature Selection in Text Categorization for Engineering Domain
    Wu Junyun
    EMERGING MATERIALS AND MECHANICS APPLICATIONS, 2012, 487 : 383 - 386
  • [29] Improved Comprehensive Measurement Feature Selection Method for Text Categorization
    Feng, LiZhou
    Zuo, WanLi
    Wang, YouWei
    2015 INTERNATIONAL CONFERENCE ON NETWORK AND INFORMATION SYSTEMS FOR COMPUTERS (ICNISC), 2015, : 125 - 128
  • [30] Toward Optimal Feature Selection in Naive Bayes for Text Categorization
    Tang, Bo
    Kay, Steven
    He, Haibo
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2016, 28 (09) : 2508 - 2521