FEATURE SELECTION AND CLASSIFICATION INTEGRATED METHOD FOR IDENTIFYING CITED TEXT SPANS FOR CITANCES ON IMBALANCED DATA

Cited by: 0
Authors
Yeh, Jen-Yuan [1 ]
Tsai, Cheng-Jung [2 ]
Hsu, Tien-Yu [3 ]
Lin, Jung-Yi [4 ]
Cheng, Pei-Cheng [5 ]
Affiliations
[1] Natl Museum Nat Sci, Visitor Serv, Dept Operat, Collect & Informat Management, Taichung 40453, Taiwan
[2] Natl Changhua Univ Educ, Grad Inst Stat & Informat Sci, Changhua 50007, Taiwan
[3] Natl Museum Nat Sci, Dept Sci Educ, Taichung 40453, Taiwan
[4] Hon Hai Precis Ind Co Ltd Foxconn, IP Affairs Div, Taipei 11492, Taiwan
[5] Chien Hsin Univ Sci & Technol, Dept Informat Management, Taoyuan 32097, Taiwan
Keywords
Citation analysis; cited text spans identification; feature selection; classification; class imbalance; performance evaluation; scientific paper summarization;
DOI
10.22452/mjcs.vol34no4.3
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent studies in scientific paper summarization have explored a new form of structured summary for a reference paper, which groups all cited and citing sentences together by facet. This involves three main tasks: (1) identifying cited text spans for citances (i.e., citing sentences), (2) classifying their discourse facets, and (3) generating a structured summary from the cited text spans and citances. This paper focuses on the first task and approaches it as a binary classification problem that distinguishes relevant pairs of citances and reference sentences from irrelevant pairs. We propose a new method that integrates feature selection and classification techniques to enhance classification performance. The proposed method investigates combinations of six feature selection methods (χ² statistic, Information Gain, Gain Ratio, Relief-F, Significance Attribute Evaluation, and Symmetrical Uncertainty) and five classification algorithms (k-Nearest Neighbors, Decision Tree, Support Vector Machine, Naive Bayes, and Random Forest). Additionally, to address imbalanced data during training, we apply SMOTE (Synthetic Minority Over-sampling Technique) to introduce synthetic biases towards the minority class. Experiments are conducted on the CL-SciSumm corpora to compare the effect of feature selection applied to classification. The results reveal the benefits of feature selection in significantly boosting the F1 score, and show that our method is competitive with the state-of-the-art methods in the CL-SciSumm evaluations.
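As a rough illustration of two ideas named in the abstract, the sketch below shows SMOTE-style oversampling (a synthetic minority point is placed on the line segment between a minority sample and one of its nearest minority neighbors) and the F1 score used for evaluation. It is a simplified, standard-library-only sketch and not the authors' implementation; the function names `smote_sketch` and `f1_score`, the parameter defaults, and the toy data are assumptions made for this example.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Simplified illustration of the SMOTE idea (hypothetical helper, not the
    paper's code): pick a minority sample, pick one of its k nearest minority
    neighbours, and place a new point at a random position between them.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy feature vectors for the minority class (hypothetical data).
    relevant_pairs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for point in smote_sketch(relevant_pairs, n_new=3, k=2, seed=42):
        print(point)
    print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

In a setup like the one the abstract describes, oversampling of this kind would be applied to the training split only, so that evaluation still reflects the original class distribution.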
Pages: 355-373
Number of pages: 19
Related papers
50 in total
  • [1] A feature selection method to handle imbalanced data in text classification
    Chang, Fengxiang
    Guo, Jun
    Xu, Weiran
    Yao, Kejun
    Journal of Digital Information Management, 2015, 13 (03) : 169 - 175
  • [2] An Embedded Feature Selection Method for Imbalanced Data Classification
    Liu, Haoyue
    Zhou, MengChu
    Liu, Qing
    IEEE-CAA JOURNAL OF AUTOMATICA SINICA, 2019, 6 (03) : 703 - 715
  • [3] An Embedded Feature Selection Method for Imbalanced Data Classification
    Haoyue Liu
    MengChu Zhou
    Qing Liu
    IEEE/CAA Journal of Automatica Sinica, 2019, 6 (03) : 703 - 715
  • [4] A Classification Method Based on Feature Selection for Imbalanced Data
    Liu, Yi
    Wang, Yanzhen
    Ren, Xiaoguang
    Zhou, Hao
    Diao, Xingchun
    IEEE ACCESS, 2019, 7 : 81794 - 81807
  • [5] Feature selection method on imbalanced text
    Liao, Yi-Xing
    Pan, Xue-Zeng
    Dianzi Keji Daxue Xuebao/Journal of the University of Electronic Science and Technology of China, 2012, 41 (04) : 592 - 595
  • [6] Optimal Feature Selection for Imbalanced Text Classification
    Khurana A.
    Verma O.P.
    IEEE Transactions on Artificial Intelligence, 2023, 4 (01) : 135 - 147
  • [7] On Identifying Cited Texts for Citances and Classifying Their Discourse Facets by Classification Techniques
    Yeh, Jen-Yuan
    Hsu, Tien-Yu
    Tsai, Cheng-Jung
    Cheng, Pei-Cheng
    Lin, Jung-Yi
    JOURNAL OF INFORMATION SCIENCE AND ENGINEERING, 2019, 35 (01) : 61 - 86
  • [8] Comparison of metrics for feature selection in imbalanced text classification
    Ogura, Hiroshi
    Amano, Hiromi
    Kondo, Masato
    EXPERT SYSTEMS WITH APPLICATIONS, 2011, 38 (05) : 4978 - 4989
  • [9] Ensemble System for Identification of Cited Text Spans: Based on Two Steps of Feature Selection
    Xu, Jin
    Zhang, Chengzhi
    Ma, Shutian
    INFORMATION RETRIEVAL (CCIR 2019), 2019, 11772 : 95 - 107
  • [10] Imbalanced Data Classification Based on Feature Selection Techniques
    Ksieniewicz, Pawel
    Wozniak, Michal
    INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING (IDEAL 2018), PT II, 2018, 11315 : 296 - 303