FEATURE SELECTION AND CLASSIFICATION INTEGRATED METHOD FOR IDENTIFYING CITED TEXT SPANS FOR CITANCES ON IMBALANCED DATA

Cited by: 0

Authors:

Yeh, Jen-Yuan ^{[1]}

Tsai, Cheng-Jung ^{[2]}

Hsu, Tien-Yu ^{[3]}

Lin, Jung-Yi ^{[4]}

Cheng, Pei-Cheng ^{[5]}

Affiliations:

[1] Natl Museum Nat Sci, Visitor Serv, Dept Operat, Collect & Informat Management, Taichung 40453, Taiwan

[2] Natl Changhua Univ Educ, Grad Inst Stat & Informat Sci, Changhua 50007, Taiwan

[3] Natl Museum Nat Sci, Dept Sci Educ, Taichung 40453, Taiwan

[4] Hon Hai Precis Ind Co Ltd Foxconn, IP Affairs Div, Taipei 11492, Taiwan

[5] Chien Hsin Univ Sci & Technol, Dept Informat Management, Taoyuan 32097, Taiwan

Source:

MALAYSIAN JOURNAL OF COMPUTER SCIENCE | 2021 / Vol. 34 / No. 04

Keywords:

Citation analysis; cited text spans identification; feature selection; classification; class imbalance; performance evaluation; scientific paper summarization

DOI:

10.22452/mjcs.vol34no4.3

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Recent studies in scientific paper summarization have explored a new form of structured summary for a reference paper, in which all cited and citing sentences are grouped by facet. This involves three main tasks: (1) identifying cited text spans for citances (i.e., citing sentences), (2) classifying their discourse facets, and (3) generating a structured summary from the cited text spans and citances. This paper focuses on the first task and treats it as a binary classification problem: distinguishing relevant pairs of citances and reference sentences from irrelevant pairs. We propose a new method that integrates feature selection and classification techniques to enhance classification performance. The proposed method investigates combinations of six feature selection methods (χ² Statistics, Information Gain, Gain Ratio, Relief-F, Significance Attribute Evaluation, and Symmetrical Uncertainty) and five classification algorithms (k-Nearest Neighbors, Decision Tree, Support Vector Machine, Naive Bayes, and Random Forest). Additionally, to address imbalanced data during training, we apply SMOTE (Synthetic Minority Over-sampling Technique) to introduce a synthetic bias towards the minority class. Experiments are conducted on the CL-SciSumm corpora to compare the effect of feature selection on classification. The results show that feature selection significantly boosts the F1 score, and that our method is competitive with the state-of-the-art methods in the CL-SciSumm evaluations.
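As a rough illustration of two ingredients named in the abstract, the sketch below ranks features by a chi-square statistic and then oversamples the minority class by SMOTE-style interpolation. It is not the authors' implementation: the binary toy features, the function names (`chi2_score`, `select_top_k`, `smote`), the choice to select features before oversampling, and the parameters `k` and `seed` are all assumptions made for this example, and the classification step itself is omitted.

```python
import random

def chi2_score(feature, labels):
    """Chi-square statistic of a binary feature against binary labels (2x2 table)."""
    n = len(labels)
    obs = [[0, 0], [0, 0]]  # obs[f][c]: count of feature value f with class c
    for f, c in zip(feature, labels):
        obs[f][c] += 1
    score = 0.0
    for f in (0, 1):
        for c in (0, 1):
            row_total = obs[f][0] + obs[f][1]
            col_total = obs[0][c] + obs[1][c]
            expected = row_total * col_total / n
            if expected > 0:
                score += (obs[f][c] - expected) ** 2 / expected
    return score

def select_top_k(X, y, k):
    """Return the indices of the k features with the highest chi-square score."""
    n_features = len(X[0])
    scores = [chi2_score([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Candidate neighbours: the other minority points, nearest first.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy data (hypothetical): 6 sentence pairs, 3 binary features, label 1 = relevant.
X = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
y = [1, 1, 0, 0, 0, 0]

selected = select_top_k(X, y, k=2)                     # feature selection
X_reduced = [[row[j] for j in selected] for row in X]  # keep selected columns
minority = [row for row, label in zip(X_reduced, y) if label == 1]
n_majority = sum(1 for label in y if label == 0)
synthetic = smote(minority, n_new=n_majority - len(minority))
X_balanced = X_reduced + synthetic
y_balanced = y + [1] * len(synthetic)
```

After this step the two classes have equal counts, and any of the five classifiers listed in the abstract could be trained on `X_balanced` and `y_balanced`. Whether the original method oversamples before or after feature selection is not stated in the abstract, so the order used here is only one plausible arrangement.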


Pages: 355-373

Page count: 19

Total: 50 entries

[1] A feature selection method to handle imbalanced data in text classification
Chang, Fengxiang
Guo, Jun
Xu, Weiran
Yao, Kejun
Journal of Digital Information Management, 2015, 13 (03) : 169 - 175
[2] An Embedded Feature Selection Method for Imbalanced Data Classification
Liu, Haoyue
Zhou, MengChu
Liu, Qing
IEEE-CAA JOURNAL OF AUTOMATICA SINICA, 2019, 6 (03) : 703 - 715
[3] An Embedded Feature Selection Method for Imbalanced Data Classification
Haoyue Liu
MengChu Zhou
Qing Liu
IEEE/CAA Journal of Automatica Sinica, 2019, 6 (03) : 703 - 715
[4] A Classification Method Based on Feature Selection for Imbalanced Data
Liu, Yi
Wang, Yanzhen
Ren, Xiaoguang
Zhou, Hao
Diao, Xingchun
IEEE ACCESS, 2019, 7 : 81794 - 81807
[5] Feature selection method on imbalanced text
Liao, Yi-Xing
Pan, Xue-Zeng
Dianzi Keji Daxue Xuebao/Journal of the University of Electronic Science and Technology of China, 2012, 41 (04) : 592 - 595
[6] Optimal Feature Selection for Imbalanced Text Classification
Khurana A.
Verma O.P.
IEEE Transactions on Artificial Intelligence, 2023, 4 (01) : 135 - 147
[7] On Identifying Cited Texts for Citances and Classifying Their Discourse Facets by Classification Techniques
Yeh, Jen-Yuan
Hsu, Tien-Yu
Tsai, Cheng-Jung
Cheng, Pei-Cheng
Lin, Jung-Yi
JOURNAL OF INFORMATION SCIENCE AND ENGINEERING, 2019, 35 (01) : 61 - 86
[8] Comparison of metrics for feature selection in imbalanced text classification
Ogura, Hiroshi
Amano, Hiromi
Kondo, Masato
EXPERT SYSTEMS WITH APPLICATIONS, 2011, 38 (05) : 4978 - 4989
[9] Ensemble System for Identification of Cited Text Spans: Based on Two Steps of Feature Selection
Xu, Jin
Zhang, Chengzhi
Ma, Shutian
INFORMATION RETRIEVAL (CCIR 2019), 2019, 11772 : 95 - 107
[10] Imbalanced Data Classification Based on Feature Selection Techniques
Ksieniewicz, Pawel
Wozniak, Michal
INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING (IDEAL 2018), PT II, 2018, 11315 : 296 - 303
