The Multiclass Classification of Newspaper Articles with Machine Learning: The Hybrid Binary Snowball Approach

Cited by: 16
Authors
Sebok, Miklos [1]
Kacsuk, Zoltan [1, 2]
Affiliations
[1] Hungarian Acad Sci, Ctr Social Sci, Budapest, Hungary
[2] Hsch Medien, Stuttgart, Germany
Keywords
machine learning; statistical analysis of texts; Comparative Agendas Project; multiclass classification; automated content analysis
DOI
10.1017/pan.2020.27
Chinese Library Classification (CLC) number
D0 [Political science, political theory]
Discipline classification code
0302; 030201
Abstract
In this article, we present a machine learning-based solution for matching the performance of the gold standard of double-blind human coding when it comes to content analysis in comparative politics. We combine a quantitative text analysis approach with supervised learning and limited human resources in order to classify the front-page articles of a leading Hungarian daily newspaper based on their full text. Our goal was to assign items in our dataset to one of 21 policy topics based on the codebook of the Comparative Agendas Project. The classification of the imbalanced classes of topics was handled by a hybrid binary snowball workflow. This relies on limited human resources as well as supervised learning; it simplifies the multiclass problem to one of binary choice; and it is based on a snowball approach as we augment the training set with machine-classified observations after each successful round and also between corpora. Our results show that our approach provided better precision results (of over 80% for most topic codes) than what is customary for human coders and most computer-assisted coding projects. Nevertheless, this high precision came at the expense of a relatively low, below 60%, share of labeled articles.
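The workflow described in the abstract (one binary classifier per topic, with machine-classified articles fed back into the training set after each round) can be illustrated with a minimal toy sketch. Everything below is an assumption made for illustration: the word-count scorer, the `threshold` value, the rule of accepting an article only when exactly one topic classifier fires, and all function names are hypothetical and are not taken from the paper, which does not specify these details in its abstract.

```python
from collections import Counter


def train_binary(docs, labels, topic):
    """Count word frequencies for one topic (positive) vs. all other topics (negative)."""
    pos, neg = Counter(), Counter()
    for text, label in zip(docs, labels):
        (pos if label == topic else neg).update(text.lower().split())
    return pos, neg


def score(model, text):
    """Share of the document's words seen more often in positive than negative examples."""
    pos, neg = model
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if pos[w] > neg[w])
    return hits / len(words)


def snowball(labeled, unlabeled, topics, threshold=0.6, max_rounds=5):
    """Toy binary snowball loop (hypothetical; not the authors' implementation).

    Each round trains one binary model per topic, accepts only articles that
    exactly one model scores at or above `threshold`, and adds those
    machine-labeled articles to the training set for the next round.
    Articles that never qualify stay unlabeled.
    """
    docs = [text for text, _ in labeled]
    labels = [label for _, label in labeled]
    remaining = list(unlabeled)
    for _ in range(max_rounds):
        models = {t: train_binary(docs, labels, t) for t in topics}
        accepted, still_open = [], []
        for text in remaining:
            votes = [t for t in topics if score(models[t], text) >= threshold]
            if len(votes) == 1:
                accepted.append((text, votes[0]))
            else:
                still_open.append(text)
        if not accepted:
            break  # no new confident labels, so the snowball stops growing
        for text, topic in accepted:
            docs.append(text)
            labels.append(topic)
        remaining = still_open
    machine_labeled = list(zip(docs, labels))[len(labeled):]
    return machine_labeled, remaining


# Tiny made-up corpus, for illustration only.
seed = [("budget tax deficit", "economy"), ("hospital doctor patient", "health")]
pool = [
    "tax budget reform",
    "doctor hospital strike",
    "reform deficit plan",
    "football match result",
]
machine_labeled, remaining = snowball(seed, pool, ["economy", "health"])
```

In this toy run, "reform deficit plan" only clears the threshold in the second round, after "tax budget reform" has been added to the training set, which is the snowball effect. "football match result" is never labeled, mirroring the trade-off the abstract reports between high precision and a lower share of labeled articles.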
Pages: 236-249
Page count: 14