Classification of Text Documents Based on a Probabilistic Topic Model

Times cited: 0
Authors
Karpovich, S. N. [1]
Smirnov, A. V. [2]
Teslya, N. N. [2]
Affiliations
[1] Olymp Corp, Moscow 121205, Russia
[2] Russian Acad Sci SPIIRAS, St Petersburg Inst Informat & Automat, St Petersburg 199178, Russia
Funding
Russian Foundation for Basic Research
Keywords
classification; binary classification; topic modeling; natural language processing; SUPPORT;
DOI
10.3103/S0147688219050034
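Chinese Library Classification number
G25 [Library science, librarianship]; G35 [Information science, information work]
Discipline classification code
1205; 120501
Abstract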
An approach to text document classification is proposed that uses a probabilistic topic model whose training document set contains objects of only one class. This approach makes it possible to identify positive samples (samples resembling the target class) in collections and streams of text documents. The article considers models that were created for text document classification and trained on samples of a single class, and describes their key features. The Positive Example Based Learning-TM classification model is presented, and a software prototype that implements it as a basis for classifying text documents is developed. Although it has no information about negative document samples, the model demonstrates classification accuracy that exceeds that of alternative approaches. The advantage of the Positive Example Based Learning-TM model in classification accuracy when a small training set is used is shown experimentally.
Pages: 314-320
Number of pages: 7
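
The setting described in the abstract, training on documents of a single (positive) class and then flagging similar documents in a collection, can be illustrated with a minimal sketch. The code below is not the Positive Example Based Learning-TM model: it substitutes one smoothed word distribution (in effect a single topic) for the probabilistic topic model. The function names, the smoothing constant `alpha`, the toy documents, and the threshold rule are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an illustrative simplification)."""
    return text.lower().split()


def fit_positive_model(positive_docs, alpha=0.1):
    """Estimate one smoothed word distribution from positive documents only.

    Assumption: a single word distribution stands in for the topic model.
    Returns a function mapping a word to its log-probability.
    """
    counts = Counter()
    for doc in positive_docs:
        counts.update(tokenize(doc))
    total = sum(counts.values())
    # Additive smoothing; every unseen word is scored as one shared extra slot.
    denominator = total + alpha * (len(counts) + 1)

    def log_prob(word):
        # Counter returns 0 for words that never occurred in training.
        return math.log((counts[word] + alpha) / denominator)

    return log_prob


def score(log_prob, doc):
    """Average per-token log-probability of a document under the model."""
    tokens = tokenize(doc)
    if not tokens:
        return float("-inf")
    return sum(log_prob(t) for t in tokens) / len(tokens)


def fit_threshold(log_prob, positive_docs):
    """Hypothetical rule: the lowest score seen among the training documents."""
    return min(score(log_prob, d) for d in positive_docs)


def is_positive(log_prob, threshold, doc):
    """Label a document positive if it scores at least the threshold."""
    return score(log_prob, doc) >= threshold


# Toy positive-only training set (invented for this sketch).
positive_docs = [
    "topic model for text classification",
    "probabilistic topic model of documents",
    "text documents classification with topic model",
]
model = fit_positive_model(positive_docs)
threshold = fit_threshold(model, positive_docs)

print(is_positive(model, threshold, "topic model for documents"))
print(is_positive(model, threshold, "football match result tonight"))
```

No negative samples are used at any point: the decision boundary comes only from how well the positive training documents score under their own model. A real topic model would replace `fit_positive_model` with a mixture of several topics, and the threshold would normally be tuned on held-out positive documents rather than taken as the training minimum.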