Text Categorization via Similarity Search: An Efficient and Effective Novel Algorithm

Authors
Duan, Hubert Haoyang [1]
Pestov, Vladimir G. [1]
Singla, Varun [2]
Affiliations
[1] Univ Ottawa, Ottawa, ON K1N 6N5, Canada
[2] Indian Inst Technol, New Delhi, India
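Source
SIMILARITY SEARCH AND APPLICATIONS (SISAP) | 2013 / Vol. 8199
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract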
We present a supervised learning algorithm for text categorization that earned the team of authors 2nd place in the text categorization division of the 2012 Cybersecurity Data Mining Competition (CDMC'2012) and 3rd prize overall. The algorithm differs from existing approaches in that it is based on similarity search in the metric space of measure distributions on the dictionary. At the preprocessing stage, given a labeled learning sample of texts, we associate with every class label (document category) a point in the space in question. Unlike what is usual in clustering, this point is not a centroid of the category but rather an outlier: a uniform measure distribution on a selection of domain-specific words. At the execution stage, an unlabeled text is assigned the category of the labeled point closest to the point representing the frequency distribution of the words in the text. The algorithm is both effective and efficient, as further confirmed by experiments on the Reuters-21578 dataset.
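The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract names neither the metric nor the word-selection rule, so the L1 distance between distributions and the frequency-ratio rule in `select_category_words` (including its `ratio` parameter) are assumptions made for illustration, and all function names are hypothetical.

```python
from collections import Counter


def word_distribution(text):
    """Relative word frequencies of a text: a probability measure on the dictionary."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def select_category_words(labeled_texts, ratio=2.0):
    """ASSUMED selection rule (not from the paper): keep a word for a category
    if its relative frequency there is at least `ratio` times its relative
    frequency in all other categories combined."""
    per_cat = {}
    for text, label in labeled_texts:
        per_cat.setdefault(label, Counter()).update(text.lower().split())
    selected = {}
    for label, counts in per_cat.items():
        rest = Counter()
        for other, other_counts in per_cat.items():
            if other != label:
                rest.update(other_counts)
        total_in = sum(counts.values())
        total_out = sum(rest.values()) or 1  # avoid division by zero
        selected[label] = {
            w for w, c in counts.items()
            if c / total_in >= ratio * (rest[w] / total_out)
        }
    return selected


def uniform_measure(words):
    """Uniform distribution on the selected words: the point representing a category."""
    return {w: 1.0 / len(words) for w in words}


def l1_distance(p, q):
    """ASSUMED metric: L1 distance between two sparse distributions."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def classify(text, category_points):
    """Assign the label of the category point closest to the text's distribution."""
    p = word_distribution(text)
    return min(category_points, key=lambda lab: l1_distance(p, category_points[lab]))


# Preprocessing stage: one point per category from a toy labeled sample.
sample = [
    ("malware virus attack exploit", "security"),
    ("virus exploit patch attack", "security"),
    ("goal match team score", "sports"),
    ("team goal league match", "sports"),
]
points = {lab: uniform_measure(ws) for lab, ws in select_category_words(sample).items()}

# Execution stage: nearest category point decides the label.
print(classify("new virus exploit found", points))
```

Because each category is a single point, classifying a text takes one distance computation per category rather than one per training document, which is consistent with the efficiency claim in the abstract.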
Pages: 182 - 193
Page count: 12
Related papers
50 in total
  • [41] Olex: Effective Rule Learning for Text Categorization
    Rullo, Pasquale
    Policicchio, Veronica Lucia
    Cumbo, Chiara
    Iiritano, Salvatore
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2009, 21 (08) : 1118 - 1132
  • [42] A Novel Efficient Classification Algorithm for Search Engines
    Hosni, Hanan Ahmed
    Alla, Mahmoud Abd
    PROCEEDINGS OF THE 8TH WSEAS INTERNATIONAL CONFERENCE ON APPLIED INFORMATICS AND COMMUNICATIONS, PTS I AND II: NEW ASPECTS OF APPLIED INFORMATICS AND COMMUNICATIONS, 2008, : 356 - +
  • [43] An Effective Feature Selection Method for Text Categorization
    Qiu, Xipeng
    Zhou, Jinlong
    Huang, Xuanjing
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PT I: 15TH PACIFIC-ASIA CONFERENCE, PAKDD 2011, 2011, 6634 : 50 - 61
  • [44] Learning effective features for Chinese text categorization
    Luo, DS
    Wang, XH
    Wu, XH
    Chi, HS
    PROCEEDINGS OF THE 2005 IEEE INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND KNOWLEDGE ENGINEERING (IEEE NLP-KE'05), 2005, : 608 - 613
  • [45] Local Similarity Search for Unstructured Text
    Wang, Pei
    Xiao, Chuan
    Qin, Jianbin
    Wang, Wei
    Zhang, Xiaoyang
    Ishikawa, Yoshiharu
    SIGMOD'16: PROCEEDINGS OF THE 2016 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2016, : 1991 - 2005
  • [46] Continuous Similarity Search for Text Sets
    Tsuchida, Yuma
    Kubo, Kohei
    Koga, Hisashi
    DATABASE AND EXPERT SYSTEMS APPLICATIONS, DEXA 2022, PT II, 2022, 13427 : 229 - 234
  • [47] An effective WSN deployment algorithm via search economics
    Tsai, Chun-Wei
    COMPUTER NETWORKS, 2016, 101 : 178 - 191
  • [48] In search of similarity: Stereotypes as naive theories in social categorization
    Wittenbrink, B
    Hilton, JL
    Gist, PL
    SOCIAL COGNITION, 1998, 16 (01) : 31 - 55
  • [49] A Novel Similarity Algorithm for Fixing Erroneous Turkish Text and Detection of Roots
    Ozdemir, Cuneyt
    Atas, Musa
    2014 22ND SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE (SIU), 2014, : 830 - 833
  • [50] Novel similarity measures for the effective and efficient retrieval of pharmacological datasets
    Rivera Borroto, Oscar Miguel
    Hernandez Diaz, Yoandy
    Manuel Garcia de la Vega, Jose
    Grau Abalo, Ricardo del Corazon
    Marrero Ponce, Yovani
    AFINIDAD, 2011, 68 (551) : 50 - 56