Text Categorization via Similarity Search: An Efficient and Effective Novel Algorithm

Cited by: 0
Authors
Duan, Hubert Haoyang [1]
Pestov, Vladimir G. [1]
Singla, Varun [2]
Affiliations
[1] Univ Ottawa, Ottawa, ON K1N 6N5, Canada
[2] Indian Inst Technol, New Delhi, India
Source
SIMILARITY SEARCH AND APPLICATIONS (SISAP) | 2013 / Vol. 8199
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
We present a supervised learning algorithm for text categorization which earned the authors 2nd place in the text categorization division of the 2012 Cybersecurity Data Mining Competition (CDMC'2012) and 3rd prize overall. The algorithm differs from existing approaches in that it is based on similarity search in the metric space of measure distributions on the dictionary. At the preprocessing stage, given a labeled learning sample of texts, we associate to every class label (document category) a point in the space in question. Unlike what is usual in clustering, this point is not a centroid of the category but rather an outlier: a uniform measure distribution on a selection of domain-specific words. At the execution stage, an unlabeled text is assigned the category of the labeled point closest to the point representing the frequency distribution of the words in the text. The algorithm is both effective and efficient, as further confirmed by experiments on the Reuters 21578 dataset.
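The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the metric or how domain-specific words are selected, so the L1 distance, the `select_domain_words` scoring rule, the `top_k` parameter, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def word_distribution(tokens):
    """Empirical frequency distribution of the words in a tokenized text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {w: c / total for w, c in counts.items()}


def select_domain_words(docs_by_label, top_k=3):
    """Hypothetical selection rule (not from the paper): for each label, keep
    the top_k words whose in-class frequency most exceeds their highest
    frequency in any other class."""
    class_freq = {
        label: word_distribution([w for doc in docs for w in doc])
        for label, docs in docs_by_label.items()
    }
    selected = {}
    for label, freq in class_freq.items():
        others = [f for other, f in class_freq.items() if other != label]

        def score(word):
            outside = max((f.get(word, 0.0) for f in others), default=0.0)
            return freq[word] - outside

        # Ties are broken alphabetically so the result is deterministic.
        ranked = sorted(freq, key=lambda w: (-score(w), w))
        selected[label] = ranked[:top_k]
    return selected


def class_points(selected):
    """Preprocessing stage: each label becomes a uniform distribution on its
    selected words, rather than a centroid of its documents."""
    return {
        label: {w: 1.0 / len(words) for w in words}
        for label, words in selected.items()
    }


def l1_distance(p, q):
    """L1 distance between two sparse distributions (an assumed metric)."""
    return sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in set(p) | set(q))


def classify(tokens, points):
    """Execution stage: return the label whose point is nearest to the
    word-frequency distribution of the text."""
    dist = word_distribution(tokens)
    return min(points, key=lambda lbl: (l1_distance(dist, points[lbl]), lbl))


# Toy labeled sample, invented for this sketch.
train = {
    "sports": [["goal", "team", "match", "the"], ["team", "score", "goal", "the"]],
    "finance": [["bank", "stock", "market", "the"], ["stock", "price", "bank", "the"]],
}
points = class_points(select_domain_words(train))
print(classify(["the", "team", "won", "the", "match"], points))
```

With only one point per category, the search at execution time reduces to a scan over the class labels, which is one plausible reading of why such a scheme can be efficient.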
Pages: 182 - 193
Number of pages: 12