Text Categorization via Similarity Search: An Efficient and Effective Novel Algorithm

Cited by: 0
Authors
Duan, Hubert Haoyang [1]
Pestov, Vladimir G. [1]
Singla, Varun [2]
Affiliations
[1] Univ Ottawa, Ottawa, ON K1N 6N5, Canada
[2] Indian Inst Technol, New Delhi, India
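Source
SIMILARITY SEARCH AND APPLICATIONS (SISAP) | 2013 / Vol. 8199
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract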
We present a supervised learning algorithm for text categorization that earned the authors' team 2nd place in the text categorization division of the 2012 Cybersecurity Data Mining Competition (CDMC'2012) and 3rd prize overall. The algorithm differs from existing approaches in that it is based on similarity search in the metric space of measure distributions on the dictionary. At the preprocessing stage, given a labeled learning sample of texts, we associate with every class label (document category) a point in the space in question. Unlike what is usual in clustering, this point is not a centroid of the category but rather an outlier: a uniform measure distribution on a selection of domain-specific words. At the execution stage, an unlabeled text is assigned the category of the labeled point closest to the point representing the frequency distribution of the words in the text. The algorithm is both effective and efficient, as further confirmed by experiments on the Reuters-21578 dataset.
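The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which metric is used or how the domain-specific words are selected, so the L1 distance, the selection rule (words ranked by the share of their occurrences that fall in the class), and the names `word_distribution`, `l1_distance`, `class_points`, `classify`, and `top_k` are all assumptions made for illustration.

```python
from collections import Counter


def word_distribution(text):
    """Normalized word-frequency distribution of a text (a measure on the dictionary)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def l1_distance(p, q):
    """L1 distance between two distributions (assumed metric, not taken from the paper)."""
    return sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in set(p) | set(q))


def class_points(labeled_texts, top_k=3):
    """Preprocessing: map each label to a uniform distribution on selected words.

    Hypothetical selection rule: rank words by the share of their occurrences
    that fall in the class, then by in-class count, then alphabetically.
    """
    per_class, overall = {}, Counter()
    for text, label in labeled_texts:
        words = text.lower().split()
        per_class.setdefault(label, Counter()).update(words)
        overall.update(words)
    points = {}
    for label, counts in per_class.items():
        ranked = sorted(counts, key=lambda w: (-counts[w] / overall[w], -counts[w], w))
        chosen = ranked[:top_k]
        points[label] = {w: 1.0 / len(chosen) for w in chosen}
    return points


def classify(text, points):
    """Execution: return the label whose point is closest to the text's distribution."""
    p = word_distribution(text)
    return min(points, key=lambda label: l1_distance(p, points[label]))
```

With a toy training sample of two "security" and two "finance" texts, `classify("new virus attack reported", points)` picks the label whose selected words overlap most with the text. Each class point is a uniform distribution on a few selected words rather than the class's average word distribution, which mirrors the "outlier, not centroid" idea in the abstract.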
Pages: 182-193
Page count: 12