Text Categorization via Similarity Search: An Efficient and Effective Novel Algorithm

Cited by: 0
Authors
Duan, Hubert Haoyang [1]
Pestov, Vladimir G. [1]
Singla, Varun [2]
Affiliations
[1] Univ Ottawa, Ottawa, ON K1N 6N5, Canada
[2] Indian Inst Technol, New Delhi, India
Source
SIMILARITY SEARCH AND APPLICATIONS (SISAP) | 2013 / Vol. 8199
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
We present a supervised learning algorithm for text categorization that earned the team of authors 2nd place in the text categorization division of the 2012 Cybersecurity Data Mining Competition (CDMC'2012) and 3rd prize overall. The algorithm differs from existing approaches in that it is based on similarity search in the metric space of measure distributions on the dictionary. At the preprocessing stage, given a labeled learning sample of texts, we associate to every class label (document category) a point in the space in question. Unlike what is usual in clustering, this point is not a centroid of the category but rather an outlier: a uniform measure distribution on a selection of domain-specific words. At the execution stage, an unlabeled text is assigned the text category of the labeled neighbour closest to the point representing the frequency distribution of the words in the text. The algorithm is both effective and efficient, as further confirmed by experiments on the Reuters-21578 dataset.
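The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how the domain-specific words are selected or which metric is used, so the scoring rule (difference of relative frequencies inside and outside a category), the `top_k` parameter, the L1 distance, and all function names below are assumptions made only for illustration.

```python
from collections import Counter


def word_distribution(text):
    """Frequency distribution of the words in a text (sums to 1)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def category_points(labeled_texts, top_k=2):
    """Preprocessing: map each label to a uniform distribution on selected words.

    Assumption: a word counts as "domain-specific" when its relative
    frequency inside the category exceeds its relative frequency in all
    other categories; the top_k highest-scoring words are kept.
    """
    per_label = {}
    for text, label in labeled_texts:
        per_label.setdefault(label, Counter()).update(text.lower().split())

    points = {}
    for label, counts in per_label.items():
        others = Counter()
        for other_label, other_counts in per_label.items():
            if other_label != label:
                others.update(other_counts)
        inside = sum(counts.values())
        outside = sum(others.values()) or 1

        def score(w):
            return counts[w] / inside - others[w] / outside

        # Ties are broken alphabetically so the selection is deterministic.
        chosen = sorted(counts, key=lambda w: (-score(w), w))[:top_k]
        points[label] = {w: 1.0 / len(chosen) for w in chosen}
    return points


def l1_distance(p, q):
    """L1 distance between two distributions (an assumed choice of metric)."""
    return sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in set(p) | set(q))


def classify(text, points):
    """Execution: return the label whose point is nearest to the text."""
    p = word_distribution(text)
    return min(sorted(points), key=lambda label: l1_distance(p, points[label]))


# Toy labeled sample (hypothetical data, for illustration only).
train = [
    ("malware virus trojan attack virus", "security"),
    ("stock market shares price market", "finance"),
]
points = category_points(train, top_k=2)
print(classify("new virus attack reported", points))
```

Each category is represented by a single point, so classifying a text costs one distance computation per category rather than one per training document, which is one plausible reading of the efficiency claim.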
Pages: 182-193
Number of pages: 12