A segment-based approach to clustering multi-topic documents

Cited by: 30
Authors
Tagarelli, Andrea [1]
Karypis, George [2]
Affiliations
[1] Univ Calabria, Dept Elect Comp & Syst Sci, I-87036 Arcavacata Di Rende, CS, Italy
[2] Univ Minnesota, Dept Comp Sci & Engn, Digital Technol Ctr, Minneapolis, MN 55455 USA
Keywords
Document clustering; Text segmentation; Topic identification; Interdisciplinary documents; TEXT; SIMILARITY; ALGORITHMS
DOI
10.1007/s10115-012-0556-z
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Document clustering has been recognized as a central problem in text data management. Such a problem becomes particularly challenging when document contents are characterized by subtopical discussions that are not necessarily relevant to each other. Existing methods for document clustering have traditionally assumed that a document is an indivisible unit for text representation and similarity computation, which may not be appropriate to handle documents with multiple topics. In this paper, we address the problem of multi-topic document clustering by leveraging the natural composition of documents in text segments that are coherent with respect to the underlying subtopics. We propose a novel document clustering framework that is designed to induce a document organization from the identification of cohesive groups of segment-based portions of the original documents. We empirically give evidence of the significance of our segment-based approach on large collections of multi-topic documents, and we compare it to conventional methods for document clustering.
Pages: 563-595
Number of pages: 33
Related papers
  • [1] A segment-based approach to clustering multi-topic documents
    Andrea Tagarelli
    George Karypis
    Knowledge and Information Systems, 2013, 34 : 563 - 595
  • [2] A Data-Based Approach to Discovering Multi-Topic Influential Leaders
    Tang, Xing
    Miao, Qiguang
    Yu, Shangshang
    Quan, Yining
    PLOS ONE, 2016, 11 (07):
  • [3] A constraint-based approach for the authoring of multi-topic multimedia presentations
    Bertino, E
    Ferrari, E
    Perego, A
    Santi, D
    2005 IEEE International Conference on Multimedia and Expo (ICME), Vols 1 and 2, 2005, : 578 - 581
  • [4] Multi-Topic Labelling Classification Based on LSTM
    AlBatayha, Duha
    2021 12TH INTERNATIONAL CONFERENCE ON INFORMATION AND COMMUNICATION SYSTEMS (ICICS), 2021, : 471 - 474
  • [5] Line Segment-Based Clustering Approach With Self-Organizing Maps
    Chamundeswari, G.
    Varma, G. P. S.
    Satyanarayana, C.
    JOURNAL OF INFORMATION TECHNOLOGY RESEARCH, 2021, 14 (04) : 33 - 44
  • [6] Detection as multi-topic tracking
    Allan, J
    INFORMATION RETRIEVAL, 2002, 5 (2-3): : 139 - 157
  • [7] Detection As Multi-Topic Tracking
    James Allan
    Information Retrieval, 2002, 5 : 139 - 157
  • [8] Segment-Based Test Case Prioritization: A Multi-objective Approach
    Hieu Huynh
    Nhu Pham
    Nguyen, Tien N.
    Vu Nguyen
    PROCEEDINGS OF THE 33RD ACM SIGSOFT INTERNATIONAL SYMPOSIUM ON SOFTWARE TESTING AND ANALYSIS, ISSTA 2024, 2024, : 1149 - 1160
  • [9] Segment-based approach to the recognition of emotions in speech
    Shami, MT
    Kamel, MS
    2005 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), VOLS 1 AND 2, 2005, : 366 - 369
  • [10] Analysis and Visualization of Students' Learning Based on Multi-Topic Chat Text
    Zhang, Haoyi
    Pan, Feng
    Wu, Zhenyu
    Ji, Yang
    PROCEEDINGS OF THE 2018 IEEE 6TH INTERNATIONAL CONFERENCE ON MOOCS, INNOVATION AND TECHNOLOGY IN EDUCATION (MITE 2018), 2018, : 90 - 97