On the Use of Side Information for Mining Text Data

Cited by: 15
Authors
Aggarwal, Charu C. [1]
Zhao, Yuchen [2]
Yu, Philip S. [3]
Affiliations
[1] IBM TJ Watson Res Ctr, Dept Comp Sci, Yorktown Hts, NY 10532 USA
[2] Sumo Log Inc, Redwood City, CA 94063 USA
[3] Univ Illinois, Dept Comp Sci, Chicago, IL 60607 USA
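Funding
US National Science Foundation
Keywords
Data mining; clustering
DOI
10.1109/TKDE.2012.148
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract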
In many text mining applications, side-information is available along with the text documents. Such side-information may be of different kinds, such as document provenance information, the links in the document, user-access behavior from web logs, or other non-textual attributes which are embedded into the text document. Such attributes may contain a tremendous amount of information for clustering purposes. However, the relative importance of this side-information may be difficult to estimate, especially when some of the information is noisy. In such cases, it can be risky to incorporate side-information into the mining process, because it can either improve the quality of the representation for the mining process or add noise to it. Therefore, we need a principled way to perform the mining process, so as to maximize the advantages from using this side-information. In this paper, we design an algorithm which combines classical partitioning algorithms with probabilistic models in order to create an effective clustering approach. We then show how to extend the approach to the classification problem. We present experimental results on a number of real data sets in order to illustrate the advantages of using such an approach.
Pages: 1415-1429
Page count: 15