Multi-label text classification based on the label correlation mixture model

Cited by: 6

Authors:

He, Zhiyang [1]

Wu, Ji [1]

Lv, Ping [2]

Affiliations:

[1] Tsinghua Univ, Dept Elect Engn, Beijing, Peoples R China

[2] Tsinghua iFlytek Joint Lab Speech Technol, Beijing, Peoples R China

Source:

INTELLIGENT DATA ANALYSIS | 2017 / Vol. 21 / Issue 06

Keywords:

Label correlation mixture model; probabilistic generative model; multi-label text classification; label correlation model; label correlation network; Bayes decision theory; DESIGN;

DOI:

10.3233/IDA-163055

Chinese Library Classification (CLC) number:

TP18 [Theory of artificial intelligence];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In this paper, we propose a probabilistic generative model, the label correlation mixture model (LCMM), to describe multi-labeled document data; the model can be used for multi-label text classification. LCMM assumes two stochastic generative processes, which correspond to two submodels: 1) a label correlation model; and 2) a label mixture model. The first submodel formulates the generative process of the labels, in which a label correlation network is built to capture the dependencies between labels. Moreover, we present an efficient inference algorithm for calculating the generative probability of a multi-label class. Furthermore, in order to optimize the label correlation network, we propose a parameter-learning algorithm based on gradient descent. The second submodel of the LCMM describes the generative process of the words in a document given its labels. Different traditional mixture models can be adopted in this generative process, such as a mixture of language models or topic models. In the multi-label classification stage, we propose a two-step strategy that uses the LCMM efficiently within the framework of Bayes decision theory. We conduct extensive multi-label classification experiments on three standard text data sets. The experimental results show significant performance improvements compared with existing approaches. For example, the improvements in accuracy and macro F-score on the OHSUMED data set reach 28.3% and 37.0%, respectively. These performance gains demonstrate the effectiveness of the proposed models and solutions.
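As a rough illustration of the Bayes decision rule named in the abstract, the sketch below scores every non-empty label set by a label prior times a word likelihood and returns the best one. It is a toy under stated assumptions, not the paper's method: the pairwise log-linear prior merely stands in for the label correlation network, the word model is a uniform mixture of unigram language models, and the search is brute-force enumeration rather than the paper's inference algorithm or two-step strategy. All labels, words, probabilities, and weights (`LABELS`, `WORD_PROBS`, `BIAS`, `PAIR`) are hypothetical.

```python
import itertools
import math

LABELS = ["sports", "finance", "health"]

# Hypothetical per-label unigram language models P(word | label); each row sums to 1.
WORD_PROBS = {
    "sports":  {"match": 0.5, "stock": 0.1,  "injury": 0.3,  "bank": 0.1},
    "finance": {"match": 0.1, "stock": 0.5,  "injury": 0.05, "bank": 0.35},
    "health":  {"match": 0.1, "stock": 0.05, "injury": 0.6,  "bank": 0.25},
}

# Hypothetical stand-in for a label correlation network:
# a per-label bias plus a weight for every pair of co-occurring labels.
BIAS = {"sports": 0.0, "finance": 0.0, "health": -0.5}
PAIR = {
    frozenset(("sports", "health")): 1.0,
    frozenset(("sports", "finance")): -1.0,
    frozenset(("finance", "health")): -1.0,
}


def label_sets():
    """Enumerate every non-empty subset of LABELS (feasible only for few labels)."""
    for size in range(1, len(LABELS) + 1):
        for combo in itertools.combinations(LABELS, size):
            yield frozenset(combo)


def raw_score(label_set):
    """Unnormalized log-prior: label biases plus pairwise correlation weights."""
    score = sum(BIAS[label] for label in label_set)
    for a, b in itertools.combinations(sorted(label_set), 2):
        score += PAIR.get(frozenset((a, b)), 0.0)
    return score


# Normalizer so that the prior sums to 1 over all non-empty label sets.
LOG_Z = math.log(sum(math.exp(raw_score(s)) for s in label_sets()))


def log_prior(label_set):
    """log P(label_set) under the toy pairwise prior."""
    return raw_score(label_set) - LOG_Z


def log_likelihood(words, label_set):
    """log P(words | label_set): each word comes from a uniform mixture of
    the unigram models of the labels in the set."""
    total = 0.0
    for word in words:
        mix = sum(WORD_PROBS[label][word] for label in label_set) / len(label_set)
        total += math.log(mix)
    return total


def classify(words):
    """Bayes decision: the label set maximizing log P(set) + log P(words | set)."""
    return max(label_sets(), key=lambda s: log_prior(s) + log_likelihood(words, s))


print(sorted(classify(["match", "injury", "injury"])))  # ['health', 'sports']
print(sorted(classify(["stock", "bank", "stock"])))     # ['finance']
```

Enumerating all label sets grows exponentially with the number of labels, which is why a dedicated inference algorithm, such as the one the abstract mentions, is needed at realistic scale.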

Citation:

Pages: 1371-1392

Page count: 22

35 references in total

