Open source software classification using cost-sensitive multi-label learning

Cited by: 0
Authors
Han, Le [1]
Li, Ming [1]
Affiliation
[1] National Key Laboratory for Novel Software Technology (Nanjing University), Nanjing
Source
Ruan Jian Xue Bao/Journal of Software | 2014 / Vol. 25 / No. 09
Keywords
Cost-sensitive learning; Machine learning; Multi-label learning; Software automatic tagging; Software mining
DOI
10.13328/j.cnki.jos.004639
Abstract
With the explosive growth of open source software, retrieving the desired software in open source software communities has become a major challenge. Tagging open source software is usually a manual process in which several tags describing a project's functions and characteristics are assigned to it. Users can then find the software they want by matching keywords against these tags. Because it is simple and convenient, tag-based software retrieval is widely used. However, manual tagging is expensive and time-consuming, so developers are often unwilling to tag their projects sufficiently when uploading them. Automatic software tagging, which assigns tags describing functions and characteristics based on the text descriptions that users provide for software projects, therefore becomes key to effective software retrieval. This article formalizes the task as a multi-label learning problem and proposes a new multi-label learning method, ML-CKNN, which remains effective when the number of distinct tags is extremely large. By incorporating misclassification costs into multi-label learning, ML-CKNN addresses the class imbalance inherent in the problem: for each tag, the instances associated with it are far fewer than those not associated with it. Experiments on three open source software community datasets show that ML-CKNN provides high-quality tags for newly uploaded open source software and significantly outperforms existing methods. © Copyright 2014, Institute of Software, the Chinese Academy of Sciences. All Rights Reserved.
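The idea of weighting misclassification costs in a nearest-neighbor multi-label tagger can be illustrated with a minimal sketch. This is not the authors' ML-CKNN, whose details the abstract does not give. It is a generic cost-sensitive k-nearest-neighbor vote, and the function names, the bag-of-words cosine similarity, and the parameters `k`, `cost_fn`, and `cost_fp` are all assumptions made for illustration. A tag is predicted when the expected cost of missing it exceeds the expected cost of assigning it wrongly, which lowers the decision threshold for rare tags when misses are costed more heavily.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (simple whitespace tokenization, an assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_tags(train, query, k=3, cost_fn=5.0, cost_fp=1.0):
    """Hypothetical cost-sensitive multi-label kNN.

    train: non-empty list of (description, set_of_tags) pairs.
    cost_fn: cost of missing a tag that applies (false negative).
    cost_fp: cost of assigning a tag that does not apply (false positive).
    """
    q = vectorize(query)
    # Rank training projects by similarity to the query description.
    ranked = sorted(train, key=lambda item: cosine(vectorize(item[0]), q),
                    reverse=True)
    neighbors = ranked[:k]
    all_tags = set().union(*(tags for _, tags in train))
    # Predict a tag when cost_fn * p > cost_fp * (1 - p),
    # i.e. when p exceeds cost_fp / (cost_fp + cost_fn).
    threshold = cost_fp / (cost_fp + cost_fn)
    predicted = set()
    for tag in all_tags:
        votes = sum(1 for _, tags in neighbors if tag in tags)
        p = votes / len(neighbors)  # neighbor fraction as a crude probability
        if p > threshold:
            predicted.add(tag)
    return predicted
```

With equal costs the rule reduces to a majority vote among the neighbors. Raising `cost_fn` relative to `cost_fp` makes the tagger more willing to emit tags supported by only a minority of neighbors, which is one simple way to counter the imbalance between tagged and untagged instances described above.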
Pages: 1982-1991
Page count: 9
Related papers
15 in total
  • [1] Wang T., Yin G., Li X., Wang H., Labeled topic detection of open source software from mining mass textual project profiles, Proc. of the 1st Int'l Workshop on Software Mining, pp. 17-24, (2012)
  • [2] Xia X., Lo D., Wang X., Zhou B., Tag recommendation in software information sites, Proc. of the 10th Int'l Workshop on Mining Software Repositories, pp. 287-296, (2013)
  • [3] Zhang M.L., Zhou Z.H., ML-KNN: A lazy learning approach to multi-label learning, Pattern Recognition, 40, 7, pp. 2038-2048, (2007)
  • [4] De Souza A.F., Pedroni F., Oliveira E., Ciarelli P.M., Henrique W.F., Veronese L., Badue C., Automated multi-label text categorization with VG-RAM weightless neural networks, Neurocomputing, 72, 10, pp. 2209-2217, (2009)
  • [5] Ciarelli P.M., Oliveira E., Badue C., De Souza A.F., Multi-Label text categorization using a probabilistic neural network, Int'l Journal of Computer Information Systems and Industrial Management Applications, 1, pp. 133-144, (2009)
  • [6] Jiang J.Y., Tsai S.C., Lee S.J., FSKNN: Multi-Label text categorization based on fuzzy similarity and k nearest neighbors, Expert Systems with Applications, 39, 3, pp. 2813-2821, (2012)
  • [7] Tsoumakas G., Katakis I., Vlahavas I., Mining multi-label data, Data Mining and Knowledge Discovery Handbook, pp. 667-685, (2010)
  • [8] Tsoumakas G., Vlahavas I., Random k-label sets: An ensemble method for multilabel classification, Proc. of the 18th European Conf. on Machine Learning, pp. 406-417, (2007)
  • [9] Lo H.Y., Wang J.C., Wang H.M., Lin S.D., Cost-Sensitive multi-label learning for audio tag annotation and retrieval, IEEE Trans. on Multimedia, 13, 3, pp. 518-529, (2011)
  • [10] Liu A.Y., The effect of oversampling and undersampling on classifying imbalanced text datasets, (2004)