A Method of K-Means Clustering Based on TF-IDF for Software Requirements Documents Written in Chinese Language

Cited by: 5
Authors
Zhu, Jing [1 ,2 ]
Huang, Song [1 ]
Shi, Yaqing [1 ]
Wu, Kaishun [1 ]
Wang, Yanqiu [3 ]
Affiliations
[1] Army Engn Univ PLA, Command & Control Engn Coll, Nanjing 210000, Peoples R China
[2] Navy Command Coll, Training Management Dept, Nanjing 210000, Peoples R China
[3] Baopo Technol Co Ltd, Nanjing 210000, Peoples R China
Funding
China Postdoctoral Science Foundation; National Key Research and Development Program of China
Keywords
Chinese; TF-IDF; K-means; clustering;
DOI
10.1587/transinf.2021EDP7144
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
At present there is no way to obtain function points automatically when using the function point analysis (FPA) method, especially for requirements documents written in Chinese. Given the characteristics of Chinese grammar with respect to word segmentation, Chinese words must be segmented accurately so that subsequent entity recognition and disambiguation can be carried out over a smaller range, which lays a solid foundation for efficient automatic extraction of function points. This paper therefore proposes a K-Means clustering method based on TF-IDF and conducts experiments on 24 software requirements documents written in Chinese. The results show that the best clustering effect is achieved when 55% to 75% of the extracted information is retained and the number of clusters takes the middle value of the total number of clusters. The method and conclusions of this paper apply not only to Chinese: they also provide an important reference for automatic extraction of function points from software requirements documents written in other Oriental languages, and they fill a gap in data preprocessing at the early stage of automatic function point calculation.
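The general technique named in the abstract, TF-IDF weighting followed by K-Means clustering, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the documents have already been segmented into words by a separate tool, it interprets "retained information" as keeping the top-weighted share of each document's terms, and it uses a smoothed IDF and farthest-point seeding for determinism. All function names and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf(docs):
    """Return one {term: weight} dict per document (docs are lists of tokens)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Smoothed IDF (an assumed variant) keeps every weight positive.
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def retain_top(vec, ratio):
    """Keep the highest-weighted share of terms; ties are broken by term."""
    keep = max(1, math.ceil(len(vec) * ratio))
    ranked = sorted(vec.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:keep])


def normalize(vec):
    """Scale a sparse vector to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b))


def centroid(members):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for vec in members:
        for t, w in vec.items():
            total[t] += w
    return {t: w / len(members) for t, w in total.items()}


def kmeans(vectors, k, max_iter=50):
    """Lloyd's algorithm with deterministic farthest-point seeding (an assumption)."""
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(max(vectors, key=lambda v: min(sq_dist(v, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(v, centers[j])) for v in vectors]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = centroid(members)
    return labels


def cluster(docs, k, ratio=0.65):
    """TF-IDF, keep the top `ratio` of terms per document, normalize, then K-Means."""
    vectors = [normalize(retain_top(v, ratio)) for v in tfidf(docs)]
    return kmeans(vectors, k)


# Hypothetical pre-segmented requirement fragments (two about accounts, two about orders).
docs = [
    ["用户", "登录", "密码", "登录"],
    ["用户", "注册", "密码", "邮箱"],
    ["订单", "支付", "金额", "支付"],
    ["订单", "退款", "金额", "支付"],
]
labels = cluster(docs, k=2, ratio=0.65)
print(labels)
```

The retention ratio of 0.65 is simply a value inside the 55% to 75% range reported in the abstract; the number of clusters `k` is left as a plain parameter because the abstract does not specify how the "middle value" is computed.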
Pages: 736-754
Number of pages: 19