The Standard Deviation Score: a novel similarity metric for data analysis

Cited by: 0
Authors
Ismael, Osama [1]
Affiliations
[1] Cairo Univ, Fac Comp & Artificial Intelligence, Giza, Egypt
Keywords
Similarity measurement; Standard Deviation Score; Distance metrics; k-Nearest Neighbor; K-means; Gaussian; Skewed; Multimodal distributions; DISTANCE
DOI
10.1186/s40537-025-01091-z
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
The ability to measure similarity or distance between data points is critical for many analytical tasks, including classification, clustering, and anomaly detection. However, traditional distance metrics such as Euclidean, Manhattan, and Hamming often struggle with mixed data types, varying attribute scales, and noise, which limits their robustness on diverse datasets. This paper introduces the Standard Deviation Score (SD-score), a novel similarity metric designed to address these challenges. By transforming traditional distance values into standard deviation units relative to a target point, the SD-score enables robust and interpretable similarity assessments. Extensive experimental evaluations demonstrate that the SD-score consistently outperforms conventional metrics in accuracy, precision, recall, and F-score within the k-Nearest Neighbors classification framework. In addition, an evaluation across Gaussian, skewed, and multimodal distributions showed promising results in a cluster coherence experiment that measured the Silhouette score under K-means clustering, which points to the metric's adaptability to real-world data complexities. The experiments also show improved handling of mixed numerical, ordinal, and categorical data types through a unified framework. The proposed metric incorporates inherent normalization mechanisms that reduce sensitivity to outliers and keep results consistent across varying data scales and distributions, making it a versatile tool for real-world applications. This advancement in similarity measurement paves the way for more accurate and efficient data analysis across multiple domains.
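The abstract does not state the SD-score formula, so the following is only an illustrative sketch of one possible reading of "distance values in standard deviation units relative to a target point". It assumes Euclidean distance as the base metric and divides each distance by the standard deviation of all distances to the target. The function name `sd_scores` and the zero-spread fallback are hypothetical and are not taken from the paper.

```python
import numpy as np


def sd_scores(points, target):
    """Hypothetical sketch, not the paper's definition.

    Returns each point's Euclidean distance to `target`, expressed in
    units of the standard deviation of all those distances.
    """
    points = np.asarray(points, dtype=float)
    target = np.asarray(target, dtype=float)

    # Base metric (assumption): Euclidean distance to the target point.
    dists = np.linalg.norm(points - target, axis=1)

    # Population standard deviation of the distances.
    sd = dists.std()
    if sd == 0:
        # Assumption: if all points are equidistant, report zero spread.
        return np.zeros_like(dists)

    return dists / sd


# Usage: three points at distances 0, 5, and 10 from the origin.
scores = sd_scores([[0, 0], [3, 4], [6, 8]], target=[0, 0])
print(scores)
```

Under this assumed reading, the rescaled distances have unit standard deviation regardless of the original attribute scale, which is one way a metric could achieve the built-in normalization the abstract describes.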
Pages: 42
Related papers
50 in total
  • [1] STANDARD SCORE NOT DEVIATION IQ
    KAPLAN, MS
    PERSONNEL AND GUIDANCE JOURNAL, 1965, 44 (02): : 194 - 194
  • [2] The Unsupervised Feature Selection Algorithms Based on Standard Deviation and Cosine Similarity for Genomic Data Analysis
    Xie, Juanying
    Wang, Mingzhao
    Xu, Shengquan
    Huang, Zhao
    Grant, Philip W.
    FRONTIERS IN GENETICS, 2021, 12
  • [3] Text similarity computing based on standard deviation
    Liu, T
    Guo, J
    ADVANCES IN INTELLIGENT COMPUTING, PT 1, PROCEEDINGS, 2005, 3644 : 456 - 464
  • [4] A novel similarity metric with application to big process data analytics
    Guo, Zijian
    Shang, Chao
    Ye, Hao
    CONTROL ENGINEERING PRACTICE, 2021, 113
  • [5] Application of distance standard deviation in functional data analysis
    Krzysko, Miroslaw
    Smaga, Lukasz
    ADVANCES IN DATA ANALYSIS AND CLASSIFICATION, 2024, 18 (02) : 431 - 454
  • [6] The role of the sample standard deviation in the analysis of measurement data
    Willink, Robin
    ACCREDITATION AND QUALITY ASSURANCE, 2009, 14 (07) : 353 - 358
  • [7] The role of the sample standard deviation in the analysis of measurement data
    Robin Willink
    Accreditation and Quality Assurance, 2009, 14 : 353 - 358
  • [8] STANDARD DEVIATION VERSUS AGE AS A SCORE UNIT
    Willson, G. M.
    JOURNAL OF EDUCATIONAL RESEARCH, 1926, 13 (03): : 189 - 196
  • [9] A multi-metric similarity based analysis of microarray data
    Altiparmak, Fatih
    Erdal, Selnur
    Ozturk, Ozgur
    Ferhatosmanoglu, Hakan
    2007 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE, PROCEEDINGS, 2007, : 317 - +
  • [10] Detection of Deviation in Performance of Battery Cells by Data Compression and Similarity Analysis
    Vachkov, Gancho
    Byttner, Stefan
    Svensson, Magnus
    INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2014, 29 (03) : 207 - 222