Hierarchical Multi-Class Classification of Voice Disorders Using Self-Supervised Models and Glottal Features

Cited by: 17
Authors
Tirronen, Saska [1 ]
Kadiri, Sudarsana Reddy [1 ]
Alku, Paavo [1 ]
Affiliations
[1] Aalto Univ, Dept Informat & Commun Engn, Espoo, Finland
Source
IEEE OPEN JOURNAL OF SIGNAL PROCESSING | 2023 / Vol. 4
Funding
Academy of Finland;
Keywords
Feature extraction; Training data; Pathology; Databases; Pipelines; Task analysis; Training; Pathological voices; voice disorders; hierarchical classification; glottal source extraction; multi-class classification; Wav2vec; HuBERT; SELECTION; DISEASE; SYSTEM;
DOI
10.1109/OJSP.2023.3242862
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Subject Classification Codes
0808; 0809;
Abstract
Previous studies on the automatic classification of voice disorders have mostly investigated the binary classification task, which aims to distinguish pathological voice from healthy voice. Using multi-class classifiers, however, more fine-grained identification of voice disorders can be achieved, which is more helpful for clinical practitioners. Unfortunately, there is little publicly available training data for many voice disorders, which lowers classification performance on data from unseen speakers. Earlier studies have shown that the use of glottal source features can reduce data redundancy in the detection of laryngeal voice disorders. Another approach to tackling the problems caused by the scarcity of training data is to utilize deep learning models, such as wav2vec 2.0 and HuBERT, that have been pre-trained on larger databases. Since these approaches have not been thoroughly studied in the multi-class classification of voice disorders, they are jointly studied in the present work. In addition, we study a hierarchical classifier, which enables task-wise feature optimization and more efficient utilization of data. The three approaches are compared with traditional mel-frequency cepstral coefficient (MFCC) features and with one-vs-rest and one-vs-one SVM classifiers. The results of a 3-class classification problem between healthy voice and two laryngeal disorders (hyperfunctional dysphonia and vocal fold paresis) indicate that all the studied methods outperform the baselines. The best performance was achieved by using features from wav2vec 2.0 LARGE together with hierarchical classification. The balanced classification accuracy of this system was 62.77% for male speakers and 55.36% for female speakers, corresponding to absolute improvements of 15.76% and 6.95% over the baseline systems for male and female speakers, respectively.
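
To make the pipeline described in the abstract concrete, the following is a minimal sketch of hierarchical three-class classification on top of frozen wav2vec 2.0 embeddings. The Hugging Face model name (facebook/wav2vec2-large), the mean-pooling of hidden states, the RBF-kernel SVMs, and the label convention (0 = healthy, 1 = hyperfunctional dysphonia, 2 = vocal fold paresis) are illustrative assumptions, not the authors' released implementation; only the overall structure (a healthy-vs-pathological stage followed by a disorder-type stage, evaluated with balanced accuracy) follows the abstract.

```python
# Hedged sketch: hierarchical 3-class voice-disorder classification with
# wav2vec 2.0 embeddings and SVMs. Model name, pooling, kernels, and label
# convention are assumptions for illustration, not the paper's exact setup.
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.svm import SVC
from sklearn.metrics import balanced_accuracy_score

# Pre-trained wav2vec 2.0 LARGE used purely as a frozen feature extractor.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-large")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large").eval()

def embed(waveform, sr=16000):
    """Mean-pool the last hidden states of a 1-D 16 kHz waveform into one
    utterance-level embedding vector."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()

# Labels: 0 = healthy, 1 = hyperfunctional dysphonia, 2 = vocal fold paresis.
def fit_hierarchy(X, y):
    # Stage 1: healthy vs. pathological.
    clf1 = SVC(kernel="rbf", class_weight="balanced").fit(X, (y > 0).astype(int))
    # Stage 2: disorder type, trained only on the pathological samples.
    patho = y > 0
    clf2 = SVC(kernel="rbf", class_weight="balanced").fit(X[patho], y[patho])
    return clf1, clf2

def predict_hierarchy(clf1, clf2, X):
    pred = np.zeros(len(X), dtype=int)          # default: healthy
    patho = clf1.predict(X) == 1
    if patho.any():
        pred[patho] = clf2.predict(X[patho])    # assign disorder type
    return pred

# Usage, with X_* as (n_utterances, dim) NumPy arrays of embeddings and y_*
# as integer label arrays, split so that test speakers are unseen in training:
# clf1, clf2 = fit_hierarchy(X_train, y_train)
# print(balanced_accuracy_score(y_test, predict_hierarchy(clf1, clf2, X_test)))
```

Training the second-stage SVM only on pathological utterances is what lets each stage use (and, in the paper's setting, separately optimize) its own features for its own sub-task, which is the motivation given for the hierarchical design.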
Pages: 80-88
Page count: 9