Graph based feature selection investigating boundary region of rough set for language identification

Cited by: 17
Authors
Yasmin, Ghazaala [1]
Das, Asit Kumar [2]
Nayak, Janmenjoy [3]
Pelusi, Danilo [4]
Ding, Weiping [5]
Affiliations
[1] St Thomas Coll Engn & Technol, Kolkata, W Bengal, India
[2] Indian Inst Engn Sci & Technol, Sibpur, Howrah, India
[3] Aditya Inst Technol & Management AITAM, Tekkali, India
[4] Univ Teramo, Dept Commun Sci, Teramo, Italy
[5] Nantong Univ, Sch Informat Sci & Technol, Nantong 226019, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Language identification; Feature selection; Relative indiscernibility relation; Attribute dependency; Boundary region exploration; FEATURE-EXTRACTION; COMMUNITY STRUCTURE; NETWORK
DOI
10.1016/j.eswa.2020.113575
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Spoken language is a rich source of information, and many countries contain regions distinguished by the languages spoken there. The challenge is to automate spoken language recognition through machine learning. The proposed language identification system extracts various features from speech in different languages and constructs a complete weighted graph, with the extracted features as nodes and the similarity between features as edge weights. Similarity values are computed using the positive region and boundary region concepts of rough set theory, and a graph-based feature selection algorithm is devised to select a minimal subset of features relevant to language identification. Investigating the boundary region together with the positive region is observed to extract more valuable information, which helps in selecting more relevant features. The complete weighted graph is made sparse using a Gini index based sparsity measure, so that it retains only edges whose terminal nodes are highly similar. Next, a maximal spanning tree of the graph is generated using Prim's algorithm; this tree is a basic structure that provides the maximal similarity among the nodes of the graph. Finally, a score is computed for each node from the weights of the edges in the tree, and the node with the highest score is selected and removed from the spanning tree. This selection and removal of nodes continues until the graph becomes null. The resulting set of selected nodes is taken as the important feature subset of the audio speech used for language identification. Experimental results show the effectiveness of the proposed rough set theory based feature selection method and demonstrate the usefulness of investigating the boundary region of rough sets. © 2020 Elsevier Ltd. All rights reserved.
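The last stages described in the abstract (maximal spanning tree via Prim's algorithm, then repeated scoring and removal of nodes) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions, because the abstract does not specify them: the similarity matrix `S` is toy data rather than rough set similarities, the Gini index sparsification step is skipped, a node's score is taken to be the sum of its incident tree edge weights, and neighbours left without edges after a removal are discarded as already represented by the selected node. The function names `maximal_spanning_tree` and `select_features` are hypothetical.

```python
import heapq


def maximal_spanning_tree(weights):
    """Prim's algorithm on a dense symmetric matrix, keeping the heaviest edges."""
    n = len(weights)
    in_tree = [False] * n
    in_tree[0] = True
    # heapq is a min-heap, so weights are negated to pop the largest first.
    heap = [(-weights[0][v], 0, v) for v in range(1, n)]
    heapq.heapify(heap)
    edges = []
    while heap and len(edges) < n - 1:
        neg_w, u, v = heapq.heappop(heap)
        if in_tree[v]:
            continue  # stale entry: v was reached through a heavier edge
        in_tree[v] = True
        edges.append((u, v, -neg_w))
        for x in range(n):
            if not in_tree[x]:
                heapq.heappush(heap, (-weights[v][x], v, x))
    return edges


def select_features(weights):
    """Repeatedly pick the highest-scoring tree node until no nodes remain."""
    adj = {i: {} for i in range(len(weights))}
    for u, v, w in maximal_spanning_tree(weights):
        adj[u][v] = w
        adj[v][u] = w
    selected = []
    while adj:
        # Assumed score: sum of incident tree edge weights; ties go to the
        # smaller node index.
        best = max(adj, key=lambda node: (sum(adj[node].values()), -node))
        selected.append(best)
        for nb in list(adj.pop(best)):
            del adj[nb][best]
            # Assumption: a neighbour with no remaining edges is dropped as
            # redundant with the node just selected.
            if not adj[nb]:
                del adj[nb]
    return selected


# Toy similarity matrix for five hypothetical features.
S = [
    [0.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 0.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 0.0, 0.3, 0.2],
    [0.1, 0.2, 0.3, 0.0, 0.6],
    [0.2, 0.1, 0.2, 0.6, 0.0],
]
tree_edges = maximal_spanning_tree(S)
chosen = select_features(S)
print(tree_edges)
print(chosen)
```

On this toy matrix the tree links features 0-1, 0-2, 2-3 and 3-4, and the selection keeps features 0 and 3, one from each group of mutually similar features. Under a different scoring or removal rule the selected subset could differ.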
Pages: 17