STFF-SM: Steganalysis Model Based on Spatial and Temporal Feature Fusion for Speech Streams

Cited by: 4
Authors
Tian, Hui [1, 2]
Qiu, Yiqin [1, 2]
Mazurczyk, Wojciech [3]
Li, Haizhou [4, 5]
Qian, Zhenxing [6]
Affiliations
[1] Natl Huaqiao Univ, Coll Comp Sci & Technol, Xiamen 361021, Peoples R China
[2] Xiamen Key Lab Data Secur & Blockchain Technol, Xiamen 361021, Peoples R China
[3] Warsaw Univ Technol, Fac Elect & Informat Technol, Inst Comp Sci, PL-00665 Warsaw, Poland
[4] Chinese Univ Hong Kong, Sch Data Sci, Shenzhen 518172, Peoples R China
[5] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore 119077, Singapore
[6] Fudan Univ, Sch Comp Sci, Shanghai 200433, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Delays; Feature extraction; Steganography; Quantization (signal); Distortion; Speech coding; Resistance; Steganalysis; steganography; voice over Internet protocol; speech streams; deep neural networks; pitch delays; STEGANOGRAPHY; SCHEME; VOICE;
DOI
10.1109/TASLP.2022.3224295
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
The real-time detection of speech steganography in Voice-over-Internet-Protocol (VoIP) scenarios remains an open problem, as it requires steganalysis methods to perform well on low-intensity embeddings and short-sample inputs and to deliver detection results rapidly. To address these challenges, this paper presents a novel steganalysis model based on spatial and temporal feature fusion (STFF-SM). Unlike existing methods, we take both the integer and fractional pitch delays as input, and design a subframe-stitch module to integrate subframe-wise integer delays and frame-wise fractional pitch delays. Further, we design a spatial fusion module based on pre-activation residual convolution that extracts pitch spatial features and gradually increases their dimensions to expose finer steganographic distortions and thereby improve detection; a Group-Squeeze-Weighting block is introduced to alleviate the information loss incurred as the feature dimension grows. In addition, we design a temporal fusion module that extracts pitch temporal features using a stacked LSTM, in which a Gated Feed-Forward Network learns the interaction between different feature maps while suppressing features that are not useful for detection. We evaluated STFF-SM through comprehensive experiments and comparisons with state-of-the-art solutions. The experimental results demonstrate that STFF-SM meets the needs of real-time detection of speech steganography in VoIP streams and outperforms existing methods in detection performance, especially at low embedding strengths and short window sizes.
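As a reading aid for the term "pre-activation residual convolution" used in the abstract, the following minimal sketch shows only the ordering the term refers to: the nonlinearity is applied before the convolution, and the unmodified input is added back through the shortcut. It is not the authors' spatial fusion module. The function names, the 1-D setting, the single convolution, the kernel values, and the omission of normalization are all assumptions made for illustration.

```python
from typing import List


def relu(xs: List[float]) -> List[float]:
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in xs]


def conv1d_same(xs: List[float], kernel: List[float]) -> List[float]:
    """1-D cross-correlation with zero padding; output length equals input length."""
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    half = len(kernel) // 2
    padded = [0.0] * half + list(xs) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(xs))
    ]


def preact_residual_unit(xs: List[float], kernel: List[float]) -> List[float]:
    """Hypothetical pre-activation residual unit (normalization omitted).

    Activation comes first, then the convolution; the shortcut adds the
    untouched input back, so an all-zero kernel yields the identity mapping.
    """
    residual = conv1d_same(relu(xs), kernel)
    return [x + r for x, r in zip(xs, residual)]


if __name__ == "__main__":
    # Illustrative values only; they are not pitch delays from any dataset.
    print(preact_residual_unit([1.0, -2.0, 3.0], [0.0, 1.0, 0.0]))
```

Because the shortcut path carries the input unchanged, the unit reduces to the identity when the convolution contributes nothing, which is the property that makes stacking many such units practical.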
Pages: 277-289
Page count: 13