STFF-SM: Steganalysis Model Based on Spatial and Temporal Feature Fusion for Speech Streams

Cited by: 4

Authors:

Tian, Hui [1,2]

Qiu, Yiqin [1,2]

Mazurczyk, Wojciech [3]

Li, Haizhou [4,5]

Qian, Zhenxing [6]

Affiliations:

[1] Natl Huaqiao Univ, Coll Comp Sci & Technol, Xiamen 361021, Peoples R China

[2] Xiamen Key Lab Data Secur & Blockchain Technol, Xiamen 361021, Peoples R China

[3] Warsaw Univ Technol, Fac Elect & Informat Technol, Inst Comp Sci, PL-00665 Warsaw, Poland

[4] Chinese Univ Hong Kong, Sch Data Sci, Shenzhen 518172, Peoples R China

[5] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore 119077, Singapore

[6] Fudan Univ, Sch Comp Sci, Shanghai 200433, Peoples R China

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2023 / Vol. 31

Funding:

National Natural Science Foundation of China;

Keywords:

Delays; Feature extraction; Steganography; Quantization (signal); Distortion; Speech coding; Resistance; Steganalysis; steganography; voice over Internet protocol; speech streams; deep neural networks; pitch delays; STEGANOGRAPHY; SCHEME; VOICE;

DOI:

10.1109/TASLP.2022.3224295

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

The real-time detection of speech steganography in Voice-over-Internet-Protocol (VoIP) scenarios remains an open problem, as it requires steganalysis methods to perform well on low-intensity embeddings and short-sample inputs while also providing rapid detection results. To address these challenges, this paper presents a novel steganalysis model based on spatial and temporal feature fusion (STFF-SM). Unlike existing methods, we take both the integer and fractional pitch delays as input, and design a subframe-stitch module to integrate subframe-wise integer delays and frame-wise fractional pitch delays. Further, we design a spatial fusion module based on pre-activation residual convolution to extract pitch spatial features and gradually increase their dimensions, so that finer steganographic distortions can be discovered and detection improved; a Group-Squeeze-Weighting block is introduced to alleviate the information loss incurred when increasing the feature dimension. In addition, we design a temporal fusion module to extract pitch temporal features using a stacked LSTM, where a Gated Feed-Forward Network is introduced to learn the interaction between different feature maps while suppressing features that are not useful for detection. We evaluate STFF-SM through comprehensive experiments and comparisons with state-of-the-art solutions. The experimental results demonstrate that STFF-SM meets the needs of real-time detection of speech steganography in VoIP streams and outperforms existing methods in detection performance, especially at low embedding strengths and short window sizes.
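The abstract names pre-activation residual convolution as the basis of the spatial fusion module but gives no layer details. As a rough illustration only, the generic pre-activation ordering (normalize, activate, convolve, twice, then add the identity shortcut) can be sketched in plain Python on a 1-D sequence. Everything here is an assumption made for illustration: the function names, the per-sequence standardization used in place of batch normalization, the kernel values, and the sample delay values. It is not the STFF-SM architecture.

```python
import math

def standardize(x, eps=1e-5):
    """Stand-in for batch normalization (assumption): zero mean, unit variance over the sequence."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]

def conv1d_same(x, kernel):
    """1-D cross-correlation with zero padding; the kernel length must be odd
    so that the output has the same length as the input."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]

def preact_residual_block(x, kernel1, kernel2):
    """Pre-activation ordering: (norm -> ReLU -> conv) twice, then add the identity shortcut."""
    out = conv1d_same(relu(standardize(x)), kernel1)
    out = conv1d_same(relu(standardize(out)), kernel2)
    return [a + b for a, b in zip(x, out)]

# Hypothetical integer pitch delays for eight consecutive subframes.
delays = [52.0, 53.0, 53.0, 51.0, 60.0, 61.0, 59.0, 60.0]
features = preact_residual_block(delays, [0.25, 0.5, 0.25], [0.1, 0.2, 0.1])
```

Because the shortcut adds the unmodified input back, a residual branch whose last kernel is all zeros leaves the sequence unchanged, which is the property that makes such blocks easy to stack.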


Pages: 277-289

Page count: 13
