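1. Department of Intelligent Control, Yantai Vocational College, Yantai 264035, Shandong, China
2. Yantai Hongwu Electromechanical Technology Co., Ltd., Yantai 264035, Shandong, China
3. School of Physics and Electronic Information, Yantai University, Yantai 264005, Shandong, China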
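MENG Xiangcai (1992- ), female, is a lecturer at the Department of Intelligent Control, Yantai Vocational College. Her main research interest is artificial intelligence technology.
ZHUANG Yunpeng (1994- ), male, Ph.D., is a lecturer at the Department of Intelligent Control, Yantai Vocational College. His main research interest is new energy technology.
ZHONG Qiang (1991- ), male, is an engineer at Yantai Hongwu Electromechanical Technology Co., Ltd. His main research interests are communications and electronic countermeasures.
OU Shifeng (1979- ), male, Ph.D., is a professor at the School of Physics and Electronic Information, Yantai University. His main research interest is speech signal processing.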
Received: 2025-05-30
Revised: 2025-10-11
Accepted: 2025-10-11
Published in print: 2025-12-20
MENG Xiangcai, ZHUANG Yunpeng, ZHONG Qiang, et al. A speech enhancement method based on depthwise separable convolution and MambaUNet[J]. Telecommunications Science, 2025, 41(12): 181-190. DOI: 10.11959/j.issn.1000-0801.2025272.
To improve the robustness and modeling efficiency of speech enhancement in complex noisy environments, a speech enhancement network, DW-MambaUNet, was proposed that integrates depthwise separable convolution with a structured state space model. Built on a U-Net architecture, the network incorporates a TF-Mamba module to model global dependencies along both the time and frequency paths, while depthwise separable convolution (DWConv) strengthens local feature extraction. Together these enable multi-scale feature restoration and fine-grained enhancement of speech signals. During spectrogram reconstruction, a learnable Sigmoid function and an Arctan2 function are used to optimize the magnitude and phase outputs respectively, significantly improving speech quality while keeping the parameter count small. In addition, a dynamic weight adjustment strategy was introduced that combines the smoothed trend of the loss history with feedback from the perceptual evaluation of speech quality (PESQ) to adaptively balance the terms of the multi-task loss function, alleviating the convergence bottleneck caused by fixed weighting. Experimental results on the VoiceBank+DEMAND and TIMIT datasets show that the proposed DW-MambaUNet outperforms a range of mainstream speech enhancement models on PESQ, STOI, MOS, and other metrics, with good enhancement performance and generalization ability, particularly under low signal-to-noise ratio conditions. Ablation studies further confirm the contribution of the TF-Mamba module and the DWConv structure to model performance. This study offers a new approach to designing low-complexity, high-performance speech enhancement models, with both theoretical significance and practical value.
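The parameter saving that motivates depthwise separable convolution can be illustrated with a short calculation. This is a minimal sketch of the general technique, not the paper's implementation: the function names and the 64-channel, 3 x 3 layer size are assumptions chosen only for illustration, and bias terms are omitted.

```python
def standard_conv_params(c_in, c_out, k):
    """Weight count of a standard k x k 2-D convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weight count of a depthwise separable convolution (bias omitted).

    One k x k filter per input channel (depthwise step), followed by a
    1 x 1 convolution that mixes channels (pointwise step).
    """
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer size, not taken from the paper.
c_in, c_out, k = 64, 64, 3
std = standard_conv_params(c_in, c_out, k)
dws = depthwise_separable_params(c_in, c_out, k)
ratio = dws / std  # equals 1 / c_out + 1 / k**2

print(f"standard: {std}, depthwise separable: {dws}, ratio: {ratio:.3f}")
```

For this assumed layer the separable form needs roughly one eighth of the weights, which is consistent with the abstract's emphasis on keeping the parameter count small.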