A speech enhancement method based on depthwise separable convolution and MambaUNet

MENG Xiangcai; ZHUANG Yunpeng; ZHONG Qiang; OU Shifeng

doi:10.11959/j.issn.1000-0801.2025272

Research and Development | Updated: 2026-01-08

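- Telecommunications Science, 2025, 41(12): 181-190
- Author affiliations:
  1. Department of Intelligent Control, Yantai Vocational College, Yantai, Shandong 264035
  2. Yantai Hongwu Electromechanical Technology Co., Ltd., Yantai, Shandong 264035
  3. School of Physics and Electronic Information, Yantai University, Yantai, Shandong 264005
- Author biographies:
  MENG Xiangcai (1992- ), female, lecturer in the Department of Intelligent Control, Yantai Vocational College. Her main research interest is artificial intelligence technology.
  ZHUANG Yunpeng (1994- ), male, Ph.D., lecturer in the Department of Intelligent Control, Yantai Vocational College. His main research interest is new energy technology.
  ZHONG Qiang (1991- ), male, engineer at Yantai Hongwu Electromechanical Technology Co., Ltd. His main research interests are communications and electronic countermeasures.
  OU Shifeng (1979- ), male, Ph.D., professor in the School of Physics and Electronic Information, Yantai University. His main research interest is speech signal processing.
- Funding: Shandong Provincial Natural Science Foundation Youth Project (ZR2024QB388); Yantai Vocational College School-level Research Project (2024XBYB005)
- CLC number: TN912.3
- Received: 2025-05-30; revised: 2025-10-11; accepted: 2025-10-11; print publication: 2025-12-20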
MENG Xiangcai, ZHUANG Yunpeng, ZHONG Qiang, et al. A speech enhancement method based on depthwise separable convolution and MambaUNet[J]. Telecommunications Science, 2025, 41(12): 181-190. DOI: 10.11959/j.issn.1000-0801.2025272.

Abstract

To improve the robustness and modeling efficiency of speech enhancement in complex noisy environments, a speech enhancement network, DW-MambaUNet, is proposed that combines depthwise separable convolution with a structured state space model. Built on a U-Net architecture, the network uses TF-Mamba modules to model global dependencies along both the time and frequency paths, while depthwise separable convolution (DWConv) strengthens local feature extraction; together these enable multi-scale feature restoration and fine-grained enhancement of speech signals. During spectrum reconstruction, a learnable Sigmoid function and an Arctan2 function are used to optimize the magnitude and phase outputs, respectively, which substantially improves speech quality while keeping the parameter count small. In addition, a dynamic weight adjustment strategy is introduced that combines the smoothed trend of the loss history with feedback from the perceptual evaluation of speech quality (PESQ) score to adaptively balance the terms of the multi-task loss, alleviating the convergence bottleneck caused by fixed weighting. Experimental results on the VoiceBank+DEMAND and TIMIT datasets show that DW-MambaUNet outperforms a range of mainstream speech enhancement models on PESQ, STOI, MOS, and other metrics, with good enhancement performance and generalization ability particularly under low signal-to-noise ratio conditions. Ablation experiments further confirm the contributions of the TF-Mamba module and the DWConv structure to model performance. This study offers a new approach to designing low-complexity, high-performance speech enhancement models and has both theoretical significance and practical value.
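As a rough illustration of two ideas named in the abstract, the sketch below compares the weight count of a standard convolution with that of a depthwise separable one, and shows one plausible form of the magnitude and phase output functions. It is not the authors' implementation: the function names, the `beta * sigmoid(alpha * x)` form of the learnable Sigmoid, the default values of `alpha` and `beta`, and the use of two real-valued outputs as the Arctan2 inputs are all assumptions made here for illustration.

```python
import math


def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k 2-D convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


def learnable_sigmoid(x: float, alpha: float = 1.0, beta: float = 2.0) -> float:
    """Assumed form beta * sigmoid(alpha * x); alpha would be a trained parameter.

    The output lies in (0, beta), so it can act as a bounded magnitude mask.
    """
    return beta / (1.0 + math.exp(-alpha * x))


def phase_from_components(pseudo_real: float, pseudo_imag: float) -> float:
    """Phase in [-pi, pi] computed with atan2 from two real-valued outputs (assumed)."""
    return math.atan2(pseudo_imag, pseudo_real)


# Hypothetical layer size: 64 input channels, 64 output channels, 3 x 3 kernel.
full = standard_conv_params(64, 64, 3)             # 36864 weights
separable = depthwise_separable_params(64, 64, 3)  # 4672 weights
print(f"standard: {full}, depthwise separable: {separable}")
```

For this hypothetical layer the depthwise separable form needs roughly one eighth of the weights, which is the kind of saving that lets a model keep its parameter count small. Using atan2 on two outputs keeps the predicted phase inside its principal range without any explicit clipping.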
