School of Information Science and Engineering, Ningbo University, Ningbo 315211, Zhejiang, China
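WANG Hongtao (2000- ), male, master's student at the School of Information Science and Engineering, Ningbo University. His main research interest is speech enhancement.
LU Zhihua (1983- ), male, associate professor and master's supervisor at the School of Information Science and Engineering, Ningbo University. His main research interests include signal processing and real-time tracking of multiple moving targets.
YE Qingwei (1970- ), male, professor and master's supervisor at the School of Information Science and Engineering, Ningbo University. His main research interests include signal detection and optimization search.
ZHANG Lianjun (1980- ), male, laboratory technician at the School of Information Science and Engineering, Ningbo University. His main research interests include network technology experiments, communication, and management.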
Received: 2024-07-19
Revised: 2024-09-09
Published in print: 2024-10-20
WANG Hongtao, LU Zhihua, YE Qingwei, et al. Patch-based domain adversarial training for speech enhancement[J]. Telecommunications Science, 2024, 40(10): 52-60. DOI: 10.11959/j.issn.1000-0801.2024225.
Deep learning-based speech enhancement methods often suffer from a mismatch between the distributions of the training data and the test data, which can involve differences in speakers, speech content, noise types, and signal-to-noise ratios. Severe data mismatch significantly degrades speech enhancement performance. To address this issue, a speech enhancement method based on Patch domain adversarial training is proposed. Building on previous domain adversarial training methods for speech enhancement, the method uses implicit modeling in the domain discriminator so that an entire utterance is divided into multiple independent patches, each of which is then discriminated separately. This enables adaptive learning of the training data, reduces the distribution difference between the training and test data, and improves the model's enhancement capability on the test data. Experimental results show that the method outperforms previous methods under various degrees of data mismatch and, as an adversarial training approach, maintains good stability.
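The abstract does not specify the discriminator architecture or its loss, so the following is only a minimal sketch of what patch-wise domain discrimination could look like. It assumes non-overlapping patches, a mean-pooled logit per patch, a binary cross-entropy domain loss, and gradient reversal as used in standard domain adversarial training. All function names, `patch_len`, and `lam` are hypothetical and are not taken from the paper.

```python
import math


def split_into_patches(frames, patch_len):
    """Split frame-level values into non-overlapping patches (trailing remainder dropped)."""
    n_patches = len(frames) // patch_len
    return [frames[i * patch_len:(i + 1) * patch_len] for i in range(n_patches)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def patch_domain_loss(frame_logits, domain_label, patch_len):
    """Return (mean BCE over patches, per-patch gradient of the loss w.r.t. each patch logit).

    frame_logits: hypothetical per-frame outputs of a domain discriminator.
    domain_label: 1 for one domain (e.g. training data), 0 for the other (e.g. test data).
    """
    patches = split_into_patches(frame_logits, patch_len)
    if not patches:
        raise ValueError("utterance is shorter than one patch")
    eps = 1e-12  # guards against log(0)
    losses, grads = [], []
    for patch in patches:
        z = sum(patch) / len(patch)  # assumption: one mean-pooled logit per patch
        p = sigmoid(z)
        loss = -(domain_label * math.log(p + eps)
                 + (1 - domain_label) * math.log(1 - p + eps))
        losses.append(loss)
        grads.append(p - domain_label)  # derivative of BCE w.r.t. the patch logit
    return sum(losses) / len(losses), grads


def reverse_gradients(grads, lam):
    """Gradient reversal: the feature extractor receives the negated, scaled gradient."""
    return [-lam * g for g in grads]


# Eight frames with neutral logits, split into two patches of four frames each.
loss, grads = patch_domain_loss([0.0] * 8, domain_label=1, patch_len=4)
reversed_grads = reverse_gradients(grads, lam=0.1)
```

Under these assumptions, each patch contributes its own domain decision, so an utterance yields several training signals for the discriminator instead of a single utterance-level one.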