A speech emotion recognition algorithm based on DN-ResNet11

YING Na; ZOU Yujian; YANG Xueying; SUN Wensheng; YE Xueyi; JIANG Yinhe

doi:10.11959/j.issn.1000-0801.2025042

Research and Development | Updated: 2025-06-28

- A speech emotion recognition algorithm based on DN-ResNet11
- Telecommunications Science, 2025, Vol. 41, No. 6, pp. 139-153
- Author affiliation:
  
  School of Communication Engineering, Hangzhou Dianzi University, Hangzhou, Zhejiang 310018, China
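- Author biographies:
  
  YING Na (1978- ), female, Ph.D., is an associate professor and master's supervisor at the School of Communication Engineering, Hangzhou Dianzi University. Her main research interests include intelligent signal processing and its applications.
  ZOU Yujian (1998- ), male, is a master's student at the School of Communication Engineering, Hangzhou Dianzi University. His main research interest is multimodal emotion recognition.
  YANG Xueying (1997- ), female, is a master's student at the School of Communication Engineering, Hangzhou Dianzi University. Her main research interest is speech emotion recognition.
  SUN Wensheng (1966- ), male, works at the School of Communication Engineering, Hangzhou Dianzi University. His main research interest is network communication.
  YE Xueyi (1973- ), male, Ph.D., is an associate professor and master's supervisor at the School of Communication Engineering, Hangzhou Dianzi University. His main research interests include image processing, pattern recognition, and information hiding.
  JIANG Yinhe (1999- ), male, is a master's student at the School of Communication Engineering, Hangzhou Dianzi University. His main research interest is multimodal emotion recognition.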
- Funding:
  
  Science and Technology Program of Zhejiang Province (LGF21F010003); The “Leading Goose” Technologies Research and Development Program of Zhejiang Province (2022C03065)
- DOI: 10.11959/j.issn.1000-0801.2025042
  Chinese Library Classification (CLC) number: TP18
- Received: 2024-09-27; revised: 2024-12-05; published in print: 2025-06-20
- Citation: YING Na, ZOU Yujian, YANG Xueying, et al. A speech emotion recognition algorithm based on DN-ResNet11[J]. Telecommunications Science, 2025, 41(06): 139-153. DOI: 10.11959/j.issn.1000-0801.2025042.

Abstract

To address the high complexity of network training and improve speech emotion feature extraction, a dual-branch feature extraction model based on a dual nested residual network (DN-ResNet11) and a channel attention residual network (CRNet) was proposed. Firstly, the low-complexity DN-ResNet11 was designed to efficiently extract fused emotional features from spectrograms, enhancing emotion recognition accuracy. Secondly, multiscale guided filtering and the local binary pattern (LBP) algorithm were combined to enhance spectrogram details. Finally, the two sets of features were fused for emotion classification, forming a dual-branch weighted fusion model (weighted fusion model based on dual nested residual and channel residual network, WFDN_CRNet), further enhancing emotional representation ability. Experiments on the CASIA, EMO-DB, and IEMOCAP speech emotion datasets show emotion recognition rates of 94.58%, 85.59%, and 65.72%, respectively. The proposed method not only achieves higher emotion recognition rates than baseline models such as ResNet18 but also significantly reduces computational cost, demonstrating the model's effectiveness.
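The abstract names two generic building blocks, LBP texture coding and weighted fusion of two branches, without giving their exact form. The sketch below is only an illustration under stated assumptions: it uses the basic 3x3, 8-neighbour LBP operator (the paper's LBP variant and its combination with multiscale guided filtering are not specified here), and it fuses two per-class score vectors with a single scalar weight. The function names `lbp_3x3` and `weighted_fusion` and the parameter `alpha` are hypothetical and not taken from the paper.

```python
from typing import List, Sequence

Matrix = List[List[float]]

# Offsets of the 8 neighbours, clockwise starting from the top-left pixel.
_NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_3x3(image: Matrix) -> List[List[int]]:
    """Basic 8-neighbour LBP code for every interior pixel of a 2-D array.

    Border pixels are skipped, so the output has shape (H-2) x (W-2).
    """
    height, width = len(image), len(image[0])
    codes = []
    for r in range(1, height - 1):
        row = []
        for c in range(1, width - 1):
            centre = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(_NEIGHBOURS):
                # A neighbour sets its bit when it is >= the centre value.
                if image[r + dr][c + dc] >= centre:
                    code |= 1 << bit
            row.append(code)
        codes.append(row)
    return codes


def weighted_fusion(p_a: Sequence[float], p_b: Sequence[float], alpha: float) -> List[float]:
    """Convex combination of two per-class score vectors.

    alpha is a hypothetical branch weight; the paper's weighting scheme may differ.
    """
    if len(p_a) != len(p_b):
        raise ValueError("branch outputs must have the same number of classes")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(p_a, p_b)]


if __name__ == "__main__":
    # Toy 3x3 "spectrogram" patch: only the centre pixel gets an LBP code.
    patch = [
        [5, 9, 1],
        [4, 6, 7],
        [2, 8, 3],
    ]
    print(lbp_3x3(patch))

    # Toy class scores from two branches; pick the class with the highest fused score.
    fused = weighted_fusion([0.7, 0.2, 0.1], [0.4, 0.5, 0.1], alpha=0.6)
    print(fused.index(max(fused)))
```

With a convex weight, the fused scores still sum to one when both inputs do, so they can be read as class probabilities. In a trained model the weight would more plausibly be tuned or learned than fixed by hand.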

