Yiming WANG (1993- ), male, is a master's student at the Faculty of Information Science and Engineering, Ningbo University. His research interests include audio and video information processing and audiovisual speech recognition.
Ken CHEN (1962- ), male, is an associate professor and master's supervisor at the Faculty of Information Science and Engineering, Ningbo University. He has published more than 100 papers in core journals and major international conferences, participated in or led 16 research projects at the national, provincial/ministerial, municipal and university levels, and received 3 related research awards. His research interests include image and video information processing, multimedia communication, and intelligent control.
Abudusalamu AIHAITI (1995- ), male, is a master's student at the Faculty of Information Science and Engineering, Ningbo University. His research interests include machine translation and intelligent speech translation.
Online publication date: 2019-12
Print publication date: 2019-12-20
Yiming WANG, Ken CHEN, Aihaiti ABUDUSALAMU. End-to-end audiovisual speech recognition based on attention fusion of SDBN and BLSTM[J]. Telecommunications Science, 2019, 35(12): 79-89. DOI: 10.11959/j.issn.1000-0801.2019290.
An end-to-end audiovisual speech recognition algorithm is proposed. In this algorithm, a sparse deep belief network (sparse DBN, SDBN) is constructed by introducing a mixed l<sub>1/2</sub>-norm and l<sub>1</sub>-norm penalty into a deep belief network (DBN) with a bottleneck structure, so that sparse bottleneck features can be extracted and the dimensionality of the data reduced. A bidirectional long short-term memory network (bidirectional long short-term memory, BLSTM) then models each feature stream over time, after which an attention mechanism automatically aligns and fuses the lip visual information with the audio auditory information. Finally, the fused audiovisual information is classified by a BLSTM with an attached softmax layer. Experiments show that the algorithm recognizes audiovisual information effectively and achieves good recognition rates and robustness among similar algorithms.
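Two steps of the pipeline above lend themselves to a compact illustration: the mixed l<sub>1/2</sub> + l<sub>1</sub> sparsity penalty applied to hidden activations, and the attention step that aligns visual frames to audio frames before fusion. The following is a minimal numpy sketch, not the paper's implementation: the penalty weights, the dot-product scoring, and the concatenation-based fusion are illustrative assumptions, and the BLSTM encoders that would normally produce `audio_seq` and `visual_seq` are replaced by random toy features.

```python
import numpy as np

def mixed_sparsity_penalty(h, lam_half=0.01, lam_one=0.01, eps=1e-8):
    """Mixed l_{1/2}-norm + l_1-norm penalty on activations h.

    The weights lam_half and lam_one are hypothetical; eps keeps the
    square root differentiable at zero.
    """
    return (lam_half * np.sum(np.sqrt(np.abs(h) + eps))
            + lam_one * np.sum(np.abs(h)))

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(audio_seq, visual_seq):
    """Align visual frames to each audio frame and fuse the two streams.

    For every audio frame, dot-product attention weights the visual
    frames, yielding an aligned visual context vector that is then
    concatenated with the audio frame.
    """
    scores = audio_seq @ visual_seq.T            # (Ta, Tv) similarities
    weights = softmax(scores, axis=1)            # attention over visual frames
    context = weights @ visual_seq               # (Ta, d) aligned visual context
    return np.concatenate([audio_seq, context], axis=1)  # (Ta, 2d)

# Toy example: 6 audio frames and 4 visual frames of 8-dim features,
# standing in for the two BLSTM output sequences.
rng = np.random.default_rng(0)
audio_seq = rng.standard_normal((6, 8))
visual_seq = rng.standard_normal((4, 8))
fused = attention_fuse(audio_seq, visual_seq)
print(fused.shape)  # (6, 16)
```

The different frame rates of the two modalities (audio is typically sampled faster than video) are exactly why such an alignment step is needed before concatenation: the attention weights let each audio frame pick out the visual frames relevant to it.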