Yiming WANG (1993- ), male, is a master's student at the Faculty of Information Science and Engineering, Ningbo University. His research interests include audio and video information processing and audiovisual speech recognition.
Ken CHEN (1962- ), male, is an associate professor and master's supervisor at the Faculty of Information Science and Engineering, Ningbo University. He has published more than 100 papers in core journals and major international conferences, participated in or led 16 research projects at the national, provincial/ministerial, municipal and university levels, and received 3 related research awards. His research interests include image and video information processing, multimedia communication, and intelligent control.
Abudusalamu AIHAITI (1995- ), male, is a master's student at the Faculty of Information Science and Engineering, Ningbo University. His research interests include machine translation and intelligent speech translation.
Online publication date: 2019-12
Print publication date: 2019-12-20
Yiming WANG, Ken CHEN, Aihaiti ABUDUSALAMU. End-to-end audiovisual speech recognition based on attention fusion of SDBN and BLSTM[J]. Telecommunications Science, 2019, 35(12): 79-89. DOI: 10.11959/j.issn.1000-0801.2019290.
An end-to-end audiovisual speech recognition algorithm is proposed. In this algorithm, a sparse deep belief network (sparse DBN, SDBN) is constructed by introducing a mixed l<sub>1/2</sub>-norm and l<sub>1</sub>-norm penalty into a deep belief network (DBN) with a bottleneck structure, so that sparse bottleneck features can be extracted and the dimensionality of the data reduced. A bidirectional long short-term memory network (bidirectional long short-term memory, BLSTM) then models each feature stream over time, after which an attention mechanism automatically aligns and fuses the lip visual information with the audio auditory information. Finally, the fused audiovisual information is classified by a BLSTM with an attached softmax layer. Experiments show that the algorithm recognizes audiovisual information effectively and achieves good recognition rates and robustness among similar algorithms.
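Two steps of the pipeline above lend themselves to a compact illustration: the mixed l<sub>1/2</sub> + l<sub>1</sub> sparsity penalty applied to hidden activations, and the attention step that aligns visual frames to audio frames before fusion. The following is a minimal numpy sketch, not the paper's implementation: the penalty weights, the dot-product scoring, and the concatenation-based fusion are illustrative assumptions, and the BLSTM encoders that would normally produce `audio_seq` and `visual_seq` are replaced by random toy features.

```python
import numpy as np

def mixed_sparsity_penalty(h, lam_half=0.01, lam_one=0.01, eps=1e-8):
    """Mixed l_{1/2}-norm + l_1-norm penalty on activations h.

    The weights lam_half and lam_one are hypothetical; eps keeps the
    square root differentiable at zero.
    """
    return (lam_half * np.sum(np.sqrt(np.abs(h) + eps))
            + lam_one * np.sum(np.abs(h)))

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(audio_seq, visual_seq):
    """Align visual frames to each audio frame and fuse the two streams.

    For every audio frame, dot-product attention weights the visual
    frames, yielding an aligned visual context vector that is then
    concatenated with the audio frame.
    """
    scores = audio_seq @ visual_seq.T            # (Ta, Tv) similarities
    weights = softmax(scores, axis=1)            # attention over visual frames
    context = weights @ visual_seq               # (Ta, d) aligned visual context
    return np.concatenate([audio_seq, context], axis=1)  # (Ta, 2d)

# Toy example: 6 audio frames and 4 visual frames of 8-dim features,
# standing in for the two BLSTM output sequences.
rng = np.random.default_rng(0)
audio_seq = rng.standard_normal((6, 8))
visual_seq = rng.standard_normal((4, 8))
fused = attention_fuse(audio_seq, visual_seq)
print(fused.shape)  # (6, 16)
```

The different frame rates of the two modalities (audio is typically sampled faster than video) are exactly why such an alignment step is needed before concatenation: the attention weights let each audio frame pick out the visual frames relevant to it.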