Research development and forecast of automatic speech recognition technologies

Haikun WANG; Jia PAN; Cong LIU

doi:10.11959/j.issn.1000-0801.2018095

Viewpoint Focus | Updated: 2024-06-05

- Research development and forecast of automatic speech recognition technologies
- Telecommunications Science, 2018, vol. 34, no. 2, pp. 1-11
- Author biographies:
  
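  Haikun WANG (b. 1984), male, is vice dean of the AI Research Institute of iFLYTEK Co., Ltd. He led the development of iFLYTEK's embedded recognition system and far-field recognition system and is the chief technical lead for the Dingdong speaker. His main research interests are speech recognition, microphone-array speech signal processing, echo cancellation, and voice interaction. He holds more than 40 invention patents, and several of his research results have received awards at the provincial level or above.
  Jia PAN (b. 1985), male, is research lead of the speech recognition group at the AI Research Institute of iFLYTEK Co., Ltd. and a member of the iFLYTEK academic committee. His main research interest is speech recognition. He has deep expertise in deep neural networks and is a principal contributor to the development of iFLYTEK's speech recognition system.
  Cong LIU (b. 1984), male, postdoctoral researcher, is vice dean of the AI Research Institute of iFLYTEK Co., Ltd. and has long worked on speech recognition, artificial intelligence, and related fields. Since the end of 2014 he has been fully responsible for iFLYTEK's research on face recognition, medical image recognition, video surveillance, and related directions, and the results have been applied successfully in several internal products. He received the first prize of the Beijing Science and Technology Award in 2014, has published more than 10 papers, and holds more than 10 patents.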
- DOI: 10.11959/j.issn.1000-0801.2018095
  CLC number: TP393
- Published online: 2018-02
  
  Print publication date: 2018-02-20
Haikun WANG, Jia PAN, Cong LIU. Research development and forecast of automatic speech recognition technologies[J]. Telecommunications Science, 2018, 34(2): 1-11. DOI: 10.11959/j.issn.1000-0801.2018095.

Abstract

Automatic speech recognition (ASR) aims to let machines "understand" human speech by converting it into readable text. It is a key technology for human-machine interaction and has long been an active research topic. In recent years, driven by the application of deep neural networks, the use of massive data, and the spread of cloud computing, ASR has progressed rapidly and crossed the threshold of practical use in many industries. A growing number of speech products have entered people's daily lives, with Apple's Siri, Amazon's Alexa, the iFLYTEK voice input method, and the Dingdong smart speaker as typical examples. This paper reviews the development of ASR and the key breakthrough technologies of recent years, and gives an outlook on future trends in the field.
