[1] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in neural information processing systems. 2017.
[2] DAVIS K H, BIDDULPH R, BALASHEK S. Automatic recognition of spoken digits[J]. The Journal of the Acoustical Society of America, 1952, 24(6): 637-642.
[3] LEE K F. Automatic speech recognition: The development of the SPHINX system: volume 62[M]. Springer Science & Business Media, 1988.
[4] BERNDT D J, CLIFFORD J. Using dynamic time warping to find patterns in time series[C]//Knowledge discovery in databases workshop: volume 10. Seattle, WA, USA, 1994: 359-370.
[5] SAON G, CHIEN J T. Large-vocabulary continuous speech recognition systems: A look at some recent advances[J]. IEEE signal processing magazine, 2012, 29(6): 18-33.
[6] LOWERRE B T. The harpy speech recognition system[M]. Carnegie Mellon University, 1976.
[7] VETTERLI M. Filter banks allowing perfect reconstruction[J]. Signal processing, 1986, 10(3): 219-244.
[8] RABINER L, JUANG B H. Fundamentals of speech recognition[M]. Prentice-Hall, Inc., 1993.
[9] PAULS A, KLEIN D. Faster and smaller n-gram language models[C]//Association for Computational Linguistics. 2011: 258-267.
[10] HINTON G E, OSINDERO S, TEH Y W. A fast learning algorithm for deep belief nets[J]. Neural computation, 2006, 18(7): 1527-1554.
[11] YU D, DENG L. Deep learning and its applications to signal and information processing [exploratory dsp][J]. IEEE signal processing magazine, 2010, 28(1): 145-154.
[12] POVEY D, GHOSHAL A, BOULIANNE G, et al. The Kaldi speech recognition toolkit[C]//IEEE Automatic Speech Recognition and Understanding Workshop. 2011.
[13] MOHRI M, PEREIRA F, RILEY M. Weighted finite-state transducers in speech recognition[J]. Computer Speech & Language, 2002, 16(1): 69-88.
[14] GRAVES A, FERNÁNDEZ S, GOMEZ F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks[C]//Proceedings of the international conference on Machine learning. 2006: 369-376.
[15] GRAVES A, JAITLY N. Towards end-to-end speech recognition with recurrent neural networks[C]//Proceedings of the international conference on machine learning. 2014: 1764-1772.
[16] GRAVES A. Sequence transduction with recurrent neural networks[C]//Proceedings of the international conference on machine learning. 2012.
[17] GRAVES A, MOHAMED A R, HINTON G. Speech recognition with deep recurrent neural networks[C]//IEEE international conference on acoustics, speech and signal processing. IEEE, 2013.
[18] CHOROWSKI J K, BAHDANAU D, SERDYUK D, et al. Attention-based models for speech recognition[C]//Advances in neural information processing systems. 2015.
[19] CHAN W, JAITLY N, LE Q, et al. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition[C]//IEEE international conference on acoustics, speech and signal processing. IEEE, 2016.
[20] RAO K, SAK H, PRABHAVALKAR R. Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-Transducer[C]//IEEE Automatic Speech Recognition and Understanding Workshop. IEEE, 2017.
[21] WAIBEL A, HANAZAWA T, HINTON G, et al. Phoneme recognition using time-delay neural networks[J]. IEEE/ACM transactions on acoustics, speech, and signal processing, 1989, 37(3): 328-339.
[22] LECUN Y, BOSER B, DENKER J, et al. Handwritten digit recognition with a back-propagation network[C]//Advances in neural information processing systems. 1989.
[23] ABDEL-HAMID O, MOHAMED A R, JIANG H, et al. Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition[C]//IEEE international conference on Acoustics, speech and signal processing. IEEE, 2012.
[24] ABDEL-HAMID O, MOHAMED A R, JIANG H, et al. Convolutional neural networks for speech recognition[J]. IEEE/ACM transactions on audio, speech, and language processing, 2014, 22(10): 1533-1545.
[25] ZHANG Y, PEZESHKI M, BRAKEL P, et al. Towards end-to-end speech recognition with deep convolutional neural networks[C]//Proc. Interspeech. 2016.
[26] HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 7132-7141.
[27] HAN W, ZHANG Z, ZHANG Y, et al. ContextNet: Improving convolutional neural networks for automatic speech recognition with global context[C]//Proc. Interspeech. 2020.
[28] MEDSKER L R, JAIN L. Recurrent neural networks[J]. Design and Applications, 2001, 5: 64-67.
[29] SAK H, SENIOR A, BEAUFAYS F. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition[J]. arXiv preprint arXiv:1402.1128, 2014.
[30] ZEYER A, DOETSCH P, VOIGTLAENDER P, et al. A comprehensive study of deep bidirectional LSTM RNNs for acoustic modeling in speech recognition[C]//IEEE international conference on acoustics, speech and signal processing. IEEE, 2017.
[31] RAVANELLI M, BRAKEL P, OMOLOGO M, et al. Improving speech recognition by revising gated recurrent units[C]//Proc. Interspeech. 2017.
[32] DONG L, XU S, XU B. Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2018.
[33] GULATI A, QIN J, CHIU C C, et al. Conformer: Convolution-augmented transformer for speech recognition[C]//Proc. Interspeech. 2020.
[34] LI S, XU M, ZHANG X L. Efficient conformer-based speech recognition with linear attention[C]//Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. 2021.
[35] SAON G, SOLTAU H, NAHAMOO D, et al. Speaker adaptation of neural network acoustic models using i-vectors[C]//IEEE Automatic Speech Recognition and Understanding Workshop. 2013.
[36] LEGGETTER C J, WOODLAND P C. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models[J]. Computer speech & language, 1995, 9(2): 171-185.
[37] HUANG C, CHANG E, ZHOU J, et al. Accent modeling based on pronunciation dictionary adaptation for large vocabulary Mandarin speech recognition[C]//Proc. Interspeech. 2000.
[38] NALLASAMY U, METZE F, SCHULTZ T. Active learning for accent adaptation in automatic speech recognition[C]//IEEE Spoken Language Technology Workshop. IEEE, 2012.
[39] MIRSAMADI S, HANSEN J H. A study on deep neural network acoustic model adaptation for robust far-field speech recognition[C]//International Speech Communication Association. 2015.
[40] COHEN J, KAMM T, ANDREOU A G. Vocal tract normalization in speech recognition: Compensating for systematic speaker variability[J]. The Journal of the Acoustical Society of America, 1995, 97(5): 3246-3247.
[41] GALES M J. Maximum likelihood linear transformations for HMM-based speech recognition[J]. Computer speech & language, 1998, 12(2): 75-98.
[42] NETO J, ALMEIDA L, HOCHBERG M, et al. Speaker-adaptation for hybrid HMM-ANN continuous speech recognition system[C]//International Speech Communication Association. 1995.
[43] LIAO H. Speaker adaptation of context dependent deep neural networks[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013.
[44] DELCROIX M, WATANABE S, OGAWA A, et al. Auxiliary feature based adaptation of end-to-end ASR systems[C]//Proc. Interspeech. 2018.
[45] MENG Z, LI J, CHEN Z, et al. Speaker-invariant training via adversarial learning[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2018.
[46] MENG Z, LI J, GONG Y. Adversarial speaker adaptation[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2019.
[47] GUPTA V, KENNY P, OUELLET P, et al. I-vector-based speaker adaptation of deep neural networks for French broadcast audio transcription[C]//IEEE international conference on acoustics, speech and signal processing. IEEE, 2014.
[48] CUCU H, BESACIER L, BURILEANU C, et al. ASR domain adaptation methods for low-resourced languages: Application to Romanian language[C]//The European Signal Processing Conference. 2012.
[49] LI K, XU H, WANG Y, et al. Recurrent neural network language model adaptation for conversational speech recognition[C]//Proc. Interspeech. 2018.
[50] MANI A, PALASKAR S, MERIPO N V, et al. ASR error correction and domain adaptation using machine translation[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2020.
[51] SHAN C, WENG C, WANG G, et al. Component fusion: Learning replaceable language model component for end-to-end speech recognition system[C]//IEEE international conference on Acoustics, speech and signal processing. IEEE, 2019.
[52] MENG Z, GAUR Y, KANDA N, et al. Internal language model adaptation with text-only data for end-to-end speech recognition[J]. arXiv preprint arXiv:2110.05354, 2021.
[53] PYLKKÖNEN J, UKKONEN A, KILPIKOSKI J, et al. Fast text-only domain adaptation of RNN-Transducer prediction network[C]//Proc. Interspeech. 2021.
[54] LOGAN B. Mel frequency cepstral coefficients for music modeling[C]//International Symposium on Music Information Retrieval. Citeseer, 2000.
[55] SCHNEIDER S, BAEVSKI A, COLLOBERT R, et al. Wav2vec: Unsupervised pre-training for speech recognition[C]//Proc. Interspeech. 2019.
[56] BAEVSKI A, ZHOU Y, MOHAMED A, et al. Wav2vec 2.0: A framework for self-supervised learning of speech representations[C]//Advances in neural information processing systems. 2020.
[57] HSU W N, BOLTE B, TSAI Y H H, et al. Hubert: Self-supervised speech representation learning by masked prediction of hidden units[J]. IEEE/ACM transactions on audio, speech, and language processing, 2021, 29: 3451-3460.
[58] KO T, PEDDINTI V, POVEY D, et al. Audio augmentation for speech recognition[C]//International Speech Communication Association. 2015.
[59] KO T, PEDDINTI V, POVEY D, et al. A study on data augmentation of reverberant speech for robust speech recognition[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2017.
[60] PARK D S, CHAN W, ZHANG Y, et al. SpecAugment: A simple data augmentation method for automatic speech recognition[C]//Proc. Interspeech. 2019.
[61] BAHDANAU D, CHO K, BENGIO Y. Neural machine translation by jointly learning to align and translate[C]//International Conference on Learning Representations. 2015.
[62] SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[C]//Advances in neural information processing systems. 2014.
[63] SENNRICH R, HADDOW B, BIRCH A. Neural machine translation of rare words with subword units[C]//Association for Computational Linguistics. 2016.
[64] SCHUSTER M, NAKAJIMA K. Japanese and Korean voice search[C]//IEEE international conference on acoustics, speech and signal processing. IEEE, 2012.
[65] SRIRAM A, JUN H, SATHEESH S, et al. Cold fusion: Training seq2seq models together with language models[C]//Proc. Interspeech. 2018.
[66] GULCEHRE C, FIRAT O, XU K, et al. On using monolingual corpora in neural machine translation[J]. arXiv preprint arXiv:1503.03535, 2015.
[67] SHAW P, USZKOREIT J, VASWANI A. Self-attention with relative position representations[C]//Proceedings of the North American Chapter of the Association for Computational Linguistics. 2018.
[68] DONG L, XU S, XU B. Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2018.
[69] PHAM N Q, NGUYEN T S, NIEHUES J, et al. Very deep self-attention networks for end-to-end speech recognition[C]//Proc. Interspeech. 2019.
[70] KARITA S, CHEN N, HAYASHI T, et al. A comparative study on Transformer vs RNN in speech applications[C]//IEEE Automatic Speech Recognition and Understanding Workshop. IEEE, 2019.
[71] WATANABE S, HORI T, KIM S, et al. Hybrid CTC/attention architecture for end-to-end speech recognition[J]. IEEE Journal of Selected Topics in Signal Processing, 2017, 11(8): 1240-1253.
[72] MIAO H, CHENG G, ZHANG P, et al. Online hybrid CTC/attention architecture for end-to-end speech recognition[C]//Proc. Interspeech. 2019.
[73] GAO Q, WU H, SUN Y, et al. An end-to-end speech accent recognition method based on hybrid CTC/attention transformer ASR[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2021.
[74] MIAO H, CHENG G, GAO C, et al. Transformer-based online CTC/attention end-to-end speech recognition architecture[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2020.
[75] LI S, RAJ D, LU X, et al. Improving transformer-based speech recognition systems with compressed structure and speech attributes augmentation[C]//Proc. Interspeech. 2019.
[76] BIE A, VENKITESH B, MONTEIRO J, et al. A simplified fully quantized transformer for end-to-end speech recognition[J]. arXiv preprint arXiv:1911.03604, 2019.
[77] WANG X, GIRSHICK R, GUPTA A, et al. Non-local neural networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 7794-7803.
[78] SPERBER M, NIEHUES J, NEUBIG G, et al. Self-attentional acoustic models[C]//Proc. Interspeech. 2018.
[79] MOHAMED A, OKHONKO D, ZETTLEMOYER L. Transformers with convolutional context for ASR[J]. arXiv preprint arXiv:1904.11660, 2019.
[80] GEHRING J, AULI M, GRANGIER D, et al. Convolutional sequence to sequence learning[C]//Proceedings of the international conference on machine learning. 2017: 1243-1252.
[81] WATANABE S, HORI T, KARITA S, et al. ESPnet: End-to-end speech processing toolkit[C]//Proc. Interspeech. 2018.
[82] BENGIO Y, COURVILLE A, VINCENT P. Representation learning: A review and new perspectives[J]. IEEE transactions on pattern analysis and machine intelligence, 2013, 35(8): 1798-1828.
[83] ROUSSEAU A, DELÉGLISE P, ESTEVE Y. TED-LIUM: An automatic speech recognition dedicated corpus[C]//Language Resources and Evaluation Conference. 2012.
[84] CHU M, LI C, PENG H, et al. Domain adaptation for TTS systems[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2002.
[85] HE M, DENG Y, HE L. Robust sequence-to-sequence acoustic modeling with stepwise monotonic attention for neural TTS[C]//Proc. Interspeech. 2019.
[86] TJANDRA A, SAKTI S, NAKAMURA S. Listening while speaking: Speech chain by deep learning[C]//IEEE Automatic Speech Recognition and Understanding Workshop. 2017.
[87] ROSSENBACH N, ZEYER A, SCHLÜTER R, et al. Generating synthetic audio data for attention-based speech recognition systems[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2020.
[88] WANG G, ROSENBERG A, CHEN Z, et al. Improving speech recognition using consistent predictions on synthesized speech[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2020.
[89] TJANDRA A, SAKTI S, NAKAMURA S. End-to-end feedback loss in speech chain framework via straight-through estimator[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2019.
[90] HAYASHI T, WATANABE S, ZHANG Y, et al. Back-translation-style data augmentation for end-to-end ASR[C]//IEEE Spoken Language Technology Workshop. 2018.
[91] BASKAR M K, WATANABE S, ASTUDILLO R, et al. Semi-supervised sequence-to-sequence ASR using unpaired speech and text[C]//Proc. Interspeech. 2019.
[92] NOVITASARI S, SAKTI S, NAKAMURA S. Dynamically adaptive machine speech chain inference for TTS in noisy environment: Listen and speak louder[C]//Proc. Interspeech. 2021.
[93] LI J, ZHAO R, MENG Z, et al. Developing RNN-T models surpassing high-performance hybrid models with customization capability[C]//Proc. Interspeech. 2020.
[94] ZHENG X, LIU Y, GUNCELER D, et al. Using synthetic audio to improve the recognition of out-of-vocabulary words in end-to-end ASR systems[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2021.
[95] BASKAR M K, BURGET L, WATANABE S, et al. EAT: Enhanced ASR-TTS for self-supervised speech recognition[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2021.
[96] LI K, LIU Z, HE T, et al. An empirical study of transformer-based neural language model adaptation[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2020.
[97] PYLKKÖNEN J, UKKONEN A, KILPIKOSKI J, et al. Fast text-only domain adaptation of RNN-Transducer prediction network[C]//Proc. Interspeech. 2021.
[98] TJANDRA A, SAKTI S, NAKAMURA S. Machine speech chain with one-shot speaker adaptation[C]//Proc. Interspeech. 2018.
[99] WANG Y, SKERRY-RYAN R, STANTON D, et al. Tacotron: Towards end-to-end speech synthesis[C]//Proc. Interspeech. 2017.
[100] SHEN J, PANG R, WEISS R J, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2018.
[101] REN Y, RUAN Y, TAN X, et al. FastSpeech: Fast, robust and controllable text to speech[C]//Advances in neural information processing systems. 2019.
[102] REN Y, HU C, TAN X, et al. FastSpeech 2: Fast and high-quality end-to-end text to speech[C]//International Conference on Learning Representations. 2020.
[103] JIA Y, ZHANG Y, WEISS R J, et al. Transfer learning from speaker verification to multispeaker text-to-speech synthesis[C]//Advances in Neural Information Processing Systems. 2018.
[104] LIU D R, YANG C Y, WU S L, et al. Improving unsupervised style transfer in end-to-end speech synthesis with end-to-end speech recognition[C]//IEEE Spoken Language Technology Workshop. 2018.
[105] ZEN H, DANG V, CLARK R, et al. LibriTTS: A corpus derived from LibriSpeech for text-to-speech[C]//Proc. Interspeech. 2019.
[106] TACHIBANA H, UENOYAMA K, AIHARA S. Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2018.