[1] CHUNG Y A, GLASS J. Speech2Vec: A sequence-to-sequence framework for learning word embeddings from speech[C]//Proceedings of the Interspeech. 2018: 811-815.
[2] CHUANG Y S, LIU C L, LEE H Y, et al. SpeechBERT: An audio-and-text jointly learned language model for end-to-end spoken question answering[C]//Proceedings of the Interspeech. 2020: 4168-4172.
[3] SONG X, WANG G, HUANG Y, et al. Speech-XLNet: Unsupervised acoustic model pretraining for self-attention networks[C]//Proceedings of the Interspeech. 2020: 3765-3769.
[4] WANG C, WU Y, QIAN Y, et al. UniSpeech: Unified speech representation learning with labeled and unlabeled data[C]//Proceedings of the International Conference on Machine Learning. 2021: 10937-10947.
[5] PANAYOTOV V, CHEN G, POVEY D, et al. Librispeech: An ASR corpus based on public domain audio books[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015: 5206-5210.
[6] BAEVSKI A, HSU W N, XU Q, et al. data2vec: A general framework for self-supervised learning in speech, vision and language[C]//Proceedings of the International Conference on Machine Learning: volume 162. 2022: 1298-1312.
[7] RABINER L R, WILPON J G. Considerations in applying clustering techniques to speaker-independent word recognition[J]. The Journal of the Acoustical Society of America, 1979, 66(3): 663-673.
[8] WILPON J, RABINER L. A modified K-means clustering algorithm for use in isolated word recognition[J]. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1985, 33(3): 587-594.
[9] GAUVAIN J L, LEE C H. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains[J]. IEEE Transactions on Speech and Audio Processing, 1994, 2(2): 291-298.
[10] BAHL L, BROWN P, DE SOUZA P, et al. Maximum mutual information estimation of hidden Markov model parameters for speech recognition[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): volume 11. IEEE, 1986: 49-52.
[11] SMITH N, GALES M. Speech recognition using SVMs[C]//Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic. 2001: 1197-1204.
[12] VENKATARAMANI V, CHAKRABARTTY S, BYRNE W. Support vector machines for segmental minimum Bayes risk decoding of continuous speech[C]//IEEE Workshop on Automatic Speech Recognition and Understanding. IEEE, 2003: 13-18.
[13] WAN V, RENALS S. SVMSVM: Support vector machine speaker verification methodology[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): volume 2. IEEE, 2003: 221-224.
[14] HINTON G E, SALAKHUTDINOV R R. Reducing the dimensionality of data with neural networks[J]. Science, 2006, 313(5786): 504-507.
[15] VINCENT P, LAROCHELLE H, LAJOIE I, et al. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion[J]. Journal of Machine Learning Research, 2010, 11(12): 3371-3408.
[16] GUTMANN M U, HYVÄRINEN A. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics[J]. Journal of Machine Learning Research, 2012, 13(2): 307-361.
[17] WANG A, SINGH A, MICHAEL J, et al. GLUE: A multi-task benchmark and analysis platform for natural language understanding[C]//Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. 2018: 353-355.
[18] YANG S W, CHI P H, CHUANG Y S, et al. SUPERB: Speech processing universal performance benchmark[C]//Proceedings of the Interspeech. 2021: 1194-1198.
[19] TSAI H S, CHANG H J, HUANG W C, et al. SUPERB-SG: Enhanced speech processing universal performance benchmark for semantic and generative capabilities[C]//Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022: 8479-8492.
[20] KINGMA D P, WELLING M. Auto-encoding variational Bayes[A/OL]. 2013. https://arxiv.org/abs/1312.6114.
[21] RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving language understanding by generative pre-training[M/OL]. OpenAI, 2018. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf.
[22] RADFORD A, WU J, CHILD R, et al. Language models are unsupervised multitask learners[M/OL]. OpenAI, 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
[23] BROWN T, MANN B, RYDER N, et al. Language models are few-shot learners[C]//Advances in Neural Information Processing Systems: volume 33. 2020: 1877-1901.
[24] SHOEYBI M, PATWARY M, PURI R, et al. Megatron-LM: Training multi-billion parameter language models using model parallelism[A/OL]. 2019. https://arxiv.org/abs/1909.08053.
[25] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional Transformers for language understanding[C]//Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). 2019: 4171-4186.
[26] LIU Y, OTT M, GOYAL N, et al. RoBERTa: A robustly optimized BERT pretraining approach[A/OL]. 2019. https://arxiv.org/abs/1907.11692.
[27] BAEVSKI A, ZHOU Y, MOHAMED A, et al. wav2vec 2.0: A framework for self-supervised learning of speech representations[C]//Advances in Neural Information Processing Systems: volume 33. 2020: 12449-12460.
[28] HSU W N, BOLTE B, TSAI Y H H, et al. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3451-3460.
[29] CHEN S, WANG C, CHEN Z, et al. WavLM: Large-scale self-supervised pre-training for full stack speech processing[J]. IEEE Journal of Selected Topics in Signal Processing, 2022, 16(6): 1505-1518.
[30] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems: volume 30. 2017: 5998-6008.
[31] WANG C, RIVIERE M, LEE A, et al. VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021: 993-1003.
[32] CHEN G, CHAI S, WANG G, et al. GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio[A/OL]. 2021. https://arxiv.org/abs/2106.06909.
[33] AO J, WANG R, ZHOU L, et al. SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing[C]//Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022: 5723-5738.
[34] BAI H, ZHENG R, CHEN J, et al. A3T: Alignment-aware acoustic and text pretraining for speech synthesis and editing[C]//International Conference on Machine Learning. PMLR, 2022: 1399-1411.
[35] KHURANA S, LAURENT A, GLASS J. SAMU-XLSR: Semantically-aligned multimodal utterance-level cross-lingual speech representation[J]. IEEE Journal of Selected Topics in Signal Processing, 2022, 16(6): 1493-1504.
[36] AGUILAR G, LING Y, ZHANG Y, et al. Knowledge distillation from internal representations[C]//Proceedings of the AAAI Conference on Artificial Intelligence: volume 34. 2020: 7350-7357.
[37] ZHANG S, ZHENG X, YANG C, et al. You Only Compress Once: Towards effective and elastic BERT compression via exploit-explore stochastic nature gradient[A/OL]. 2021. https://arxiv.org/abs/2106.02435.
[38] YU S, CHEN T, SHEN J, et al. Unified visual Transformer compression[C/OL]//Proceedings of the International Conference on Learning Representations. 2021. https://openreview.net/forum?id=9jsZiUgkCZP.
[39] ZAFRIR O, LAREY A, BOUDOUKH G, et al. Prune once for all: Sparse pre-trained language models[A/OL]. 2021. https://arxiv.org/abs/2111.05754.
[40] PENG Z, BUDHKAR A, TUIL I, et al. Shrinking Bigfoot: Reducing wav2vec 2.0 footprint[C/OL]//Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing. 2021: 134-141. DOI: 10.18653/v1/2021.sustainlp-1.14.
[41] CHANG H J, YANG S W, LEE H Y. DistilHuBERT: Speech representation learning by layer-wise distillation of hidden-unit BERT[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022: 7087-7091.
[42] HINTON G, VINYALS O, DEAN J. Distilling the knowledge in a neural network[A/OL]. 2015. https://arxiv.org/abs/1503.02531.
[43] LIN T Q, YANG T H, CHANG C Y, et al. Compressing Transformer-based self-supervised models for speech processing[A/OL]. 2022. https://arxiv.org/abs/2211.09949.
[44] MENG Y, CHEN H J, SHI J, et al. On compressing sequences for self-supervised speech models [C]//2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023: 1128-1135.
[45] LEE Y, JANG K, GOO J, et al. FitHuBERT: Going thinner and deeper for knowledge distillation of speech self-supervised models[C]//Proceedings of the Interspeech. 2022: 3588-3592.
[46] GUIMARÃES H R, PIMENTEL A, AVILA A R, et al. Improving the robustness of DistilHuBERT to unseen noisy conditions via data augmentation, curriculum learning, and multi-task enhancement[A/OL]. 2022. https://arxiv.org/abs/2211.06562.
[47] GUIMARÃES H R, PIMENTEL A, AVILA A R, et al. RobustDistiller: Compressing universal speech representations for enhanced environment robustness[A/OL]. 2023. https://arxiv.org/abs/2302.09437.
[48] SKERRY-RYAN R, BATTENBERG E, XIAO Y, et al. Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron[C]//International Conference on Machine Learning. PMLR, 2018: 4693-4702.
[49] FISCUS J G, AJOT J, GAROFOLO J S, et al. Results of the 2006 Spoken Term Detection Evaluation[C]//Proceedings of the ACM Special Interest Group on Information Retrieval: volume 7. 2007: 51-57.
[50] KINNUNEN T, EVANS N, YAMAGISHI J, et al. ASVspoof 2017: Automatic speaker verification spoofing and countermeasures challenge evaluation plan[J]. Training, 2017, 10(1508): 1508.
[51] National Institute of Standards and Technology. The 2009 (RT-09) rich transcription meeting recognition evaluation plan[EB/OL]. 2009. https://web.archive.org/web/20100606092041if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf.
[52] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: A method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002: 311-318.
[53] KUBICHEK R. Mel-cepstral distance measure for objective speech quality assessment[C]//Proceedings of IEEE Pacific Rim Conference on Communications Computers and Signal Processing (PACRIM): volume 1. IEEE, 1993: 125-128.
[54] LUO Y, MESGARANI N. TasNet: Time-domain audio separation network for real-time, single-channel speech separation[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018: 696-700.
[55] RIX A W, BEERENDS J G, HOLLIER M P, et al. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs[C]// Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): volume 2. IEEE, 2001: 749-752.
[56] TAAL C H, HENDRIKS R C, HEUSDENS R, et al. An algorithm for intelligibility prediction of time–frequency weighted noisy speech[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(7): 2125-2136.
[57] JIA Y, ZHANG Y, WEISS R, et al. Transfer learning from speaker verification to multispeaker text-to-speech synthesis[C]//Advances in Neural Information Processing Systems: volume 31. 2018: 4485-4495.
[58] KAHN J, RIVIERE M, ZHENG W, et al. Libri-Light: A benchmark for ASR with limited or no supervision[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2020: 7669-7673.
[59] MOHAMED A, LEE H Y, BORGHOLT L, et al. Self-supervised speech representation learning: A review[J]. IEEE Journal of Selected Topics in Signal Processing, 2022, 16(6): 1179-1210.
[60] VAN DEN OORD A, VINYALS O, KAVUKCUOGLU K. Neural discrete representation learning[C]//Advances in Neural Information Processing Systems: volume 30. 2017: 6309-6318.
[61] PASCUAL S, RAVANELLI M, SERRÀ J, et al. Learning problem-agnostic speech representations from multiple self-supervised tasks[C]//Proceedings of the Interspeech. 2019: 161-165.
[62] CHUNG Y A, HSU W N, TANG H, et al. An unsupervised autoregressive model for speech representation learning[C]//Proceedings of the Interspeech. 2019: 146-150.
[63] CHUNG Y A, GLASS J. Generative pre-training for speech with autoregressive predictive coding[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020: 3497-3501.
[64] WANG W, TANG Q, LIVESCU K. Unsupervised pre-training of bidirectional speech encoders via masked reconstruction[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020: 6889-6893.
[65] LIU A T, LI S W, LEE H Y. TERA: Self-supervised learning of Transformer encoder representation for speech[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 2351-2366.
[66] LIU A T, YANG S W, CHI P H, et al. Mockingjay: Unsupervised speech representation learning with deep bidirectional Transformer encoders[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020: 6419-6423.
[67] SCHNEIDER S, BAEVSKI A, COLLOBERT R, et al. wav2vec: Unsupervised pre-training for speech recognition[C]//Proceedings of the Interspeech. 2019: 3465-3469.
[68] PAUL D B, BAKER J. The design for the Wall Street Journal-based CSR corpus[C]//Proceedings of the Workshop on Speech and Natural Language. 1992: 357-362.
[69] BAEVSKI A, SCHNEIDER S, AULI M. vq-wav2vec: Self-supervised learning of discrete speech representations[A/OL]. 2019. https://arxiv.org/abs/1910.05453.
[70] CHUNG Y A, ZHANG Y, HAN W, et al. W2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training[A/OL]. 2021. https://arxiv.org/abs/2108.06209.
[71] GULATI A, QIN J, CHIU C C, et al. Conformer: Convolution-augmented Transformer for speech recognition[C]//Proceedings of the Interspeech. 2020: 5036-5040.
[72] CAI H, GAN C, WANG T, et al. Once for all: Train one network and specialize it for efficient deployment[C/OL]//International Conference on Learning Representations. 2020. https://openreview.net/forum?id=HylxE1HKwS.
[73] CHEN M, PENG H, FU J, et al. AutoFormer: Searching Transformers for visual recognition[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 12270-12280.
[74] LAI C I J, ZHANG Y, LIU A H, et al. PARP: Prune, adjust and re-prune for self-supervised speech recognition[C]//Advances in Neural Information Processing Systems: volume 34. 2021: 21256-21272.
[75] YU F, GUO J, XI W, et al. Audio DistilBERT: A distilled audio BERT for speech representation learning[C]//2021 International Joint Conference on Neural Networks (IJCNN). IEEE, 2021: 1-8.
[76] CHI P H, CHUNG P H, WU T H, et al. Audio ALBERT: A lite BERT for self-supervised learning of audio representation[C]//2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021: 344-350.
[77] LAN Z, CHEN M, GOODMAN S, et al. ALBERT: A lite BERT for self-supervised learning of language representations[A/OL]. 2019. https://arxiv.org/abs/1909.11942.
[78] VAN DER MAATEN L, HINTON G. Visualizing data using t-SNE[J]. Journal of Machine Learning Research, 2008, 9(11): 2579-2605.
[79] ASHIHARA T, MORIYA T, MATSUURA K, et al. Deep versus wide: An analysis of student architectures for task-agnostic knowledge distillation of self-supervised speech models[A/OL]. 2022. https://arxiv.org/abs/2207.06867.
[80] DENG L, LI G, HAN S, et al. Model compression and hardware acceleration for neural networks: A comprehensive survey[J]. Proceedings of the IEEE, 2020, 108(4): 485-532.
[81] HE X, ZHAO K, CHU X. AutoML: A survey of the state-of-the-art[J]. Knowledge-Based Systems, 2021, 212: 106622.
[82] KUUTTI S, BOWDEN R, JIN Y, et al. A survey of deep learning applications to autonomous vehicle control[J]. IEEE Transactions on Intelligent Transportation Systems, 2020, 22(2): 712-733.
[83] LU Z, SREEKUMAR G, GOODMAN E, et al. Neural architecture transfer[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(9): 2971-2989.
[84] LI M, LIN J, DING Y, et al. GAN Compression: Efficient architectures for interactive conditional GANs[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 5283-5293.
[85] HOU L, HUANG Z, SHANG L, et al. DynaBERT: Dynamic BERT with adaptive width and depth[C]//Advances in Neural Information Processing Systems: volume 33. 2020: 9782-9793.
[86] WANG R, WEI Z, DUAN H, et al. EfficientTDNN: Efficient architecture search for speaker recognition[A/OL]. 2021. https://arxiv.org/abs/2103.13581.
[87] CHEN H J, MENG Y, LEE H Y. Once-for-all sequence compression for self-supervised speech models[A/OL]. 2022. https://arxiv.org/abs/2211.02332.
[88] SRIVASTAVA N, HINTON G, KRIZHEVSKY A, et al. Dropout: A simple way to prevent neural networks from overfitting[J]. Journal of Machine Learning Research, 2014, 15(1): 1929-1958.
[89] OTT M, EDUNOV S, BAEVSKI A, et al. fairseq: A fast, extensible toolkit for sequence modeling[C]//Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations. 2019: 48-53.
[90] SANH V, DEBUT L, CHAUMOND J, et al. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter[C]//Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing - Advances in Neural Information Processing Systems. 2019: 1-5.
[91] WANG B, REN Y, SHANG L, et al. Exploring extreme parameter compression for pre-trained language models[C/OL]//Proceedings of the International Conference on Learning Representations. 2021. https://openreview.net/forum?id=RftryyYyjiG.
[92] KOSSAIFI J, PANAGAKIS Y, ANANDKUMAR A, et al. TensorLy: Tensor learning in Python [J]. Journal of Machine Learning Research, 2019, 20(26): 925-930.
[93] TUCKER L R. Some mathematical notes on three-mode factor analysis[J]. Psychometrika, 1966, 31(3): 279-311.
[94] CHEN S, WU Y, WANG C, et al. UniSpeech-SAT: Universal speech representation learning with speaker aware pre-training[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2022: 6152-6156.
[95] SRIRAM A, AULI M, BAEVSKI A. Wav2Vec-Aug: Improved self-supervised training with limited data[A/OL]. 2022. https://arxiv.org/abs/2206.13654.
[96] CHEN T, KORNBLITH S, NOROUZI M, et al. A simple framework for contrastive learning of visual representations[C]//International Conference on Machine Learning. PMLR, 2020: 1597-1607.
[97] LEE H, LEE K, LEE K, et al. Improving transferability of representations via augmentation-aware self-supervision[C]//Advances in Neural Information Processing Systems: volume 34. 2021: 17710-17722.
[98] ZHU Y, KO T, SNYDER D, et al. Self-attentive speaker embeddings for text-independent speaker verification[C]//Proceedings of the Interspeech. 2018: 3573-3577.
[99] REDDY C K, DUBEY H, GOPAL V, et al. ICASSP 2021 deep noise suppression challenge[C]// Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021: 6623-6627.