Title
多教师知识蒸馏及其在说话人脸生成中的应用研究
Title (English)
RESEARCH ON MULTI-TEACHER KNOWLEDGE DISTILLATION AND ITS APPLICATION IN TALKING FACE GENERATION
Name
郭格
Name (Pinyin)
GUO Ge
Student ID
12032488
Degree Type
Master
Degree Discipline
0809 Electronic Science and Technology
Discipline Category/Professional Degree Category
08 Engineering
Supervisor
姚新
Supervisor Affiliation
Department of Computer Science and Engineering
Thesis Defense Date
2023-05-13
Thesis Submission Date
2023-06-27
Degree-Granting Institution
南方科技大学 (Southern University of Science and Technology)
Degree-Granting Location
Shenzhen
Abstract

Knowledge distillation is commonly used to address the difficulty of deploying and running deep neural networks, such as talking face generation models, on resource-limited devices, a difficulty caused by their large number of parameters and highly complex computations. Compared with conventional single-teacher knowledge distillation, multi-teacher knowledge distillation aggregates the outputs of multiple teacher models and thereby provides more comprehensive and accurate label information to guide the training of the student model. Because existing multi-teacher knowledge distillation pays little attention to the diversity of the teacher set, this thesis approaches the problem from the angle of improving teacher diversity and proposes an online multi-teacher knowledge distillation algorithm to improve the performance of the student model. On the one hand, the online scheme allows the teacher models to update dynamically, which yields diversity and also resolves the difficulty of selecting and evaluating a teacher set. On the other hand, online training effectively bridges the gap between the student and the teachers and improves the efficiency of knowledge transfer. Extensive experiments on the CIFAR-100 dataset show that the proposed online knowledge distillation outperforms the current state-of-the-art multi-teacher knowledge distillation methods on eight popular teacher-student combinations, achieving an absolute accuracy improvement of 0.63% in the best case with an average deviation of only 0.08%, making it more stable than the other methods.

In addition, this thesis studies the application of multi-teacher knowledge distillation to audio-driven talking face generation. Because existing talking face models emphasize different aspects in their design, it is difficult for a single model to be optimal on multiple performance metrics at once. This thesis therefore proposes a new multi-teacher knowledge distillation framework for talking face generation based on multi-objective evolution. Specifically, the training of the talking face generation model is first formulated as a multi-objective optimization problem, with the metrics explicitly modeled as optimization objectives, and an evolutionary algorithm is used to obtain a set of diverse solutions that serve as teacher models. The student model is then trained under the joint guidance of the ground-truth labels and this set of teachers. Extensive experiments on the LRW and LRS2 datasets show that the proposed method reduces FLOPs and the number of parameters by 74.7% and 75.0%, respectively, while maintaining or even improving model accuracy. In the best case, compared with other knowledge distillation methods, the proposed method improves SSIM by 0.002 and 0.005 and reduces LMD by 0.13 and 0.09 on LRW and LRS2, respectively.

Other Abstract

Knowledge distillation (KD) is commonly used to solve the problem that deep neural networks, such as audio-driven talking face generation (ATFG) models, are difficult to deploy and run on devices with limited resources due to their large number of parameters and highly complex computations. Compared with single-teacher KD, multi-teacher KD (MKD) provides more comprehensive and accurate information to guide the training of the student. Because existing MKD methods pay little attention to the diversity of the teacher set, this thesis proposes an online MKD (OMKD) algorithm to improve the performance of the student. On the one hand, OMKD gives the teachers the ability to update dynamically, which yields diversity and resolves the problem that a teacher set is difficult to select and evaluate. On the other hand, OMKD effectively bridges the gap between the student and the teachers, improving the efficiency of knowledge transfer. Extensive experiments on the CIFAR-100 dataset show that OMKD outperforms the current state-of-the-art MKD methods on eight popular teacher-student combinations; in the best case an absolute accuracy improvement of 0.63% is achieved, with an average deviation of only 0.08%, making OMKD more stable than the other methods.
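
The abstract builds on standard multi-teacher logit distillation, in which the student is supervised jointly by the ground-truth labels and an aggregate of the teachers' soft predictions. The following PyTorch sketch illustrates only that generic setup, not the thesis's actual OMKD code: the uniform averaging of teacher logits, the temperature T, and the weight alpha are illustrative assumptions, and OMKD's defining ingredient, training the teachers online alongside the student, is omitted.

# Minimal sketch of multi-teacher logit distillation (illustrative only;
# OMKD additionally updates the teachers online, which is not shown here).
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, labels,
                          T=4.0, alpha=0.9):
    # Hard-label term: standard cross-entropy against the ground truth.
    ce = F.cross_entropy(student_logits, labels)
    # Aggregate the teacher ensemble; uniform logit averaging is the
    # simplest choice (an assumption; adaptive weighting is a common variant).
    avg_teacher_logits = torch.stack(teacher_logits_list, dim=0).mean(dim=0)
    # Soft-label term: KL divergence between temperature-softened
    # distributions, scaled by T^2 to keep gradient magnitudes comparable.
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(avg_teacher_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    return alpha * kd + (1.0 - alpha) * ce

# Toy usage: batch of 8, 100 classes (as in CIFAR-100), 3 teachers.
student_logits = torch.randn(8, 100, requires_grad=True)
teacher_logits_list = [torch.randn(8, 100) for _ in range(3)]
labels = torch.randint(0, 100, (8,))
loss = multi_teacher_kd_loss(student_logits, teacher_logits_list, labels)
loss.backward()  # in this sketch, gradients flow only to the student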

In addition, this thesis investigates the application of MKD to ATFG. Since existing ATFG models emphasize different design goals, it is difficult for a single model to be optimal on multiple metrics. This thesis therefore proposes a new evolutionary multi-objective MKD (EM-MKD) framework for ATFG. Specifically, the training of the ATFG model is first formulated as a multi-objective optimization problem: the metrics are explicitly modeled as optimization objectives, and an evolutionary algorithm is used to obtain a set of diverse solutions that serve as teachers. The student is then jointly guided by the ground-truth labels and the set of teachers. Extensive experiments show that EM-MKD reduces FLOPs and parameters by 74.7% and 75.0%, respectively, while maintaining or even improving the performance of the model. In the best case, compared with other KD methods, EM-MKD achieves SSIM improvements of 0.002 and 0.005 and LMD reductions of 0.13 and 0.09 on the LRW and LRS2 datasets, respectively.
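
The multi-objective step described above, obtaining a diverse set of non-dominated solutions to act as teachers, can be illustrated by the Pareto-dominance filtering that multi-objective evolutionary algorithms apply during selection. The sketch below is a simplified illustration rather than the thesis's EM-MKD implementation: the two objectives (SSIM converted to 1 - SSIM so that both objectives are minimized, alongside LMD) and all candidate names and scores are hypothetical.

# Illustrative sketch: extract the Pareto-optimal (non-dominated) candidates
# under two minimization objectives, as an evolutionary multi-objective
# algorithm would when forming a diverse teacher set. Scores are hypothetical.

def dominates(a, b):
    """a dominates b if a is no worse in every objective and strictly
    better in at least one (objectives given in minimization form)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates):
    """Return the names of candidates not dominated by any other candidate."""
    front = []
    for i, (name, objs) in enumerate(candidates):
        if not any(dominates(other, objs)
                   for j, (_, other) in enumerate(candidates) if j != i):
            front.append(name)
    return front

# Each hypothetical candidate model scored on (1 - SSIM, LMD), both minimized.
candidates = [
    ("model_a", (1 - 0.92, 1.40)),
    ("model_b", (1 - 0.90, 1.25)),  # trades SSIM for better LMD
    ("model_c", (1 - 0.91, 1.55)),  # dominated by model_a on both objectives
    ("model_d", (1 - 0.93, 1.60)),
]
print(pareto_front(candidates))  # ['model_a', 'model_b', 'model_d']

The surviving candidates make different metric trade-offs, which is exactly the kind of diversity the abstract argues a teacher set should have.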

Keywords
Other Keywords
Language
Chinese
Training Category
Independent training
Year of Enrollment
2020
Year Degree Conferred
2023-06

Degree Assessment Subcommittee
Electronic Science and Technology
Chinese Library Classification Number
TP183
Source Repository
Manually submitted
Document Type
Degree thesis
Identifier
http://sustech.caswiz.com/handle/2SGJ60CL/544060
Collection
College of Engineering / Department of Computer Science and Engineering
Recommended Citation (GB/T 7714)
郭格. 多教师知识蒸馏及其在说话人脸生成中的应用研究[D]. 深圳: 南方科技大学, 2023.
Files in This Item
12032488-郭格-计算机科学与工程 (2659 KB), restricted access