Title

高自然度说话人脸视频生成研究

Alternative Title
RESEARCH ON TALKING FACE VIDEO GENERATION WITH HIGH NATURALNESS
Name
汪汀
Name (Pinyin)
WANG Ting
Student ID
12032474
Degree Type
Master's
Degree Discipline
0809 Electronic Science and Technology
Subject Category / Professional Degree Category
08 Engineering
Supervisor
于仕琪
Supervisor's Affiliation
Department of Computer Science and Engineering
Thesis Defense Date
2023-05
Thesis Submission Date
2023-06-29
Degree-Granting Institution
Southern University of Science and Technology
Degree-Granting Location
Shenzhen
Abstract

Talking face video generation takes an arbitrary audio clip together with a reference image or a short video of a target subject as the driving source, and produces a video of that subject speaking in synchronization with the audio. Because of its broad application prospects in multimedia generation areas such as virtual digital humans and animation dubbing, this task has attracted great research interest. The lip-sync accuracy and visual quality of the generated subject, as well as the transferability of head pose, are essential for synthesizing natural and realistic talking face videos. In terms of audio-visual synchronization and visual quality, current methods tend to ignore the spatial position information of the lips when reconstructing the subject's lip shapes, which leads to poorly coordinated lip motion and unnatural-looking subjects; other methods produce videos of relatively high fidelity but only work for a specific target speaker and generalize poorly. In terms of pose controllability, current methods can usually only preserve the original head pose and cannot change it, so the generated talking faces appear stiff and unnatural and cannot adapt to different scene requirements.
To address these problems, this thesis proposes a new generative adversarial learning framework that can generate talking face videos for arbitrary target speakers. To make full use of the visual information in the lip region, a spatial attention mechanism is introduced into the generator network of the model. Through average pooling and max pooling operations, it effectively highlights informative spatial regions, allowing the generator to perform more efficient spatial feature selection. This lets the network pay more attention to reconstructing the lip region during adversarial learning and carry out fine-grained lip shape correction. In addition, the method introduces a content loss and a total variation regularization term into the objective function as constraints, which reduces lip jitter and artifacts in the generated videos. Experiments on multiple datasets, including LRW, LRS2, and LRS3, show that the method clearly improves on previous speaker-agnostic approaches in both lip-sync accuracy and visual quality.
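The spatial attention and total variation terms described above can be pictured with a short sketch. The PyTorch snippet below is only an illustration of the general idea, assuming a CBAM-style spatial attention block (average and max pooling across channels, followed by a convolution and a sigmoid) and a standard total variation penalty; the module name, kernel size, and the exact placement inside the thesis's generator are assumptions, not the author's implementation.

```python
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """CBAM-style spatial attention (illustrative sketch): pool features across
    the channel dimension with average and max pooling, then predict a per-pixel
    weight map that highlights informative spatial regions such as the lip area."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, C, H, W)
        avg_map = torch.mean(x, dim=1, keepdim=True)        # (B, 1, H, W)
        max_map, _ = torch.max(x, dim=1, keepdim=True)      # (B, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn                                     # spatially re-weighted features


def total_variation_loss(frames: torch.Tensor) -> torch.Tensor:
    """Total variation regularizer on generated frames (B, C, H, W): penalizing
    differences between neighbouring pixels discourages jitter and artifacts."""
    tv_h = (frames[:, :, 1:, :] - frames[:, :, :-1, :]).abs().mean()
    tv_w = (frames[:, :, :, 1:] - frames[:, :, :, :-1]).abs().mean()
    return tv_h + tv_w
```

Such a block is cheap to drop into a convolutional generator, since it only learns a single small convolution on two pooled maps.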
To address the head pose problem, this thesis proposes a pose-transfer-based generative adversarial network that achieves natural and flexible pose transfer while generating high-fidelity talking face videos. The method disentangles pose and motion features from a pose source video and defines an implicit space as a common paradigm. It then builds separate mappings from this implicit space to a primary feature space containing the audio features and to a pose space containing the head pose. Finally, the audio, pose, and identity features are combined into a joint feature, so that the talking face video is driven jointly by the reference image, the audio, and the pose source video while maintaining accurate audio-visual synchronization. Experiments on the LRW and LRS2 datasets demonstrate the effectiveness and superiority of the method.
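As a rough sketch of the final fusion step, the snippet below concatenates identity, audio, and pose embeddings and projects them into a single joint conditioning feature. The embedding dimensions, the concatenation-plus-MLP design, and the class name are illustrative assumptions; the thesis's encoders and implicit-space mappings are not reproduced here.

```python
import torch
import torch.nn as nn


class JointFeatureFusion(nn.Module):
    """Fuse identity, audio, and pose embeddings into one joint feature that
    conditions the talking-face generator (illustrative sketch only)."""

    def __init__(self, id_dim: int = 512, audio_dim: int = 256,
                 pose_dim: int = 64, out_dim: int = 512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(id_dim + audio_dim + pose_dim, out_dim),
            nn.ReLU(inplace=True),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, id_feat, audio_feat, pose_feat):
        # id_feat:    (B, id_dim)    encoded from the reference image
        # audio_feat: (B, audio_dim) encoded from the driving audio window
        # pose_feat:  (B, pose_dim)  encoded from the pose source video
        return self.fuse(torch.cat([id_feat, audio_feat, pose_feat], dim=-1))
```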

Keywords
Language
Chinese
Training Category
Independent Training
Year of Enrollment
2020
Year of Degree Conferral
2023-06

Degree Assessment Subcommittee
Electronic Science and Technology
Chinese Library Classification (CLC) Number
TP391.4
Source Repository
Manual Submission
Item Type
Thesis
Identifier
http://sustech.caswiz.com/handle/2SGJ60CL/544505
Collection
College of Engineering_Department of Computer Science and Engineering
Recommended Citation
GB/T 7714
汪汀. 高自然度说话人脸视频生成研究[D]. 深圳: 南方科技大学, 2023.
Files in This Item
File Name / Size: 12032474-汪汀-计算机科学与工程 (11847 KB); Access: Restricted