[1] SHA T, ZHANG W, SHEN T, et al. Deep Person Generation: A Survey from the Perspective of Face, Pose, and Cloth Synthesis[J]. ACM Computing Surveys, 2023, 55(12): 1-37.
[2] JI X, ZHOU H, WANG K, et al. Audio-driven emotional video portraits[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 14080-14089.
[3] SUWAJANAKORN S, SEITZ S M, KEMELMACHER-SHLIZERMAN I. Synthesizing Obama: learning lip sync from audio[J]. ACM Transactions on Graphics (TOG), 2017, 36(4): 1-13.
[4] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[5] KARRAS T, AILA T, LAINE S, et al. Audio-driven facial animation by joint end-to-end learning of pose and emotion[J]. ACM Transactions on Graphics (TOG), 2017, 36(4): 1-12.
[6] GUO Y, CHEN K, LIANG S, et al. AD-NeRF: Audio driven neural radiance fields for talking head synthesis[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 5784-5794.
[7] LIU X, XU Y, WU Q, et al. Semantic-aware implicit neural audio-driven video portrait generation[C]//Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2022: 106-125.
[8] YAO S, ZHONG R, YAN Y, et al. DFA-NeRF: personalized talking head generation via disentangled face attributes neural rendering[M]. arXiv preprint arXiv:2201.00791, 2022.
[9] MILDENHALL B, SRINIVASAN P P, TANCIK M, et al. Nerf: Representing scenes as neural radiance fields for view synthesis[C]//European conference on computer vision. Springer, 2020: 405-421.
[10] PRAJWAL K R, MUKHOPADHYAY R, PHILIP J, et al. Towards automatic face-to-face translation[C]//Proceedings of the 27th ACM International Conference on Multimedia. 2019: 1428-1436.
[11] ZHOU Y, HAN X, SHECHTMAN E, et al. MakeItTalk: speaker-aware talking-head animation[J]. ACM Transactions on Graphics (TOG), 2020, 39(6): 1-15.
[12] ZHOU H, SUN Y, WU W, et al. Pose-controllable talking face generation by implicitly modularized audio-visual representation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 4176-4186.
[13] CHEN L, LI Z, MADDOX R K, et al. Lip movements generation at a glance[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 520-535.
[14] JAMALUDIN A, CHUNG J S, ZISSERMAN A. You said that?: Synthesising talking faces from audio[J]. International Journal of Computer Vision, 2019, 127(11): 1767-1779.
[15] ESKIMEZ S E, MADDOX R K, XU C, et al. End-to-end generation of talking faces from noisy speech[C]//ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020: 1948-1952.
[16] CHEN L, MADDOX R K, DUAN Z, et al. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 7832-7841.
[17] YE Z, XIA M, YI R, et al. Audio-driven talking face video generation with dynamic convolution kernels[J]. IEEE Transactions on Multimedia, 2022.
[18] CHEN Y, DAI X, LIU M, et al. Dynamic convolution: Attention over convolution kernels[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020: 11030-11039.
[19] QIU Z, ZHUANG Y, YAN F, et al. RGB-DI images and full convolution neural network-based outdoor scene understanding for mobile robots[J]. IEEE Transactions on Instrumentation and Measurement, 2018, 68(1): 27-37.
[20] HONG F T, ZHANG L, SHEN L, et al. Depth-aware generative adversarial network for talking head video generation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 3397-3406.
[21] LIANG B, PAN Y, GUO Z, et al. Expressive talking head generation with granular audiovisual control[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 3387-3396.
[22] WANG D, DENG Y, YIN Z, et al. Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis[M]. arXiv preprint arXiv:2211.14506, 2022.
[23] JANG Y, RHO K, WOO J B, et al. That’s What I Said: Fully-Controllable Talking Face Generation[M]. arXiv preprint arXiv:2304.03275, 2023.
[24] ZHU H, HUANG H, LI Y, et al. Arbitrary talking face generation via attentional audio-visual coherence learning[M]. arXiv preprint arXiv:1812.06589, 2018.
[25] ZHANG L, CHEN Q, LIU Z. Talking Head Generation for Media Interaction System with Feature Disentanglement[C]//2022 IEEE 28th International Conference on Parallel and Distributed Systems (ICPADS). IEEE, 2023: 403-410.
[26] YU L, YU J, LING Q. Mining audio, text and visual information for talking face generation[C]//2019 IEEE International Conference on Data Mining (ICDM). IEEE, 2019: 787-795.
[27] PRAJWAL K, MUKHOPADHYAY R, NAMBOODIRI V P, et al. A lip sync expert is all you need for speech to lip generation in the wild[C]//Proceedings of the 28th ACM International Conference on Multimedia. 2020: 484-492.
[28] MCCULLOCH W S, PITTS W. A logical calculus of the ideas immanent in nervous activity[J]. The bulletin of mathematical biophysics, 1943, 5: 115-133.
[29] HINTON G E, SALAKHUTDINOV R R. Reducing the dimensionality of data with neural networks[J]. Science, 2006, 313(5786): 504-507.
[30] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. Imagenet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84-90.
[31] GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets[J]. Advances in Neural Information Processing Systems, 2014, 27.
[32] TIAN Y, WANG Q, HUANG Z, et al. Off-policy reinforcement learning for efficient and effective gan architecture search[C]//Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2020: 175-192.
[33] EIFFERT S, LI K, SHAN M, et al. Probabilistic crowd GAN: Multimodal pedestrian trajectory prediction using a graph vehicle-pedestrian attention network[J]. IEEE Robotics and Automation Letters, 2020, 5(4): 5026-5033.
[34] YU L, ZHANG W, WANG J, et al. Seqgan: Sequence generative adversarial nets with policy gradient[C]//Proceedings of the AAAI conference on artificial intelligence: volume 31. 2017.
[35] RADFORD A, METZ L, CHINTALA S. Unsupervised representation learning with deep convolutional generative adversarial networks[M]. arXiv preprint arXiv:1511.06434, 2015.
[36] KARRAS T, AILA T, LAINE S, et al. Progressive growing of gans for improved quality, stability, and variation[M]. arXiv preprint arXiv:1710.10196, 2017.
[37] KARRAS T, LAINE S, AILA T. A style-based generator architecture for generative adversarial networks[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019: 4401-4410.
[38] ZHU J, SHEN Y, ZHAO D, et al. In-domain gan inversion for real image editing[C]//Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2020: 592-608.
[39] CHENG Y C, LIN C H, LEE H Y, et al. InOut: diverse image outpainting via GAN inversion[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 11431-11440.
[40] RUMELHART D E, HINTON G E, WILLIAMS R J. Learning internal representations by error propagation[R]. California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
[41] RENSINK R A. The dynamic representation of scenes[J]. Visual Cognition, 2000, 7(1-3): 17-42.
[42] CORBETTA M, SHULMAN G L. Control of goal-directed and stimulus-driven attention in the brain[J]. Nature Reviews Neuroscience, 2002, 3(3): 201-215.
[43] BAHDANAU D, CHO K, BENGIO Y. Neural machine translation by jointly learning to align and translate[M]. arXiv preprint arXiv:1409.0473, 2014.
[44] XU K, BA J, KIROS R, et al. Show, attend and tell: Neural image caption generation with visual attention[C]//International conference on machine learning. PMLR, 2015: 2048-2057.
[45] WOO S, PARK J, LEE J Y, et al. Cbam: Convolutional block attention module[C]//Proceedings of the European conference on computer vision (ECCV). 2018: 3-19.
[46] BLANZ V, SCHERBAUM K, VETTER T, et al. Exchanging faces in images[C]//Computer Graphics Forum: volume 23. Wiley Online Library, 2004: 669-676.
[47] NIRKIN Y, MASI I, TUAN A T, et al. On face segmentation, face swapping, and face perception[C]//2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). IEEE, 2018: 98-105.
[48] NIRKIN Y, KELLER Y, HASSNER T. Fsgan: Subject agnostic face swapping and reenactment[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2019: 7184-7193.
[49] BAO J, CHEN D, WEN F, et al. Towards open-set identity preserving face synthesis[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 6713-6722.
[50] LI L, BAO J, YANG H, et al. Faceshifter: Towards high fidelity and occlusion aware face swapping[M]. arXiv preprint arXiv:1912.13457, 2019.
[51] CHUNG J S, NAGRANI A, ZISSERMAN A. VoxCeleb2: Deep speaker recognition[M]. arXiv preprint arXiv:1806.05622, 2018.
[52] CHUNG J S, ZISSERMAN A. Lip reading in the wild[C]//Asian conference on computer vision. Springer, 2016: 87-103.
[53] AFOURAS T, CHUNG J S, SENIOR A, et al. Deep audio-visual speech recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[54] AFOURAS T, CHUNG J S, ZISSERMAN A. LRS3-TED: a large-scale dataset for visual speech recognition[M]. arXiv preprint arXiv:1809.00496, 2018.
[55] PARKHI O M, VEDALDI A, ZISSERMAN A. Deep face recognition[C]//Proceedings of the British Machine Vision Conference. BMVA Press, 2015.
[56] DAVIS S, MERMELSTEIN P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences[J]. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1980, 28(4): 357-366.
[57] CHUNG J S, ZISSERMAN A. Out of time: automated lip sync in the wild[C]//Asian conference on computer vision. Springer, 2016: 251-263.
[58] JOHNSON J, ALAHI A, FEI-FEI L. Perceptual losses for real-time style transfer and superresolution[C]//European conference on computer vision. Springer, 2016: 694-711.
[59] YANG T, REN P, XIE X, et al. GAN prior embedded network for blind face restoration in the wild[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 672-681.
[60] HORE A, ZIOU D. Image quality metrics: PSNR vs. SSIM[C]//2010 20th international conference on pattern recognition. IEEE, 2010: 2366-2369.
[61] NEWMARCH J. FFmpeg/Libav[M]//Linux Sound Programming. Apress, 2017: 227-234.
[62] ZHANG S, ZHU X, LEI Z, et al. S3FD: Single shot scale-invariant face detector[C]//Proceedings of the IEEE international conference on computer vision. 2017: 192-201.
[63] KINGMA D P, BA J. Adam: A method for stochastic optimization[M]. arXiv preprint arXiv:1412.6980, 2014.
[64] CHEN L, CUI G, LIU C, et al. Talking-head generation with rhythmic head motion[C]//European Conference on Computer Vision. Springer, 2020: 35-51.
[65] CHENG K, CUN X, ZHANG Y, et al. VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild[C]//SIGGRAPH Asia 2022 Conference Papers. 2022: 1-9.
[66] SONG Y, ZHU J, LI D, et al. Talking face generation by conditional recurrent adversarial network[M]. arXiv preprint arXiv:1804.04786, 2018.
[67] VOUGIOUKAS K, PETRIDIS S, PANTIC M. Realistic speech-driven facial animation with gans[J]. International Journal of Computer Vision, 2020, 128: 1398-1413.
[68] MUKHERJEE S, ASNANI H, LIN E, et al. Clustergan: Latent space clustering in generative adversarial networks[C]//Proceedings of the AAAI conference on artificial intelligence: volume 33. 2019: 4610-4617.
[69] KARRAS T, LAINE S, AITTALA M, et al. Analyzing and improving the image quality of stylegan[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020: 8110-8119.