Title

MOTION-DRIVEN CUSTOMIZATION: FINE-TUNING TEMPORAL LAYER AND CONTROLLING TEXT-TO-VIDEO DIFFUSION MODEL

Alternative Title
基于运动驱动的定制:微调时间层与控制文本到视频扩散模型
Name
江竞舟
Name (Pinyin)
JIANG Jingzhou
Student ID
12232879
Degree Type
Master's
Degree Discipline
0701 Mathematics
Discipline Category/Professional Degree Category
07 Science
Supervisor
荆炳义
Supervisor Affiliation
统计与数据科学系
Thesis Defense Date
2024-05-12
Thesis Submission Date
2024-06-18
Degree-Granting Institution
南方科技大学
Degree-Granting Location
深圳
Abstract

Although large-scale pre-trained diffusion models excel at video generation, specific tasks such as motion customization remain underexplored. This thesis focuses on the motion customization task for text-to-video generation and investigates how to adapt an existing text-to-video diffusion model so that it learns and reproduces a specific motion pattern from a set of reference videos. The goal is to generate new videos that retain the desired motion while varying the visual context; for instance, the motion of a train running along a railroad can be transferred to a snake moving through the jungle. Although inspiration can be drawn from text-to-image adaptation methods, the additional temporal dimension of video makes the problem considerably harder. As a result, commonly used techniques such as full-model fine-tuning, parameter-efficient fine-tuning, and low-rank adaptation struggle to reproduce video motion while allowing visual changes. In particular, applying static-image methods directly to videos often entangles appearance and motion information.

To tackle these challenges, we propose the Motion-Driven Video Customization (MDVC) framework. The method fine-tunes the temporal attention layers of a text-to-video generation model so as to precisely align the residuals between predicted and ground-truth latent variables. In addition, we introduce a motion-guided mask as auxiliary information: a per-frame mask is extracted with an optical flow algorithm and injected into the generation process. Furthermore, we incorporate a video stability control module to improve coherence and smoothness across video frames.
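For readers who think in code, the sketch below illustrates one plausible PyTorch-style realization of these two ideas: fine-tuning only the temporal attention parameters, and weighting a residual-alignment objective with an optical-flow-derived mask. This is a minimal illustration under stated assumptions, not the thesis implementation; the parameter-name filter `temp_attn`, the latent shape (B, F, C, H, W), the `unet(...)` call signature, and the `scheduler.add_noise(...)` interface are placeholders for whatever the underlying text-to-video model actually exposes.

```python
import torch


def select_temporal_attention_params(unet):
    """Freeze all UNet weights except those that look like temporal attention.

    The substring "temp_attn" is an assumed naming convention; a real model
    may label its temporal layers differently.
    """
    trainable = []
    for name, param in unet.named_parameters():
        param.requires_grad = "temp_attn" in name
        if param.requires_grad:
            trainable.append(param)
    return trainable


def motion_mask_from_flow(flow, thresh=1.0):
    """Turn precomputed optical flow (B, F-1, 2, H, W) into a binary motion mask."""
    magnitude = torch.linalg.norm(flow, dim=2, keepdim=True)  # (B, F-1, 1, H, W)
    return (magnitude > thresh).float()


def motion_customization_step(unet, scheduler, num_timesteps,
                              latents, text_emb, motion_mask):
    """One fine-tuning step sketch.

    latents:     (B, F, C, H, W) video latents from a frozen VAE.
    motion_mask: (B, F-1, 1, H, W) mask from `motion_mask_from_flow`.
    """
    b = latents.shape[0]
    noise = torch.randn_like(latents)
    t = torch.randint(0, num_timesteps, (b,), device=latents.device)
    noisy_latents = scheduler.add_noise(latents, noise, t)  # DDPM-style API (assumed)

    # Assumed call signature; real text-to-video UNets differ in detail.
    noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_emb)

    # Frame-to-frame residuals isolate motion from per-frame appearance.
    residual_pred = noise_pred[:, 1:] - noise_pred[:, :-1]
    residual_true = noise[:, 1:] - noise[:, :-1]

    # Up-weight moving regions using the motion-guided mask.
    weight = 1.0 + motion_mask
    loss = (weight * (residual_pred - residual_true) ** 2).mean()
    return loss
```

Restricting gradients to the temporal attention parameters leaves the spatial (appearance) pathways of the pretrained model untouched, which is the intuition behind separating motion from appearance described above.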

To validate our method, we conducted experiments with state-of-the-art video generation models on a variety of real-world videos. Qualitative results show that our approach effectively learns the motion pattern of the input videos while maintaining object consistency. For quantitative evaluation we adopt VBench, a recently proposed benchmark for video generation, and assess several aspects relevant to motion customization, including subject consistency, background consistency, dynamic degree, and motion smoothness. The quantitative results further confirm the strong performance of our framework.
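The two helpers below show, in simplified form, the kind of frame-level statistics that consistency and smoothness scores aggregate. They are illustrative proxies only, not the official VBench implementations; the frame feature extractor and the precomputed optical flow are supplied by the caller and are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F


def temporal_consistency(frame_features):
    """Mean cosine similarity between consecutive frame embeddings.

    frame_features: (F, D) tensor from any frozen image encoder chosen by the
    caller (e.g. a CLIP or DINO backbone). Higher values indicate more
    consistent subjects/backgrounds across frames.
    """
    a = F.normalize(frame_features[:-1], dim=-1)
    b = F.normalize(frame_features[1:], dim=-1)
    return (a * b).sum(dim=-1).mean().item()


def motion_smoothness(flow):
    """Score in (0, 1] based on second-order differences of optical flow.

    flow: (F-1, 2, H, W) precomputed flow between consecutive frames.
    Large frame-to-frame changes in flow (jerky motion) lower the score.
    """
    acceleration = flow[1:] - flow[:-1]   # (F-2, 2, H, W)
    jerk = acceleration.abs().mean()
    return float(1.0 / (1.0 + jerk))
```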

Keywords
Language
English
Training Category
Independent training
Year of Enrollment
2022
Year of Degree Conferral
2024-07
References

[1] BAIN M, NAGRANI A, VAROL G, et al., 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 1728-1738.
[2] BIAN W, HUANG Z, SHI X, et al., 2024. Context-pips: Persistent independent particles demands context features[J]. Advances in Neural Information Processing Systems, 36.
[3] BLATTMANN A, ROMBACH R, LING H, et al., 2023. Align your latents: High-resolution video synthesis with latent diffusion models[C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[4] BROOKS T, HELLSTEN J, AITTALA M, et al., 2022. Generating long videos of dynamic scenes[J]. Advances in Neural Information Processing Systems, 35: 31769-31781.
[5] CARON M, TOUVRON H, MISRA I, et al., 2021. Emerging properties in self-supervised vision transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 9650-9660.
[6] CHEN H, ZHANG Y, CUN X, et al., 2024. Videocrafter2: Overcoming data limitations for high-quality video diffusion models: 2401.09047[A].
[7] CHEN W, WU J, XIE P, et al., 2023. Control-a-video: Controllable text-to-video generation with diffusion models: 2305.13840[A].
[8] Civitai, 2022. Civitai[EB/OL]. https://civitai.com/.
[9] DING M, YANG Z, HONG W, et al., 2021. Cogview: Mastering text-to-image generation via transformers[J]. Advances in Neural Information Processing Systems, 34: 19822-19835.
[10] DING M, ZHENG W, HONG W, et al., 2022. Cogview2: Faster and better text-to-image generation via hierarchical transformers[J]. Advances in Neural Information Processing Systems, 35: 16890-16902.
[11] DOERSCH C, GUPTA A, MARKEEVA L, et al., 2022. Tap-vid: A benchmark for tracking any point in a video[J]. Advances in Neural Information Processing Systems, 35: 13610-13626.
[12] DOSOVITSKIY A, FISCHER P, ILG E, et al., 2015. Flownet: Learning optical flow with convolutional networks[C]//Proceedings of the IEEE International Conference on Computer Vision. 2758-2766.
[13] EFRON B, 2011. Tweedie’s formula and selection bias[J]. Journal of the American Statistical Association, 106(496): 1602-1614.
[14] ESSER P, ROMBACH R, OMMER B, 2021. Taming transformers for high-resolution image synthesis[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 12873-12883.
[15] ESSER P, CHIU J, ATIGHEHCHIAN P, et al., 2023. Structure and content-guided video synthesis with diffusion models[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 7346-7356.
[16] FACE H, 2022. The hugging face course, 2022[EB/OL]. https://huggingface.co/course.
[17] FEICHTENHOFER C, LI Y, HE K, et al., 2022. Masked autoencoders as spatiotemporal learners[J]. Advances in Neural Information Processing Systems, 35: 35946-35958.
[18] FINN C, GOODFELLOW I, LEVINE S, 2016. Unsupervised learning for physical interaction through video prediction[J]. Advances in Neural Information Processing Systems, 29.
[19] GAFNI O, POLYAK A, ASHUAL O, et al., 2022. Make-a-scene: Scene-based text-to-image generation with human priors[C]//European Conference on Computer Vision. Springer: 89-106.
[20] GE S, HAYES T, YANG H, et al., 2022. Long video generation with time-agnostic vqgan and time-sensitive transformer[C]//European Conference on Computer Vision. Springer: 102-118.
[21] GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al., 2014. Generative adversarial nets[C]//Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. Cambridge, MA, USA: MIT Press: 2672–2680.
[22] GU Y, WANG X, WU J Z, et al., 2024. Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models[J]. Advances in Neural Information Processing Systems, 36.
[23] HARLEY A W, FANG Z, FRAGKIADAKI K, 2022. Particle video revisited: Tracking through occlusions using point trajectories[C]//European Conference on Computer Vision. Springer: 59-75.
[24] HARVEY W, NADERIPARIZI S, MASRANI V, et al., 2022. Flexible diffusion modeling of long videos[J]. Advances in Neural Information Processing Systems, 35: 27953-27965.
[25] HE Y, YANG T, ZHANG Y, et al., 2022. Latent video diffusion models for high-fidelity video generation with arbitrary lengths: 2211.13221[A].
[26] HEUSEL M, RAMSAUER H, UNTERTHINER T, et al., 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium[J]. Advances in Neural Information Processing Systems, 30.
[27] HO J, CHAN W, SAHARIA C, et al., 2022. Imagen video: High definition video generation with diffusion models: 2210.02303[A].
[28] HO J, SALIMANS T, GRITSENKO A, et al., 2022. Video diffusion models[J]. Advances in Neural Information Processing Systems, 35: 8633-8646.
[29] HONG W, DING M, ZHENG W, et al., 2023. Cogvideo: Large-scale pretraining for text-to-video generation via transformers[C]//The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.
[30] HORN B K, SCHUNCK B G, 1981. Determining optical flow[J]. Artificial intelligence, 17(1-3): 185-203.
[31] HU E J, SHEN Y, WALLIS P, et al., 2022. LoRA: Low-rank adaptation of large language models[C]//International Conference on Learning Representations.
[32] HUANG B, ZHAO Z, ZHANG G, et al., 2023. Mgmae: Motion guided masking for video masked autoencoding[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 13493-13504.
[33] HUANG Z, SHI X, ZHANG C, et al., 2022. Flowformer: A transformer architecture for optical flow[C]//European Conference on Computer Vision. Springer: 668-685.
[34] HUANG Z, HE Y, YU J, et al., 2024. VBench: Comprehensive benchmark suite for video generative models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[35] JAEGLE A, BORGEAUD S, ALAYRAC J B, et al., 2021. Perceiver io: A general architecture for structured inputs & outputs[C]//International Conference on Learning Representations.
[36] JAEGLE A, GIMENO F, BROCK A, et al., 2021. Perceiver: General perception with iterative attention[C]//International Conference on Machine Learning. PMLR: 4651-4664.
[37] JEONG H, PARK G Y, YE J C, 2024. Vmc: Video motion customization using temporal attention adaption for text-to-video diffusion models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[38] JIA D, WANG K, LUO S, et al., 2021. Braft: Recurrent all-pairs field transforms for optical flow based on correlation blocks[J]. IEEE Signal Processing Letters, 28: 1575-1579.
[39] KARAEV N, ROCCO I, GRAHAM B, et al., 2023. Cotracker: It is better to track together: 2307.07635[A].
[40] KHACHATRYAN L, MOVSISYAN A, TADEVOSYAN V, et al., 2023. Text2video-zero: Text-to-image diffusion models are zero-shot video generators[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 15954-15964.
[41] LE MOING G, PONCE J, SCHMID C, 2024. Dense optical tracking: Connecting the dots[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[42] MA Y, YANG H, LIU B, et al., 2022. Ai illustrator: Translating raw descriptions into images by prompt-based cross-modal generation[J]. Proceedings of the 30th ACM International Conference on Multimedia.
[43] MA Y, YANG H, WANG W, et al., 2022. Unified multi-modal latent diffusion for joint subject and text conditional image generation: 2303.09319[A].
[44] MA Y, HE Y, CUN X, et al., 2024. Follow your pose: Pose-guided text-to-video generation using pose-free videos[C]//Proceedings of the AAAI Conference on Artificial Intelligence: Vol. 38. 4117-4125.
[45] MANSIMOV E, PARISOTTO E, BA L J, et al., 2016. Generating images from captions with attention[C]//4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
[46] MITTAL G, MARWAH T, BALASUBRAMANIAN V N, 2017. Sync-draw: Automatic video generation using deep recurrent attentive architectures[C]//Proceedings of the 25th ACM International Conference on Multimedia. 1096-1104.
[47] NICHOL A Q, DHARIWAL P, RAMESH A, et al., 2022. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models[C]//CHAUDHURI K, JEGELKA S, SONG L, et al. Proceedings of Machine Learning Research: Vol. 162 Proceedings of the 39th International Conference on Machine Learning. PMLR: 16784-16804.
[48] PONT-TUSET J, PERAZZI F, CAELLES S, et al., 2017. The 2017 davis challenge on video object segmentation: 1704.00675[A].
[49] QIN X, ZHANG Z, HUANG C, et al., 2020. U2-net: Going deeper with nested u-structure for salient object detection[J]. Pattern Recognition, 106: 107404.
[50] RADFORD A, KIM J W, HALLACY C, et al., 2021. Learning transferable visual models from natural language supervision[C]//International Conference on Machine Learning. PMLR: 8748-8763.
[51] RAMESH A, PAVLOV M, GOH G, et al., 2021. Zero-shot text-to-image generation[C]//International Conference on Machine Learning. PMLR: 8821-8831.
[52] RAMESH A, DHARIWAL P, NICHOL A, et al., 2022. Hierarchical text-conditional image generation with clip latents: 2204.06125[A].
[53] REED S, AKATA Z, YAN X, et al., 2016. Generative adversarial text to image synthesis[C]//ICML’16: Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48. New York, NY, USA: JMLR.org: 1060-1069.
[54] ROMBACH R, BLATTMANN A, LORENZ D, et al., 2022. High-resolution image synthesis with latent diffusion models[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10674-10685.
[55] RUIZ N, LI Y, JAMPANI V, et al., 2023. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[56] SAHARIA C, CHAN W, SAXENA S, et al., 2022. Photorealistic text-to-image diffusion models with deep language understanding: 2205.11487[A].
[57] SAITO M, MATSUMOTO E, SAITO S, 2017. Temporal generative adversarial nets with singular value clipping[C]//Proceedings of the IEEE International Conference on Computer Vision. 2830-2839.
[58] SALIMANS T, HO J, 2022. Progressive distillation for fast sampling of diffusion models[C]//International Conference on Learning Representations.
[59] SALIMANS T, GOODFELLOW I, ZAREMBA W, et al., 2016. Improved techniques for training gans[J]. Advances in Neural Information Processing Systems, 29.
[60] SCHUHMANN C, BEAUMONT R, VENCU R, et al., 2022. Laion-5b: An open large-scale dataset for training next generation image-text models[C]//KOYEJO S, MOHAMED S, AGARWAL A, et al. Advances in Neural Information Processing Systems: Vol. 35. Curran Associates, Inc.: 25278-25294.
[61] SCHUHMANN C, BEAUMONT R, VENCU R, et al., 2022. Laion-5b: An open large-scale dataset for training next generation image-text models[J]. Advances in Neural Information Processing Systems, 35: 25278-25294.
[62] SHI X, HUANG Z, BIAN W, et al., 2023. Videoflow: Exploiting temporal cues for multi-frame optical flow estimation[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 12469-12480.
[63] SHI X, HUANG Z, LI D, et al., 2023. Flowformer++: Masked cost volume autoencoding for pretraining optical flow estimation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1599-1610.
[64] SINGER U, POLYAK A, HAYES T, et al., 2023. Make-a-video: Text-to-video generation without text-video data[C]//The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.
[65] SMITH J S, HSU Y C, ZHANG L, et al., 2023. Continual diffusion: Continual customization of text-to-image diffusion with c-lora: 2304.06027[A].
[66] SONG J, MENG C, ERMON S, 2020. Denoising diffusion implicit models[C]//International Conference on Learning Representations.
[67] SONG Y, ERMON S, 2019. Generative modeling by estimating gradients of the data distribution[J]. Advances in Neural Information Processing Systems, 32.
[68] SONG Y, SOHL-DICKSTEIN J, KINGMA D P, et al., 2021. Score-based generative modeling through stochastic differential equations[C]//International Conference on Learning Representations.
[69] SUN J, SHEN Z, WANG Y, et al., 2021. Loftr: Detector-free local feature matching with transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8922-8931.
[70] TEED Z, DENG J, 2020. Raft: Recurrent all-pairs field transforms for optical flow[C]//Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. Springer: 402-419.
[71] TONG Z, SONG Y, WANG J, et al., 2022. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training[J]. Advances in Neural Information Processing Systems, 35: 10078-10093.
[72] TU Z, WANG Y, BIRKBECK N, et al., 2021. Ugc-vqa: Benchmarking blind video quality assessment for user generated content[J]. IEEE Transactions on Image Processing, 30: 4449-4464.
[73] UNTERTHINER T, VAN STEENKISTE S, KURACH K, et al., 2019. Fvd: A new metric for video generation[J]. International Conference on Learning Representations.
[74] VASWANI A, SHAZEER N, PARMAR N, et al., 2017. Attention is all you need[J]. Advances in Neural Information Processing Systems, 30.
[75] VILLEGAS R, BABAEIZADEH M, KINDERMANS P J, et al., 2022. Phenaki: Variable length video generation from open domain textual descriptions[C]//International Conference on Learning Representations.
[76] VONDRICK C, PIRSIAVASH H, TORRALBA A, 2016. Generating videos with scene dynamics[J]. Advances in Neural Information Processing Systems, 29.
[77] WAH C, BRANSON S, WELINDER P, et al., 2011. The caltech-ucsd birds-200-2011 dataset[M]. California Institute of Technology.
[78] WAH C, BRANSON S, WELINDER P, et al., 2011. The caltech-ucsd birds-200-2011 dataset[M]. California Institute of Technology.
[79] WANG X, YUAN H, ZHANG S, et al., 2024. Videocomposer: Compositional video synthesis with motion controllability[J]. Advances in Neural Information Processing Systems, 36.
[80] WANG Y, LONG M, WANG J, et al., 2017. Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms[J]. Advances in Neural Information Processing Systems, 30.
[81] WANG Z, YUAN Z, WANG X, et al., 2023. Motionctrl: A unified and flexible motion controller for video generation: 2312.03641[A].
[82] WU C, LIANG J, JI L, et al., 2022. Nüwa: Visual synthesis pre-training for neural visual world creation[C]//European Conference on Computer Vision. Springer: 720-736.
[83] WU J Z, GE Y, WANG X, et al., 2023. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 7623-7633.
[84] WU R, CHEN L, YANG T, et al., 2024. Lamp: Learn a motion pattern for few-shot video generation[C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[85] XIA W, YANG Y, XUE J H, et al., 2021. Tedigan: Text-guided diverse face image generation and manipulation[C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[86] XUE H, HANG T, ZENG Y, et al., 2022. Advancing high-resolution video-language representation with large-scale video transcriptions[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5036-5045.
[87] YAN W, ZHANG Y, ABBEEL P, et al., 2021. Videogpt: Video generation using vq-vae and transformers: 2104.10157[A].
[88] YANG B, GU S, ZHANG B, et al., 2023. Paint by example: Exemplar-based image editing with diffusion models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18381-18391.
[89] YANG G, RAMANAN D, 2019. Volumetric correspondence networks for optical flow[J]. Advances in Neural Information Processing Systems, 32.
[90] YIN S, WU C, LIANG J, et al., 2023. Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory: 2308.08089[A].
[91] YU J, XU Y, KOH J Y, et al., 2022. Scaling autoregressive models for content-rich text-to-image generation[J]. Transactions on Machine Learning Research.
[92] ZHANG D J, WU J Z, LIU J W, et al., 2023. Show-1: Marrying pixel and latent diffusion models for text-to-video generation: 2309.15818[A].
[93] ZHAO R, GU Y, WU J Z, et al., 2023. Motiondirector: Motion customization of text-to-video diffusion models: 2310.08465[A].
[94] ZHOU D, WANG W, YAN H, et al., 2023. Magicvideo: Efficient video generation with latent diffusion models[A]. arXiv: 2211.11018.

Degree Evaluation Subcommittee
Mathematics
Chinese Library Classification Number
TP18
Source Database
Manual submission
Document Type
Dissertation
Item Identifier
http://sustech.caswiz.com/handle/2SGJ60CL/765642
Collection
南方科技大学
理学院_统计与数据科学系
Recommended Citation
GB/T 7714
Jiang JZ. MOTION-DRIVEN CUSTOMIZATION: FINE-TUNING TEMPORAL LAYER AND CONTROLLING TEXT-TO-VIDEO DIFFUSION MODEL[D]. 深圳: 南方科技大学, 2024.
Files in This Item
File Name/Size  Document Type  Version Type  Access Type  License
12232879-江竞舟-统计与数据科学 (7782KB)  --  --  Restricted access  --