[1] BAIN M, NAGRANI A, VAROL G, et al., 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 1728-1738.
[2] BIAN W, HUANG Z, SHI X, et al., 2024. Context-pips: Persistent independent particles demands context features[J]. Advances in Neural Information Processing Systems, 36.
[3] BLATTMANN A, ROMBACH R, LING H, et al., 2023. Align your latents: High-resolution video synthesis with latent diffusion models[C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[4] BROOKS T, HELLSTEN J, AITTALA M, et al., 2022. Generating long videos of dynamic scenes[J]. Advances in Neural Information Processing Systems, 35: 31769-31781.
[5] CARON M, TOUVRON H, MISRA I, et al., 2021. Emerging properties in self-supervised vision transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 9650-9660.
[6] CHEN H, ZHANG Y, CUN X, et al., 2024. Videocrafter2: Overcoming data limitations for high-quality video diffusion models: 2401.09047[A].
[7] CHEN W, WU J, XIE P, et al., 2023. Control-a-video: Controllable text-to-video generation with diffusion models: 2305.13840[A].
[8] Civitai, 2022. Civitai[EB/OL]. https://civitai.com/.
[9] DING M, YANG Z, HONG W, et al., 2021. Cogview: Mastering text-to-image generation via transformers[J]. Advances in Neural Information Processing Systems, 34: 19822-19835.
[10] DING M, ZHENG W, HONG W, et al., 2022. Cogview2: Faster and better text-to-image generation via hierarchical transformers[J]. Advances in Neural Information Processing Systems, 35: 16890-16902.
[11] DOERSCH C, GUPTA A, MARKEEVA L, et al., 2022. Tap-vid: A benchmark for tracking any point in a video[J]. Advances in Neural Information Processing Systems, 35: 13610-13626.
[12] DOSOVITSKIY A, FISCHER P, ILG E, et al., 2015. Flownet: Learning optical flow with convolutional networks[C]//Proceedings of the IEEE International Conference on Computer Vision. 2758-2766.
[13] EFRON B, 2011. Tweedie's formula and selection bias[J]. Journal of the American Statistical Association, 106(496): 1602-1614.
[14] ESSER P, ROMBACH R, OMMER B, 2021. Taming transformers for high-resolution image synthesis[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 12873-12883.
[15] ESSER P, CHIU J, ATIGHEHCHIAN P, et al., 2023. Structure and content-guided video synthesis with diffusion models[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 7346-7356.
[16] Hugging Face, 2022. The Hugging Face course[EB/OL]. https://huggingface.co/course.
[17] FEICHTENHOFER C, LI Y, HE K, et al., 2022. Masked autoencoders as spatiotemporal learners[J]. Advances in Neural Information Processing Systems, 35: 35946-35958.
[18] FINN C, GOODFELLOW I, LEVINE S, 2016. Unsupervised learning for physical interaction through video prediction[J]. Advances in Neural Information Processing Systems, 29.
[19] GAFNI O, POLYAK A, ASHUAL O, et al., 2022. Make-a-scene: Scene-based text-to-image generation with human priors[C]//European Conference on Computer Vision. Springer: 89-106.
[20] GE S, HAYES T, YANG H, et al., 2022. Long video generation with time-agnostic vqgan and time-sensitive transformer[C]//European Conference on Computer Vision. Springer: 102-118.
[21] GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al., 2014. Generative adversarial nets[C]//Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. Cambridge, MA, USA: MIT Press: 2672-2680.
[22] GU Y, WANG X, WU J Z, et al., 2024. Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models[J]. Advances in Neural Information Processing Systems, 36.
[23] HARLEY A W, FANG Z, FRAGKIADAKI K, 2022. Particle video revisited: Tracking through occlusions using point trajectories[C]//European Conference on Computer Vision. Springer: 59-75.
[24] HARVEY W, NADERIPARIZI S, MASRANI V, et al., 2022. Flexible diffusion modeling of long videos[J]. Advances in Neural Information Processing Systems, 35: 27953-27965.
[25] HE Y, YANG T, ZHANG Y, et al., 2022. Latent video diffusion models for high-fidelity video generation with arbitrary lengths: 2211.13221[A].
[26] HEUSEL M, RAMSAUER H, UNTERTHINER T, et al., 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium[J]. Advances in Neural Information Processing Systems, 30.
[27] HO J, CHAN W, SAHARIA C, et al., 2022. Imagen video: High definition video generation with diffusion models: 2210.02303[A].
[28] HO J, SALIMANS T, GRITSENKO A, et al., 2022. Video diffusion models[J]. Advances in Neural Information Processing Systems, 35: 8633-8646.
[29] HONG W, DING M, ZHENG W, et al., 2023. Cogvideo: Large-scale pretraining for text-to-video generation via transformers[C]//The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.
[30] HORN B K, SCHUNCK B G, 1981. Determining optical flow[J]. Artificial Intelligence, 17(1-3): 185-203.
[31] HU E J, SHEN Y, WALLIS P, et al., 2022. LoRA: Low-rank adaptation of large language models[C]//International Conference on Learning Representations.
[32] HUANG B, ZHAO Z, ZHANG G, et al., 2023. Mgmae: Motion guided masking for video masked autoencoding[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 13493-13504.
[33] HUANG Z, SHI X, ZHANG C, et al., 2022. Flowformer: A transformer architecture for optical flow[C]//European Conference on Computer Vision. Springer: 668-685.
[34] HUANG Z, HE Y, YU J, et al., 2024. VBench: Comprehensive benchmark suite for video generative models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[35] JAEGLE A, BORGEAUD S, ALAYRAC J B, et al., 2021. Perceiver io: A general architecture for structured inputs & outputs[C]//International Conference on Learning Representations.
[36] JAEGLE A, GIMENO F, BROCK A, et al., 2021. Perceiver: General perception with iterative attention[C]//International Conference on Machine Learning. PMLR: 4651-4664.
[37] JEONG H, PARK G Y, YE J C, 2024. Vmc: Video motion customization using temporal attention adaption for text-to-video diffusion models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[38] JIA D, WANG K, LUO S, et al., 2021. Braft: Recurrent all-pairs field transforms for optical flow based on correlation blocks[J]. IEEE Signal Processing Letters, 28: 1575-1579.
[39] KARAEV N, ROCCO I, GRAHAM B, et al., 2023. Cotracker: It is better to track together: 2307.07635[A].
[40] KHACHATRYAN L, MOVSISYAN A, TADEVOSYAN V, et al., 2023. Text2video-zero: Text-to-image diffusion models are zero-shot video generators[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 15954-15964.
[41] LE MOING G, PONCE J, SCHMID C, 2024. Dense optical tracking: Connecting the dots[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[42] MA Y, YANG H, LIU B, et al., 2022. AI illustrator: Translating raw descriptions into images by prompt-based cross-modal generation[C]//Proceedings of the 30th ACM International Conference on Multimedia.
[43] MA Y, YANG H, WANG W, et al., 2023. Unified multi-modal latent diffusion for joint subject and text conditional image generation: 2303.09319[A].
[44] MA Y, HE Y, CUN X, et al., 2024. Follow your pose: Pose-guided text-to-video generation using pose-free videos[C]//Proceedings of the AAAI Conference on Artificial Intelligence: Vol. 38. 4117-4125.
[45] MANSIMOV E, PARISOTTO E, BA L J, et al., 2016. Generating images from captions with attention[C]//4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
[46] MITTAL G, MARWAH T, BALASUBRAMANIAN V N, 2017. Sync-draw: Automatic video generation using deep recurrent attentive architectures[C]//Proceedings of the 25th ACM International Conference on Multimedia. 1096-1104.
[47] NICHOL A Q, DHARIWAL P, RAMESH A, et al., 2022. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models[C]//CHAUDHURI K, JEGELKA S, SONG L, et al. Proceedings of Machine Learning Research: Vol. 162: Proceedings of the 39th International Conference on Machine Learning. PMLR: 16784-16804.
[48] PONT-TUSET J, PERAZZI F, CAELLES S, et al., 2017. The 2017 davis challenge on video object segmentation: 1704.00675[A].
[49] QIN X, ZHANG Z, HUANG C, et al., 2020. U2-net: Going deeper with nested u-structure for salient object detection[J]. Pattern Recognition, 106: 107404.
[50] RADFORD A, KIM J W, HALLACY C, et al., 2021. Learning transferable visual models from natural language supervision[C]//International Conference on Machine Learning. PMLR: 8748-8763.
[51] RAMESH A, PAVLOV M, GOH G, et al., 2021. Zero-shot text-to-image generation[C]//International Conference on Machine Learning. PMLR: 8821-8831.
[52] RAMESH A, DHARIWAL P, NICHOL A, et al., 2022. Hierarchical text-conditional image generation with clip latents: 2204.06125[A].
[53] REED S, AKATA Z, YAN X, et al., 2016. Generative adversarial text to image synthesis[C]//ICML'16: Proceedings of the 33rd International Conference on Machine Learning - Volume 48. New York, NY, USA: JMLR.org: 1060-1069.
[54] ROMBACH R, BLATTMANN A, LORENZ D, et al., 2022. High-resolution image synthesis with latent diffusion models[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10674-10685.
[55] RUIZ N, LI Y, JAMPANI V, et al., 2023. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[56] SAHARIA C, CHAN W, SAXENA S, et al., 2022. Photorealistic text-to-image diffusion models with deep language understanding: 2205.11487[A].
[57] SAITO M, MATSUMOTO E, SAITO S, 2017. Temporal generative adversarial nets with singular value clipping[C]//Proceedings of the IEEE International Conference on Computer Vision. 2830-2839.
[58] SALIMANS T, HO J, 2022. Progressive distillation for fast sampling of diffusion models[C]//International Conference on Learning Representations.
[59] SALIMANS T, GOODFELLOW I, ZAREMBA W, et al., 2016. Improved techniques for training gans[J]. Advances in Neural Information Processing Systems, 29.
[60] SCHUHMANN C, BEAUMONT R, VENCU R, et al., 2022. Laion-5b: An open large-scale dataset for training next generation image-text models[C]//KOYEJO S, MOHAMED S, AGARWAL A, et al. Advances in Neural Information Processing Systems: Vol. 35. Curran Associates, Inc.: 25278-25294.
[61] SCHUHMANN C, BEAUMONT R, VENCU R, et al., 2022. Laion-5b: An open large-scale dataset for training next generation image-text models[J]. Advances in Neural Information Processing Systems, 35: 25278-25294.
[62] SHI X, HUANG Z, BIAN W, et al., 2023. Videoflow: Exploiting temporal cues for multi-frame optical flow estimation[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 12469-12480.
[63] SHI X, HUANG Z, LI D, et al., 2023. Flowformer++: Masked cost volume autoencoding for pretraining optical flow estimation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1599-1610.
[64] SINGER U, POLYAK A, HAYES T, et al., 2023. Make-a-video: Text-to-video generation without text-video data[C]//The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.
[65] SMITH J S, HSU Y C, ZHANG L, et al., 2023. Continual diffusion: Continual customization of text-to-image diffusion with c-lora: 2304.06027[A].
[66] SONG J, MENG C, ERMON S, 2020. Denoising diffusion implicit models[C]//International Conference on Learning Representations.
[67] SONG Y, ERMON S, 2019. Generative modeling by estimating gradients of the data distribution[J]. Advances in Neural Information Processing Systems, 32.
[68] SONG Y, SOHL-DICKSTEIN J, KINGMA D P, et al., 2021. Score-based generative modeling through stochastic differential equations[C]//International Conference on Learning Representations.
[69] SUN J, SHEN Z, WANG Y, et al., 2021. Loftr: Detector-free local feature matching with transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8922-8931.
[70] TEED Z, DENG J, 2020. Raft: Recurrent all-pairs field transforms for optical flow[C]//Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part II 16. Springer: 402-419.
[71] TONG Z, SONG Y, WANG J, et al., 2022. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training[J]. Advances in Neural Information Processing Systems, 35: 10078-10093.
[72] TU Z, WANG Y, BIRKBECK N, et al., 2021. Ugc-vqa: Benchmarking blind video quality assessment for user generated content[J]. IEEE Transactions on Image Processing, 30: 4449-4464.
[73] UNTERTHINER T, VAN STEENKISTE S, KURACH K, et al., 2019. Fvd: A new metric for video generation[C]//International Conference on Learning Representations.
[74] VASWANI A, SHAZEER N, PARMAR N, et al., 2017. Attention is all you need[J]. Advances in Neural Information Processing Systems, 30.
[75] VILLEGAS R, BABAEIZADEH M, KINDERMANS P J, et al., 2022. Phenaki: Variable length video generation from open domain textual descriptions[C]//International Conference on Learning Representations.
[76] VONDRICK C, PIRSIAVASH H, TORRALBA A, 2016. Generating videos with scene dynamics[J]. Advances in Neural Information Processing Systems, 29.
[77] WAH C, BRANSON S, WELINDER P, et al., 2011. The caltech-ucsd birds-200-2011 dataset[M]. California Institute of Technology.
[78] WAH C, BRANSON S, WELINDER P, et al., 2011. The caltech-ucsd birds-200-2011 dataset[M]. California Institute of Technology.
[79] WANG X, YUAN H, ZHANG S, et al., 2024. Videocomposer: Compositional video synthesis with motion controllability[J]. Advances in Neural Information Processing Systems, 36.
[80] WANG Y, LONG M, WANG J, et al., 2017. Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms[J]. Advances in Neural Information Processing Systems, 30.
[81] WANG Z, YUAN Z, WANG X, et al., 2023. Motionctrl: A unified and flexible motion controller for video generation: 2312.03641[A].
[82] WU C, LIANG J, JI L, et al., 2022. Nüwa: Visual synthesis pre-training for neural visual world creation[C]//European Conference on Computer Vision. Springer: 720-736.
[83] WU J Z, GE Y, WANG X, et al., 2023. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 7623-7633.
[84] WU R, CHEN L, YANG T, et al., 2024. Lamp: Learn a motion pattern for few-shot video generation[C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[85] XIA W, YANG Y, XUE J H, et al., 2021. Tedigan: Text-guided diverse face image generation and manipulation[C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[86] XUE H, HANG T, ZENG Y, et al., 2022. Advancing high-resolution video-language representation with large-scale video transcriptions[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5036-5045.
[87] YAN W, ZHANG Y, ABBEEL P, et al., 2021. Videogpt: Video generation using vq-vae and transformers: 2104.10157[A].
[88] YANG B, GU S, ZHANG B, et al., 2023. Paint by example: Exemplar-based image editing with diffusion models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18381-18391.
[89] YANG G, RAMANAN D, 2019. Volumetric correspondence networks for optical flow[J]. Advances in Neural Information Processing Systems, 32.
[90] YIN S, WU C, LIANG J, et al., 2023. Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory: 2308.08089[A].
[91] YU J, XU Y, KOH J Y, et al., 2022. Scaling autoregressive models for content-rich text-to-image generation[J]. Transactions on Machine Learning Research.
[92] ZHANG D J, WU J Z, LIU J W, et al., 2023. Show-1: Marrying pixel and latent diffusion models for text-to-video generation: 2309.15818[A].
[93] ZHAO R, GU Y, WU J Z, et al., 2023. Motiondirector: Motion customization of text-to-video diffusion models: 2310.08465[A].
[94] ZHOU D, WANG W, YAN H, et al., 2023. Magicvideo: Efficient video generation with latent diffusion models: 2211.11018[A].