Title

Research on Diverse Video Captioning Algorithms Based on Latent-Variable Generative Models

Other Title
TOWARDS HUMAN-LIKE DIVERSE VIDEO CAPTIONING VIA A LATENT GENERATIVE MODEL
Name
刘柱
Name in Pinyin
LIU Zhu
Student ID
11930377
Degree Type
Master
Degree Discipline
0809 Electronic Science and Technology
Subject Category / Professional Degree Category
08 Engineering
Supervisor
郑锋
Supervisor's Affiliation
Department of Computer Science and Engineering
Thesis Defense Date
2022-05-08
Thesis Submission Date
2022-06-16
Degree-Granting Institution
Southern University of Science and Technology
Degree Conferral Place
Shenzhen
Abstract

Video captioning aims to generate textual sentences that describe the content of a video, and it has broad application prospects in important areas such as short-video description, news summarization, human-computer assistance, and intelligent assistants. Because video scenes are full of complex interactions and details at different levels, the task usually calls for multiple sentences that express different visual concepts. However, most video captioning models are devoted to generating a single accurate description; although such methods already surpass human performance on current evaluation metrics, they ignore the demand for diversity in video descriptions. Moreover, existing evaluation metrics cannot comprehensively reflect the overall quality of a set of descriptions.

This thesis studies algorithms for diverse video captioning and, based on the conditional variational auto-encoder, proposes a series of training strategies and model architectures. The method first constructs a structured latent space in which action and context are separated, so as to capture the complex object-object and object-environment interactions in video scenes. Specifically, the model first learns a contextual latent variable, representing template information, from the context with the verbs removed; conditioned on this variable, it then learns a latent variable representing interaction information from the verbs, thereby building a richer structured latent space that increases the model's fitting capacity. In addition, contrastive learning further increases the differences among sentences and effectively alleviates posterior collapse, a common problem in variational frameworks. Building on this model, the method further designs and implements a two-stage progressive training scheme: in the first stage, the model is trained on a set of highly distinctive sentences in order to capture a sparse, topic-related space; in the second stage, building on the first, the model is trained on the entire dataset in order to enrich its linguistic expression. Extensive experiments demonstrate the effectiveness of both approaches, qualitatively and quantitatively: they substantially improve the diversity of the generated descriptions without harming accuracy. To measure the overall performance of a set of generated captions, this thesis also proposes two new metrics that consider accuracy and diversity at the same time. Experimental results show that, compared with existing evaluation metrics, the proposed metrics correlate more strongly with human evaluation, which is important for both model evaluation and model selection.
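As a rough illustration only (the thesis's exact objective is not reproduced here), an evidence lower bound with this context/action split could take the following form, assuming z_c denotes the context latent variable inferred from the verb-removed context y_c, z_a the action latent variable inferred from the verbs y_a conditioned on z_c, and v the video features:

\[
\log p_\theta(y \mid v) \;\ge\;
\mathbb{E}_{q_\phi(z_c \mid y_c, v)\, q_\phi(z_a \mid y_a, z_c, v)}\!\left[\log p_\theta(y \mid z_c, z_a, v)\right]
- \mathrm{KL}\!\left(q_\phi(z_c \mid y_c, v)\,\|\,p_\theta(z_c \mid v)\right)
- \mathbb{E}_{q_\phi(z_c \mid y_c, v)}\!\left[\mathrm{KL}\!\left(q_\phi(z_a \mid y_a, z_c, v)\,\|\,p_\theta(z_a \mid z_c, v)\right)\right],
\]

where both priors are conditioned on the video, so that at test time diverse captions can be drawn by first sampling z_c and then z_a given z_c.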

Other Abstract

Video captioning aims to generate natural language sentences that describe a short video, and it has a wide range of applications, e.g., short-video description, news summarization, human-computer interaction, and intelligent agents. Because video scenes are complicated, a set of several sentences covering different levels of visual concepts and details is preferable to a single caption. However, most current video captioning models articulate only one accurate caption; such models even outperform humans in terms of the de facto precision-based metrics, yet they ignore the innate demand for diverse descriptions. Furthermore, we argue that these metrics fail to reflect the overall performance of a caption set.

This thesis studies diverse video captioning and proposes a series of training strategies and model architectures based on conditional variational auto-encoders (VAEs). First, we construct a structured latent space with a split between action and context to capture the complicated interactions in a video scene. Specifically, the model first learns a contextual latent variable from the verb-removed context and then, conditioned on that variable, learns an action latent variable that encodes the interactions in the scene. In addition, contrastive learning further improves the diversity among sentences and alleviates posterior collapse, a common issue in VAEs. Building on this model, we then design a two-stage progressive training mechanism: in the first stage, the model is trained on a subset of distinctive sentences to capture a sparse, topic-related space; in the second stage, it is trained on the whole dataset to increase the expressiveness of the generated utterances. We provide an in-depth quantitative and qualitative analysis of the proposed models and conclude that they improve the diversity of the generated captions by a large margin with little to no sacrifice of accuracy. Moreover, we propose two new metrics that consider both accuracy and diversity, and we show that they correlate more strongly with human evaluation than existing metrics, which provides guidance for evaluation and model selection.
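A minimal sketch of this context/action split with conditional priors is given below; it is an illustration of the sampling structure under assumed pooled video features, linear stand-ins for the encoder, prior, and decoder networks, and made-up dimensions, not the thesis implementation. At inference time, a diverse caption set is obtained by first sampling z_c from p(z_c | v) and then z_a from p(z_a | z_c, v).

import torch
import torch.nn as nn

class SplitLatentCVAE(nn.Module):
    """Toy conditional VAE with a context latent z_c and an action latent z_a conditioned on z_c."""
    def __init__(self, feat_dim=512, z_dim=64, vocab_size=1000):
        super().__init__()
        # posteriors q(z_c | v, verb-removed context) and q(z_a | v, verbs, z_c), used during training
        self.q_ctx = nn.Linear(feat_dim * 2, z_dim * 2)
        self.q_act = nn.Linear(feat_dim * 2 + z_dim, z_dim * 2)
        # conditional priors p(z_c | v) and p(z_a | v, z_c), used for diverse sampling at test time
        self.p_ctx = nn.Linear(feat_dim, z_dim * 2)
        self.p_act = nn.Linear(feat_dim + z_dim, z_dim * 2)
        # stand-in for an autoregressive caption decoder
        self.decoder = nn.Linear(feat_dim + z_dim * 2, vocab_size)

    @staticmethod
    def reparameterize(stats):
        # split predicted statistics into mean / log-variance and draw a sample
        mu, logvar = stats.chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def forward(self, video, ctx_emb, verb_emb):
        # posterior sampling used during training (ELBO / contrastive terms omitted for brevity)
        z_c = self.reparameterize(self.q_ctx(torch.cat([video, ctx_emb], dim=-1)))
        z_a = self.reparameterize(self.q_act(torch.cat([video, verb_emb, z_c], dim=-1)))
        return self.decoder(torch.cat([video, z_c, z_a], dim=-1))

    def generate(self, video, num_samples=5):
        # hierarchical sampling from the priors yields a set of candidate captions
        outputs = []
        for _ in range(num_samples):
            z_c = self.reparameterize(self.p_ctx(video))
            z_a = self.reparameterize(self.p_act(torch.cat([video, z_c], dim=-1)))
            outputs.append(self.decoder(torch.cat([video, z_c, z_a], dim=-1)))
        return torch.stack(outputs, dim=1)  # (batch, num_samples, vocab_size)

model = SplitLatentCVAE()
video_features = torch.randn(2, 512)                       # e.g. pooled clip features
caption_logits = model.generate(video_features, num_samples=3)
print(caption_logits.shape)                                 # torch.Size([2, 3, 1000])

The reconstruction and KL losses from the posterior branches, the contrastive term, and the two-stage progressive training schedule described above would be added on top of this skeleton.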

Keywords
Other Keywords
Language
Chinese
Training Category
Independent training
Year of Enrollment
2019
Year Degree Conferred
2022-06

Degree Assessment Subcommittee
Department of Computer Science and Engineering
Chinese Library Classification (CLC) Number
TP181
Source Repository
Manually submitted
Document Type
Thesis
Identifier
http://sustech.caswiz.com/handle/2SGJ60CL/335853
Collection
College of Engineering / Department of Computer Science and Engineering
Recommended Citation (GB/T 7714)
刘柱. 基于隐变量生成模型的多样化视频描述算法研究[D]. 深圳: 南方科技大学, 2022.
Files in This Item
File Name / Size: 11930377-刘柱-计算机科学与工程 (4109 KB); Access: Restricted