[1] BUSCHMAN T J, MILLER E K. Top-down versus bottom-up control of attention in the prefrontal and posterior parietal cortices[J]. Science, 2007, 315(5820): 1860-1862.
[2] KOLLER D, HEINZE N, NAGEL H H. Algorithmic characterization of vehicle trajectories from image sequences by motion verbs[C]//IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1991: 90-91.
[3] BRAND M. The “inverse hollywood problem”: From video to scripts and storyboards via causal analysis[C]//Fourteenth AAAI Conference on Artificial Intelligence and Ninth Conference on Innovative Applications of Artificial Intelligence. AAAI Press, 1997: 132-137.
[4] VIOLA P, JONES M. Rapid object detection using a boosted cascade of simple features[C]//IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2001: I-I.
[5] TORRALBA A, MURPHY K P, FREEMAN W T, et al. Context-based vision system for place and object recognition[C]//IEEE International Conference on Computer Vision: volume 2. IEEE, 2003: 273-273.
[6] LOWE D G. Object recognition from local scale-invariant features[C]//IEEE International Conference on Computer Vision: volume 2. IEEE, 1999: 1150-1157.
[7] FELZENSZWALB P F, GIRSHICK R B, MCALLESTER D, et al. Object detection with discriminatively trained part-based models[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010, 32(9): 1627-1645.
[8] FELZENSZWALB P, MCALLESTER D, RAMANAN D. A discriminatively trained, multiscale, deformable part model[C]//IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2008: 1-8.
[9] FELZENSZWALB P F, GIRSHICK R B, MCALLESTER D. Cascade object detection with deformable part models[C]//IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2010: 2241-2248.
[10] DALAL N, TRIGGS B. Histograms of oriented gradients for human detection[C]//IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2005: 886-893.
[11] CHAUDHRY R, RAVICHANDRAN A, HAGER G, et al. Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions[C]//IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009: 1932-1939.
[12] HONGENG S, BRÉMOND F, NEVATIA R. Bayesian framework for video surveillance application[C]//International Conference on Pattern Recognition. IEEE, 2000: 164-170.
[13] GONG S, XIANG T. Recognition of group activities using dynamic probabilistic networks[C]//IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2003: 742-749.
[14] BOBICK A F, WILSON A D. A state-based approach to the representation and recognition of gesture[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997, 19(12): 1325-1337.
[15] ZHU S C, MUMFORD D. A stochastic grammar of images[M]. Now Publishers Inc, 2007.
[16] MOORE D, ESSA I. Recognizing multitasked activities from video using stochastic context-free grammar[C]//AAAI Conference on Artificial Intelligence. 2002: 770-776.
[17] POLLARD C, SAG I A. Head-driven phrase structure grammar[M]. University of Chicago Press, 1994.
[18] NISHIDA F, TAKAMATSU S. Japanese-English translation through internal expressions[C]//Ninth International Conference on Computational Linguistics. 1982.
[19] NISHIDA F, TAKAMATSU S, TANI T, et al. Feedback of correcting information in postediting to a machine translation system[C]//International Conference on Computational Linguistics. 1988.
[20] HAKEEM A, SHEIKH Y, SHAH M. CASE: a hierarchical event representation for the analysis of videos[C]//AAAI Conference on Artificial Intelligence. 2004: 263-268.
[21] KHAN M U G, ZHANG L, GOTOH Y. Human focused video description[C]//IEEE International Conference on Computer Vision Workshops. IEEE, 2011: 1480-1487.
[22] LEE M W, HAKEEM A, HAERING N, et al. SAVE: A framework for semantic annotation of visual events[C]//IEEE Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 2008: 1-8.
[23] NEVATIA R, HOBBS J, BOLLES B. An ontology for video event representation[C]//IEEE Conference on Computer Vision and Pattern Recognition Workshop. IEEE, 2004: 119-119.
[24] GUADARRAMA S, KRISHNAMOORTHY N, MALKARNENKAR G, et al. Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition[C]//IEEE International Conference on Computer Vision. 2013: 2712-2719.
[25] THOMASON J, VENUGOPALAN S, GUADARRAMA S, et al. Integrating language and vision to generate natural language descriptions of videos in the wild[R]. University of Texas at Austin, Austin, United States, 2014.
[26] CHEN D, DOLAN W B. Collecting highly parallel data for paraphrase evaluation[C]//The 49th Annual Meeting of the Association for Computational Linguistics. 2011: 190-200.
[27] ROHRBACH A, ROHRBACH M, QIU W, et al. Coherent multi-sentence video description with variable level of detail[C]//German Conference on Pattern Recognition. Springer, 2014: 184-195.
[28] ROHRBACH A, ROHRBACH M, TANDON N, et al. A dataset for movie description[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2015: 3202-3212.
[29] TORABI A, PAL C, LAROCHELLE H, et al. Using descriptive video services to create a large data source for video annotation research[A]. 2015.
[30] XU J, MEI T, YAO T, et al. MSR-VTT: A large video description dataset for bridging video and language[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2016: 5288-5296.
[31] WANG X, WU J, CHEN J, et al. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research[C]//IEEE International Conference on Computer Vision. 2019: 4581-4591.
[32] ROHRBACH M, QIU W, TITOV I, et al. Translating video content to natural language descriptions[C]//IEEE International Conference on Computer Vision. 2013: 433-440.
[33] KOEHN P, HOANG H, BIRCH A, et al. Moses: Open source toolkit for statistical machine translation[C]//The 45th Annual Meeting of the Association for Computational Linguistics: Demo and Poster Sessions. 2007: 177-180.
[34] KOJIMA A, TAMURA T, FUKUNAGA K. Natural language description of human activities from video images based on concept hierarchy of actions[J]. International Journal of Computer Vision, 2002, 50(2): 171-184.
[35] DAS P, XU C, DOELL R F, et al. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2013: 2634-2641.
[36] KRISHNAMOORTHY N, MALKARNENKAR G, MOONEY R, et al. Generating natural-language video descriptions using text-mined knowledge[C]//AAAI Conference on Artificial Intelligence. 2013.
[37] XU R, XIONG C, CHEN W, et al. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework[C]//AAAI Conference on Artificial Intelligence: volume 29. 2015.
[38] YU H, SISKIND J M. Learning to describe video with weak supervision by exploiting negative sentential information[C]//Twenty-Ninth AAAI Conference on Artificial Intelligence. 2015.
[39] CORSO J. GBS: Guidance by semantics - using high-level visual inference to improve vision-based mobile robot localization[R]. State University of New York at Buffalo, Amherst, 2015.
[40] SUN C, NEVATIA R. Semantic aware video transcription using random forest classifiers[C]//European Conference on Computer Vision. Springer, 2014: 772-786.
[41] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. Imagenet classification with deep convolutional neural networks[J]. Advances in Neural Information Processing Systems, 2012, 25.
[42] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[A]. 2014.
[43] SZEGEDY C, LIU W, JIA Y, et al. Going deeper with convolutions[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2015: 1-9.
[44] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[45] CHO K, VAN MERRIËNBOER B, BAHDANAU D, et al. On the properties of neural machine translation: Encoder-decoder approaches[A]. 2014.
[46] SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[J]. Advances in Neural Information Processing Systems, 2014, 27.
[47] GRAVES A, JAITLY N. Towards end-to-end speech recognition with recurrent neural networks[C]//International Conference on Machine Learning. PMLR, 2014: 1764-1772.
[48] DONAHUE J, HENDRICKS L A, GUADARRAMA S, et al. Long-term recurrent convolutional networks for visual recognition and description[C]//IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2015: 2625-2634.
[49] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: A neural image caption generator[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2015: 3156-3164.
[50] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[J]. Advances in Neural Information Processing Systems, 2017, 30.
[51] TAN G, LIU D, WANG M, et al. Learning to discretely compose reasoning module networks for video captioning[A]. 2020.
[52] ZHENG Q, WANG C, TAO D. Syntax-aware action targeting for video captioning[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2020: 13096-13105.
[53] PAN B, CAI H, HUANG D A, et al. Spatio-temporal graph for video captioning with knowledge distillation[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2020: 10870-10879.
[54] PEREZ-MARTIN J, BUSTOS B, PÉREZ J. Improving video captioning with temporal composition of a visual-syntactic embedding[C]//IEEE Winter Conference on Applications of Computer Vision. 2021: 3039-3049.
[55] DESHPANDE A, ANEJA J, WANG L, et al. Fast, Diverse and Accurate Image Captioning Guided by Part-Of-Speech[C]//IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2019: 10695-10704.
[56] KRISHNA R, HATA K, REN F, et al. Dense-captioning events in videos[C]//IEEE International Conference on Computer Vision. 2017: 706-715.
[57] JOHNSON J, KARPATHY A, FEI-FEI L. Densecap: Fully convolutional localization networks for dense captioning[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2016: 4565-4574.
[58] WANG J, JIANG W, MA L, et al. Bidirectional attentive fusion with context gating for dense video captioning[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2018: 7190-7198.
[59] LIU W, ANGUELOV D, ERHAN D, et al. SSD: Single shot multibox detector[C]//European Conference on Computer Vision. Springer, 2016: 21-37.
[60] WANG J, JIANG W, MA L, et al. Bidirectional attentive fusion with context gating for dense video captioning[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2018: 7190-7198.
[61] YANG D, YUAN C. Hierarchical context encoding for events captioning in videos[C]//IEEE International Conference on Image Processing. IEEE, 2018: 1288-1292.
[62] WANG T, ZHENG H, YU M, et al. Event-centric hierarchical representation for dense video captioning[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2020, 31(5): 1890-1900.
[63] IASHIN V, RAHTU E. Multi-modal dense video captioning[C]//IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2020: 958-959.
[64] IASHIN V, RAHTU E. A better use of audio-visual cues: Dense video captioning with bi-modal transformer[A]. 2020.
[65] LI Y, YAO T, PAN Y, et al. Jointly localizing and describing events for dense video captioning[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2018: 7492-7500.
[66] ZHOU L, ZHOU Y, CORSO J J, et al. End-to-end dense video captioning with masked transformer[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2018: 8739-8748.
[67] WANG T, ZHANG R, LU Z, et al. End-to-End Dense Video Captioning with Parallel Decoding[C]//IEEE International Conference on Computer Vision. 2021: 6847-6857.
[68] KRAUSE J, JOHNSON J, KRISHNA R, et al. A hierarchical approach for generating descriptive image paragraphs[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2017: 317-325.
[69] LUO Y, HUANG Z, ZHANG Z, et al. Curiosity-driven reinforcement learning for diverse visual paragraph generation[C]//ACM International Conference on Multimedia. 2019: 2341-2350.
[70] MELAS-KYRIAZI L, RUSH A M, HAN G. Training for diversity in image paragraph captioning[C]//Conference on Empirical Methods in Natural Language Processing. 2018: 757-761.
[71] XIONG Y, DAI B, LIN D. Move forward and tell: A progressive generator of video descriptions[C]//European Conference on Computer Vision. 2018: 468-483.
[72] SONG Y, CHEN S, JIN Q. Towards Diverse Paragraph Captioning for Untrimmed Videos[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2021: 11245-11254.
[73] QIAN J, DONG L, SHEN Y, et al. Controllable Natural Language Generation with Contrastive Prefixes[J]. CoRR, 2022, abs/2202.13257.
[74] CHEN L, JIANG Z, XIAO J, et al. Human-like Controllable Image Captioning with Verb-specific Semantic Roles[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2021: 16846-16856.
[75] KIM D J, CHOI J, OH T H, et al. Dense relational captioning: Triple-stream networks for relationship-based captioning[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2019: 6271-6280.
[76] CORNIA M, BARALDI L, CUCCHIARA R. Show, control and tell: A framework for generating controllable and grounded captions[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2019: 8307-8316.
[77] LINDH A, ROSS R J, KELLEHER J D. Language-driven region pointer advancement for controllable image captioning[A]. 2020.
[78] CHEN S, JIN Q, WANG P, et al. Say as you wish: Fine-grained control of image caption generation with abstract scene graphs[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2020: 9962-9971.
[79] ZHONG Y, WANG L, CHEN J, et al. Comprehensive Image Captioning via Scene Graph Decomposition[C]//Lecture Notes in Computer Science: volume 12359 European Conference on Computer Vision. Springer, 2020: 211-229.
[80] PONT-TUSET J, UIJLINGS J R R, CHANGPINYO S, et al. Connecting Vision and Language with Localized Narratives[C]//Lecture Notes in Computer Science: volume 12350 European Conference on Computer Vision. Springer, 2020: 647-664.
[81] DENG C, DING N, TAN M, et al. Length-Controllable Image Captioning[C]//Lecture Notes in Computer Science: volume 12358 European Conference on Computer Vision. Springer, 2020: 712-729.
[82] DAI B, FIDLER S, URTASUN R, et al. Towards diverse and natural image descriptions via a conditional GAN[C]//IEEE International Conference on Computer Vision. 2017: 2970-2979.
[83] SHETTY R, ROHRBACH M, ANNE HENDRICKS L, et al. Speaking the same language: Matching machine to human captions by adversarial training[C]//IEEE International Conference on Computer Vision. 2017: 4135-4144.
[84] LI D, HUANG Q, HE X, et al. Generating diverse and accurate visual captions by comparative adversarial learning[A]. 2018.
[85] ANEJA J, AGRAWAL H, BATRA D, et al. Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning[C]//IEEE International Conference on Computer Vision. IEEE, 2019: 4260-4269.
[86] MAHAJAN S, ROTH S. Diverse Image Captioning with Context-Object Split Latent Spaces[C]//Conference on Neural Information Processing Systems. 2020.
[87] CHEN F, JI R, JI J, et al. Variational structured semantic inference for diverse image captioning[Z]. 2019.
[88] BLEI D M, KUCUKELBIR A, MCAULIFFE J D. Variational Inference: A Review for Statisticians[J]. CoRR, 2016, abs/1601.00670.
[89] WANG L, SCHWING A G, LAZEBNIK S. Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space[C]//Conference on Neural Information Processing Systems. 2017: 5756-5766.
[90] MAHAJAN S, GUREVYCH I, ROTH S. Latent Normalizing Flows for Many-to-Many Cross-Domain Mappings[C]//International Conference on Learning Representations. 2020.
[91] VIJAYAKUMAR A K, COGSWELL M, SELVARAJU R R, et al. Diverse Beam Search for Improved Description of Complex Scenes[C]//AAAI Conference on Artificial Intelligence. AAAI Press, 2018: 7371-7379.
[92] WANG Z, WU F, LU W, et al. Diverse Image Captioning via GroupTalk[C]//International Joint Conference on Artificial Intelligence. 2016: 2957-2964.
[93] CHEN F, JI R, SUN X, et al. Groupcap: Group-based image captioning with structured relevance and diversity constraints[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2018: 1345-1353.
[94] BISHOP C M, NASRABADI N M. Pattern Recognition and Machine Learning: volume 4[M]. Springer, 2006.
[95] RUMELHART D E, HINTON G E, WILLIAMS R J. Learning representations by back-propagating errors[J]. Nature, 1986, 323(6088): 533-536.
[96] DOERSCH C. Tutorial on variational autoencoders[A]. 2016.
[97] BEPLER T, ZHONG E, KELLEY K, et al. Explicitly disentangling image content from translation and rotation with spatial-VAE[J]. Advances in Neural Information Processing Systems, 2019, 32.
[98] ZHENG Z, SUN L. Disentangling latent space for VAE by label relevant/irrelevant dimensions[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2019: 12192-12201.
[99] LOCATELLO F, BAUER S, LUCIC M, et al. Challenging common assumptions in the unsupervised learning of disentangled representations[C]//International Conference on Machine Learning. PMLR, 2019: 4114-4124.
[100] HIGGINS I, MATTHEY L, PAL A, et al. beta-VAE: Learning basic visual concepts with a constrained variational framework[Z]. 2016.
[101] BLEI D M, KUCUKELBIR A, MCAULIFFE J D. Variational inference: A review for statisticians[J]. Journal of the American Statistical Association, 2017, 112(518): 859-877.
[102] KINGMA D P, WELLING M. Auto-Encoding Variational Bayes[C]//International Conferenceon Learning Representations. 2014.
[103] KRAMER M A. Nonlinear principal component analysis using autoassociative neural networks[J]. AIChE Journal, 1991, 37(2): 233-243.
[104] SHEKHOVTSOV A, SCHLESINGER D, FLACH B. VAE Approximation Error: ELBO and Exponential Families[C]//International Conference on Learning Representations. 2021.
[105] RASMUSSEN C E, WILLIAMS C K I. Adaptive computation and machine learning: Gaussian processes for machine learning[M]. MIT Press, 2006.
[106] KINGMA D P, WELLING M. An introduction to variational autoencoders[A]. 2019.
[107] SOHN K, LEE H, YAN X. Learning Structured Output Representation using Deep Conditional Generative Models[C]//Conference on Neural Information Processing Systems. 2015: 3483-3491.
[108] LUCAS J, TUCKER G, GROSSE R B, et al. Understanding Posterior Collapse in Generative Latent Variable Models[C]//International Conference on Learning Representations. 2019.
[109] LUCAS J, TUCKER G, GROSSE R B, et al. Don’t blame the ELBO! A linear VAE perspective on posterior collapse[J]. Advances in Neural Information Processing Systems, 2019, 32: 9408-9418.
[110] ABDAR M, POURPANAH F, HUSSAIN S, et al. A review of uncertainty quantification in deep learning: Techniques, applications and challenges[J]. Information Fusion, 2021, 76: 243-297.
[111] HÜLLERMEIER E, WAEGEMAN W. Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods[J]. Machine Learning, 2021, 110(3): 457-506.
[112] BENDER E M, KOLLER A. Climbing towards NLU: On meaning, form, and understanding in the age of data[C]//Fifty-eighth Annual Meeting of the Association for Computational Linguistics. 2020: 5185-5198.
[113] KENDALL A, GAL Y. What uncertainties do we need in Bayesian deep learning for computer vision?[J]. Advances in Neural Information Processing Systems, 2017, 30.
[114] CHEN D, DOLAN W B. Collecting highly parallel data for paraphrase evaluation[C]//The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. 2011: 190-200.
[115] ZHANG Z, SHI Y, YUAN C, et al. Object relational graph with teacher-recommended learning for video captioning[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2020: 13278-13288.
[116] XU G, NIU S, TAN M, et al. Towards Accurate Text-Based Image Captioning With Content Diversity Exploration[C]//IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2021: 12637-12646.
[117] YANG X, ZHANG H, CAI J. Learning to collocate neural modules for image captioning[C]//IEEE International Conference on Computer Vision. 2019: 4250-4260.
[118] BOWMAN S R, VILNIS L, VINYALS O, et al. Generating Sentences from a Continuous Space[C]//Conference on Computational Natural Language Learning. ACL, 2016: 10-21.
[119] KAY W, CARREIRA J, SIMONYAN K, et al. The Kinetics Human Action Video Dataset[J]. CoRR, 2017, abs/1705.06950.
[120] FANG H, XIONG P, XU L, et al. CLIP2Video: Mastering Video-Text Retrieval via Image CLIP[J]. CoRR, 2021, abs/2106.11097.
[121] WANG Q, CHAN A B. Describing like humans: on diversity in image captioning[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2019: 4195-4203.
[122] DENG J, DONG W, SOCHER R, et al. ImageNet: A large-scale hierarchical image database[C]//IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2009: 248-255.
[123] SZEGEDY C, IOFFE S, VANHOUCKE V, et al. Inception-v4, inception-resnet and the impact of residual connections on learning[C]//Thirty-first AAAI Conference on Artificial Intelligence. 2017.
[124] CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the Kinetics dataset[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2017: 6299-6308.
[125] LIU L, TANG J, WAN X, et al. Generating diverse and descriptive image captions using visual paraphrases[C]//IEEE International Conference on Computer Vision. 2019: 4240-4249.
[126] DEVLIN J, CHANG M, LEE K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]//North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2019: 4171-4186.
[127] QI P, ZHANG Y, ZHANG Y, et al. Stanza: A Python natural language processing toolkit for many human languages[A]. 2020.
[128] CORDTS M, OMRAN M, RAMOS S, et al. The Cityscapes Dataset for Semantic Urban Scene Understanding[C]//IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2016: 3213-3223.
[129] AAFAQ N, AKHTAR N, LIU W, et al. Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2019: 12487-12496.
[130] REIMERS N, GUREVYCH I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks[C]//Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2019: 3980-3990.
[131] WU X, DYER E, NEYSHABUR B. When Do Curricula Work?[C]//International Conference on Learning Representations. 2021.
[132] PAPINENI K, ROUKOS S, WARD T, et al. Bleu: a method for automatic evaluation of machine translation[C]//Annual Meeting of the Association for Computational Linguistics. 2002: 311-318.
[133] LIN C Y. Rouge: A package for automatic evaluation of summaries[C]//Text Summarization Branches Out. 2004: 74-81.
[134] BANERJEE S, LAVIE A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments[C]//Annual Meeting of the Association for Computational Linguistics Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 2005: 65-72.
[135] VEDANTAM R, LAWRENCE ZITNICK C, PARIKH D. CIDEr: Consensus-based image description evaluation[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2015: 4566-4575.
[136] ANDERSON P, FERNANDO B, JOHNSON M, et al. SPICE: Semantic propositional image caption evaluation[C]//European Conference on Computer Vision. Springer, 2016: 382-398.
[137] FELLBAUM C. WordNet[M]//Theory and applications of ontology: computer applications. Springer, 2010: 231-243.
[138] ROBERTSON S. Understanding inverse document frequency: on theoretical arguments for IDF[J]. Journal of Documentation, 2004.
[139] CROUSE D F. On implementing 2D rectangular assignment algorithms[J]. IEEE Transactions on Aerospace and Electronic Systems, 2016, 52(4): 1679-1696.