Title

MODULAR TRANSFORMER MODEL FOR VISUAL QUESTION ANSWERING

Alternative Title
基于 Transformer 架构的模块化视觉问答模型
Name
李宗蔚
Name (Pinyin)
LI Zongwei
Student ID
11930653
Degree Type
Master's
Degree Discipline
0809 Electronic Science and Technology
Discipline Category / Professional Degree Category
08 Engineering
Supervisor
郑锋
Supervisor's Affiliation
Department of Computer Science and Engineering
Thesis Defense Date
2022-05-08
Thesis Submission Date
2022-06-19
Degree-Granting Institution
Southern University of Science and Technology
Degree-Granting Location
Shenzhen
Abstract

Visual question answering (VQA) is a popular task in the field of artificial intelligence: given an image and a question about that image, the goal is to provide the correct answer to that question. Many existing VQA models are based on the Transformer architecture and have achieved excellent performance, yet several problems remain to be solved. On the one hand, the attention mechanism is the core of the Transformer model, and how it is used to model the interaction between the visual and textual modalities affects the overall performance of the model. On the other hand, many existing methods improve performance through large-scale corpus pre-training followed by downstream fine-tuning, but the mismatch between model structures and tasks across the two stages limits further improvement.

This thesis first revisits two popular attention-based cross-modal interaction modules. To compare their performance on the VQA task fairly, we build a modular framework that achieves state-of-the-art performance and allows the cross-modal interaction module to be replaced easily for fair comparison. Furthermore, we introduce a gating mechanism into the original attention mechanism to cope with situations where the two modalities are not perfectly aligned. In the experiments, we compare the performance of the two modules, demonstrate the effectiveness of the gating module, and analyze its working mechanism through visualization; the gating module also enhances the interpretability of the model.

In addition, we design a unified model structure for upstream and downstream tasks to reduce the gap between pre-training and the downstream task. We build the model with an encoder-decoder structure, which unifies the pre-training tasks and the visual question answering task as text generation, and we propose two pre-training tasks to further unify the upstream and downstream tasks. These improvements effectively boost the performance of the model on the visual question answering task in the conventional setting. We also design experiments to evaluate the model under different amounts of fine-tuning data; the results show that with less fine-tuning data, the proposed method outperforms the baseline model.
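
The gating idea can be illustrated with a minimal sketch, assuming a PyTorch-style cross-attention block in which a learned sigmoid gate scales the attended values before they are fused back into the query stream; the module, its parameter shapes, and the residual fusion below are illustrative assumptions rather than the thesis's actual code.

```python
import torch
import torch.nn as nn


class GatedCrossAttention(nn.Module):
    """Cross-modal attention with a learned gate on the attended output (sketch)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # The gate is conditioned on both the query features and the attended features.
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # text:  (B, L_t, dim) query modality, e.g. question tokens
        # image: (B, L_v, dim) key/value modality, e.g. image regions
        attended, _ = self.attn(query=text, key=image, value=image)
        g = torch.sigmoid(self.gate(torch.cat([text, attended], dim=-1)))
        # g near 0 suppresses poorly aligned cross-modal evidence; g near 1 keeps it.
        return text + g * attended


# Toy usage: 4 question tokens attend over 9 image regions.
block = GatedCrossAttention(dim=256)
out = block(torch.randn(2, 4, 256), torch.randn(2, 9, 256))
print(out.shape)  # torch.Size([2, 4, 256])
```

Because the block keeps the standard cross-attention interface, it can be swapped in and out of a modular framework in the same way as an ungated interaction module.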

Other Abstract

Visual question answering is a popular task in the field of artificial intelligence. Given an image and a question about that image, the goal of the task is to provide the correct answer to the question. Current VQA models are usually based on the Transformer architecture; although these models have achieved excellent results, several problems remain to be solved. On the one hand, the attention mechanism is the core of the Transformer model, and how it is used to model the interaction between the image and the text affects the overall performance of the model. On the other hand, many existing methods improve performance by pre-training on large corpora and then fine-tuning on downstream tasks, but the mismatch between model structures and tasks across these two stages limits further improvement.

This thesis revisits two popular attention-based cross-modal interaction modules. To compare their performance on the visual question answering task fairly, we build a modular VQA framework that achieves state-of-the-art performance and allows the cross-modal interaction module to be replaced conveniently for fair comparison. In addition, we introduce a gating mechanism into the original attention mechanism to handle cases where the two modalities cannot be fully aligned. In the experiments, we compare the two cross-modal interaction modules, demonstrate the effectiveness of the gating module in the attention mechanism, and analyze its working mechanism through visualization, confirming that the gating mechanism improves performance while also enhancing the interpretability of the model to some extent.

We further design a unified model structure and training paradigm for upstream and downstream tasks to reduce the loss incurred when transferring from pre-training to downstream tasks. The model is built with an encoder-decoder structure that unifies the pre-training tasks and the visual question answering task as text generation, narrowing the gap between upstream and downstream tasks in form; based on the characteristics of text generation and visual question answering, we redesign two new pre-training tasks to further unify the upstream and downstream tasks in content. These improvements effectively boost the performance of the model on the visual question answering task in the conventional setting. We also evaluate the model under different amounts of fine-tuning data; the experiments show that with less fine-tuning data, the proposed method outperforms the baseline, indicating more efficient transfer between upstream and downstream tasks, i.e., the model learns knowledge better suited to visual question answering during pre-training.
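
As a rough illustration of casting VQA as text generation with an encoder-decoder model, the sketch below concatenates projected image-region features with question token embeddings as the encoder input and lets the decoder emit the answer as free-form text; every module and parameter name here is a hypothetical stand-in, not the thesis's implementation.

```python
import torch
import torch.nn as nn


class Seq2SeqVQA(nn.Module):
    """Encoder-decoder VQA framed as text generation (illustrative sketch)."""

    def __init__(self, vocab_size: int, dim: int = 256, region_dim: int = 2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.visual_proj = nn.Linear(region_dim, dim)  # project detector region features
        self.backbone = nn.Transformer(d_model=dim, batch_first=True)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, regions, question_ids, answer_ids):
        # Encoder input: image regions concatenated with question tokens, so that
        # generative pre-training and VQA fine-tuning share one input format.
        src = torch.cat([self.visual_proj(regions), self.embed(question_ids)], dim=1)
        # Teacher-forced decoding of the gold answer with a causal mask; at inference
        # the answer would instead be generated token by token.
        tgt = self.embed(answer_ids)
        causal = nn.Transformer.generate_square_subsequent_mask(answer_ids.size(1))
        hidden = self.backbone(src, tgt, tgt_mask=causal)
        return self.lm_head(hidden)  # (B, L_answer, vocab_size) logits


# Toy usage: 36 region features, an 8-token question, a 3-token answer.
model = Seq2SeqVQA(vocab_size=1000)
logits = model(torch.randn(2, 36, 2048),
               torch.randint(0, 1000, (2, 8)),
               torch.randint(0, 1000, (2, 3)))
print(logits.shape)  # torch.Size([2, 3, 1000])
```

In this sketch a single cross-entropy loss over the generated tokens would serve both the pre-training objectives and VQA fine-tuning, which is the sense in which the upstream and downstream tasks are unified.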

Keywords
Other Keywords
Language
English
Training Category
Independently trained
Year of Enrollment
2019
Year of Degree Conferral
2022-06

Degree Evaluation Subcommittee
Department of Computer Science and Engineering
Chinese Library Classification (CLC) Number
TM301.2
Source Repository
Manual submission
Item Type
Thesis
Identifier
http://sustech.caswiz.com/handle/2SGJ60CL/335996
Collection
College of Engineering_Department of Computer Science and Engineering
Recommended Citation
GB/T 7714
LI Zongwei. MODULAR TRANSFORMER MODEL FOR VISUAL QUESTION ANSWERING[D]. Shenzhen: Southern University of Science and Technology, 2022.