[1] STAHLBERG F. Neural machine translation: A review[J]. Journal of Artificial Intelligence Research, 2020, 69: 343-418.
[2] DENG J, DONG W, SOCHER R, et al. ImageNet: A large-scale hierarchical image database[C]//2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009: 248-255.
[3] TapTapSee[EB/OL]. https://taptapseeapp.com/.
[4] WORLD HEALTH ORGANIZATION. World report on vision[R]. 2019.
[5] China Association of the Blind. Main data bulletin[EB/OL]. http://www.zgmx.org.cn/newsdetail/d-13367-0.html.
[6] GURARI D, LI Q, STANGL A J, et al. VizWiz grand challenge: Answering visual questions from blind people[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 3608-3617.
[7] WU Q, TENEY D, WANG P, et al. Visual question answering: A survey of methods and datasets[J]. Computer Vision and Image Understanding, 2017, 163: 21-40.
[8] MALINOWSKI M, FRITZ M. A multi-world approach to question answering about real-world scenes based on uncertain input[J]. Advances in neural information processing systems, 2014, 27.
[9] ANTOL S, AGRAWAL A, LU J, et al. VQA: Visual question answering[C]//Proceedings of the IEEE international conference on computer vision. 2015: 2425-2433.
[10] GOYAL Y, KHOT T, SUMMERS-STAY D, et al. Making the V in VQA matter: Elevating the role of image understanding in visual question answering[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 6904-6913.
[11] CHEN K, WANG J, CHEN L C, et al. ABC-CNN: An attention based convolutional neural network for visual question answering[J]. arXiv preprint arXiv:1511.05960, 2015.
[12] YANG Z, HE X, GAO J, et al. Stacked attention networks for image question answering[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 21-29.
[13] XU H, SAENKO K. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering[C]//European Conference on Computer Vision. Springer, 2016: 451-466.
[14] LU J, YANG J, BATRA D, et al. Hierarchical question-image co-attention for visual question answering[J]. Advances in neural information processing systems, 2016, 29.
[15] ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 6077-6086.
[16] LU J, BATRA D, PARIKH D, et al. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks[J]. Advances in neural information processing systems, 2019, 32.
[17] TAN H, BANSAL M. LXMERT: Learning cross-modality encoder representations from transformers[J]. arXiv preprint arXiv:1908.07490, 2019.
[18] LU J, GOSWAMI V, ROHRBACH M, et al. 12-in-1: Multi-task vision and language representation learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 10437-10446.
[19] SU W, ZHU X, CAO Y, et al. VL-BERT: Pre-training of generic visual-linguistic representations[J]. arXiv preprint arXiv:1908.08530, 2019.
[20] LI L H, YATSKAR M, YIN D, et al. VisualBERT: A simple and performant baseline for vision and language[J]. arXiv preprint arXiv:1908.03557, 2019.
[21] CHEN Y C, LI L, YU L, et al. UNITER: Learning universal image-text representations[J]. 2019.
[22] LI X, YIN X, LI C, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks[C]//European Conference on Computer Vision. Springer, 2020: 121-137.
[23] KIM J H, JUN J, ZHANG B T. Bilinear attention networks[J]. Advances in Neural Information Processing Systems, 2018, 31.
[24] YU Z, YU J, CUI Y, et al. Deep modular co-attention networks for visual question answering[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019: 6281-6290.
[25] YAN M, XIA J, WU C, et al. A deep cascade model for multi-document reading comprehension[C]//Proceedings of the AAAI conference on artificial intelligence: volume 33. 2019: 7354-7361.
[26] QI P, LIN X, MEHR L, et al. Answering complex open-domain questions through iterative query generation[J]. arXiv preprint arXiv:1910.07000, 2019.
[27] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
[28] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778.
[29] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[J]. arXiv preprint arXiv:2010.11929, 2020.
[30] WANG W, YAO L, CHEN L, et al. CrossFormer: A versatile Vision transformer hinging on cross-scale attention[J]. arXiv preprint arXiv:2108.00154, 2021.
[31] REN S, HE K, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[J]. Advances in neural information processing systems, 2015, 28.
[32] REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: Unified, real-time object detection[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 779-788.
[33] PRESTI L L, LA CASCIA M. 3D skeleton-based human action classification: A survey[J]. Pattern Recognition, 2016, 53: 130-147.
[34] REN M, KIROS R, ZEMEL R. Exploring models and data for image question answering[J]. Advances in neural information processing systems, 2015, 28.
[35] JIANG H, MISRA I, ROHRBACH M, et al. In defense of grid features for visual question answering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 10267-10276.
[36] KIM W, SON B, KIM I. ViLT: Vision-and-language transformer without convolution or region supervision[C]//International Conference on Machine Learning. PMLR, 2021: 5583-5594.
[37] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural computation, 1997, 9(8): 1735-1780.
[38] QI P, LIN X, MEHR L, et al. Answering complex open-domain questions through iterative query generation[J]. arXiv preprint arXiv:1910.07000, 2019.
[39] GAO H, MAO J, ZHOU J, et al. Are you talking to a machine? Dataset and methods for multilingual image question answering[J]. Advances in neural information processing systems, 2015, 28.
[40] TENENBAUM J B, FREEMAN W T. Separating style and content with bilinear models[J]. Neural computation, 2000, 12(6): 1247-1283.
[41] LIN T Y, ROYCHOWDHURY A, MAJI S. Bilinear CNN models for fine-grained visual recognition[C]//Proceedings of the IEEE international conference on computer vision. 2015: 1449-1457.
[42] FUKUI A, PARK D H, YANG D, et al. Multimodal compact bilinear pooling for visual question answering and visual grounding[J]. arXiv preprint arXiv:1606.01847, 2016.
[43] CHARIKAR M, CHEN K, FARACH-COLTON M. Finding frequent items in data streams[C]//International Colloquium on Automata, Languages, and Programming. Springer, 2002: 693-703.
[44] PIRSIAVASH H, RAMANAN D, FOWLKES C. Bilinear classifiers for visual recognition[J]. Advances in neural information processing systems, 2009, 22.
[45] YU Z, YU J, FAN J, et al. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering[C]//Proceedings of the IEEE international conference on computer vision. 2017: 1821-1830.
[46] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[J]. Advances in neural information processing systems, 2017, 30.
[47] QIU X, SUN T, XU Y, et al. Pre-trained models for natural language processing: A survey[J]. Science China Technological Sciences, 2020, 63(10): 1872-1897.
[48] LE Q, MIKOLOV T. Distributed representations of sentences and documents[C]//International conference on machine learning. PMLR, 2014: 1188-1196.
[49] HINTON G E, SALAKHUTDINOV R R. Reducing the dimensionality of data with neural networks[J]. Science, 2006, 313(5786): 504-507.
[50] ZHANG R, ISOLA P, EFROS A A. Colorful image colorization[C]//European conference on computer vision. Springer, 2016: 649-666.
[51] LEDIG C, THEIS L, HUSZÁR F, et al. Photo-realistic single image super-resolution using a generative adversarial network[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 4681-4690.
[52] PATHAK D, KRAHENBUHL P, DONAHUE J, et al. Context encoders: Feature learning by inpainting[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 2536-2544.
[53] NOROOZI M, FAVARO P. Unsupervised learning of visual representations by solving jigsaw puzzles[C]//European conference on computer vision. Springer, 2016: 69-84.
[54] GIDARIS S, SINGH P, KOMODAKIS N. Unsupervised representation learning by predicting image rotations[J]. arXiv preprint arXiv:1803.07728, 2018.
[55] DOERSCH C, GUPTA A, EFROS A A. Unsupervised visual representation learning by context prediction[C]//Proceedings of the IEEE international conference on computer vision. 2015: 1422-1430.
[56] CARON M, BOJANOWSKI P, JOULIN A, et al. Deep clustering for unsupervised learning of visual features[C]//Proceedings of the European conference on computer vision (ECCV). 2018: 132-149.
[57] XU H, YAN M, LI C, et al. E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning[J]. arXiv preprint arXiv:2106.01804, 2021.
[58] GAO T, FISCH A, CHEN D. Making pre-trained language models better few-shot learners[J]. arXiv preprint arXiv:2012.15723, 2020.
[59] CHO J, LEI J, TAN H, et al. Unifying vision-and-language tasks via text generation[C]//International Conference on Machine Learning. PMLR, 2021: 1931-1942.
[60] TOUVRON H, CORD M, DOUZE M, et al. Training data-efficient image transformers & distillation through attention[C]//International Conference on Machine Learning. PMLR, 2021: 10347-10357.
[61] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.
[62] SHIBATA Y, KIDA T, FUKAMACHI S, et al. Byte Pair encoding: A text compression scheme that accelerates pattern matching[J]. 1999.
[63] LI J, SELVARAJU R, GOTMARE A, et al. Align before fuse: Vision and language representation learning with momentum distillation[J]. Advances in Neural Information Processing Systems, 2021, 34.
[64] CUBUK E D, ZOPH B, SHLENS J, et al. RandAugment: Practical automated data augmentation with a reduced search space[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020: 702-703.
[65] LOSHCHILOV I, HUTTER F. Decoupled weight decay regularization[J]. arXiv preprint arXiv:1711.05101, 2017.
[66] SHARMA P, DING N, GOODMAN S, et al. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018: 2556-2565.
[67] ORDONEZ V, KULKARNI G, BERG T. Im2text: Describing images using 1 million captioned photographs[J]. Advances in neural information processing systems, 2011, 24.
[68] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: Common objects in context[C]//European conference on computer vision. Springer, 2014: 740-755.
[69] KRISHNA R, ZHU Y, GROTH O, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations[J]. International journal of computer vision, 2017, 123(1): 32-73.
[70] HUANG Z, ZENG Z, LIU B, et al. Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers[J]. arXiv preprint arXiv:2004.00849, 2020.
[71] XUE H, HUANG Y, LIU B, et al. Probing Inter-modality: Visual Parsing with Self-Attention for Vision-and-Language Pre-training[J]. Advances in Neural Information Processing Systems, 2021, 34.
[72] RAFFEL C, SHAZEER N, ROBERTS A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer[J]. arXiv preprint arXiv:1910.10683, 2019.
[73] YU Z, YU J, XIANG C, et al. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering[J]. IEEE transactions on neural networks and learning systems, 2018, 29(12): 5947-5959.