Title

Learning Vision-Language Representation for Multimodal Understanding

Name
王腾
Name in Pinyin
WANG Teng
Student ID
12050030
Degree Type
Doctoral
Degree Discipline
Computer Science
Supervisor
郑锋 (ZHENG Feng)
Supervisor's Affiliation
Department of Computer Science and Engineering
Thesis Defense Date
2024-06-10
Thesis Submission Date
2024-08-23
Degree-Granting Institution
The University of Hong Kong
Place of Degree Conferral
Hong Kong
Abstract

Humans comprehend and interact with their surroundings through the integration of multi-sensory information, including visual, linguistic, and auditory cues. The field of vision-language representation learning is dedicated to enabling machines to learn multimodal associations and interactions between visual and textual data. This thesis tackles three pivotal problems: scalability of the pretraining data, efficiency of the pretraining objectives, and fine-grained vision-language alignments. Regarding data scalability, we focus on scalable vision-language representation learning that leverages unpaired images and texts. To enhance the implicit alignments between modalities and augment data diversity, we introduce cross-modal cutmix, a technique for blending visual patches with sentences to create multimodal sentences, i.e., a multimodal view of a sentence. By incorporating diverse multimodal sentences into contrastive learning, instance-level alignments between textual and multimodal samples are effectively exploited. Our model circumvents the constraints of paired datasets, facilitating scalable multimodal representation learning with a broader and more varied collection of unpaired data.
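To make the cross-modal cutmix idea concrete, the following minimal PyTorch-style sketch splices image-patch embeddings into a sentence's token embeddings to form a multimodal sentence, then applies an InfoNCE-style contrastive loss between each sentence and its multimodal view. The function names, the patch_bank lookup, and the mixing ratio are illustrative assumptions, not the exact implementation used in the thesis.

import torch
import torch.nn.functional as F

def cross_modal_cutmix(text_emb, patch_bank, mix_ratio=0.25):
    # text_emb:   (B, L, D) token embeddings of a batch of sentences
    # patch_bank: (N, D) embeddings of image patches tied to visual concepts
    B, L, D = text_emb.shape
    mixed = text_emb.clone()
    num_mix = max(1, int(L * mix_ratio))
    for b in range(B):
        token_idx = torch.randperm(L)[:num_mix]               # positions to overwrite
        patch_idx = torch.randint(0, patch_bank.size(0), (num_mix,))
        mixed[b, token_idx] = patch_bank[patch_idx]           # splice visual patches into the sentence
    return mixed                                              # a "multimodal sentence" per sample

def instance_contrastive_loss(text_feat, mm_feat, temperature=0.07):
    # InfoNCE between each sentence and its cutmixed multimodal view.
    text_feat = F.normalize(text_feat, dim=-1)                # (B, D)
    mm_feat = F.normalize(mm_feat, dim=-1)                    # (B, D)
    logits = text_feat @ mm_feat.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(text_feat.size(0))                 # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

Treating the diagonal of the similarity matrix as positives is what realizes the instance-level alignment between textual and multimodal samples described above.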

In terms of learning efficiency, we investigate methods for accelerating vision-language pretraining. We empirically find that an essential obstacle to training efficiency lies in the entangled prediction rate (percentage of tokens for reconstruction) and corruption rate (percentage of corrupted tokens) in masked language modeling; that is, a proper corruption rate is achieved at the cost of a large portion of output tokens being excluded from the prediction loss. To overcome this limitation, we propose free language modeling (FLM), a new pretraining objective that decouples the prediction rate from the corruption rate in masked language modeling. Our method achieves faster convergence by allowing customization of corruption spans for each token, while maintaining competitive performance on downstream vision-language tasks.

Concerning cross-modal alignment granularity, we delve into fine-grained alignments between untrimmed videos and natural language. We propose a grounded vision-language learning (GVL) framework for untrimmed videos, focusing on detecting informative events and aligning multi-sentence descriptions with the corresponding event segments. We introduce the parallel decoding paradigm for dense video captioning (PDVC) to segment videos effectively, enhancing the coherence and readability of the generated dense captions. Furthermore, two dual pretext tasks are proposed to encourage fine-grained segment-level alignments: text-to-event contrast and event-to-text generation. The framework is versatile and applicable to visually-grounded language understanding and generation tasks.
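The decoupling at the heart of free language modeling can be sketched, under illustrative assumptions, by sampling the corruption mask and the prediction mask independently, so that every position may contribute to the reconstruction loss even though only a small fraction of inputs is corrupted. The sketch assumes a model that maps token ids to per-token vocabulary logits; FLM's per-token corruption spans and reconstructor design are not modeled here, so this is a simplified contrast with standard masked language modeling rather than the thesis recipe.

import torch
import torch.nn.functional as F

def decoupled_mlm_loss(model, tokens, corruption_rate=0.15, prediction_rate=1.0,
                       mask_token_id=103):
    # tokens: (B, L) input token ids
    B, L = tokens.shape
    corrupt_mask = torch.rand(B, L) < corruption_rate         # which inputs get corrupted
    predict_mask = torch.rand(B, L) < prediction_rate         # which outputs get supervised

    corrupted = tokens.clone()
    corrupted[corrupt_mask] = mask_token_id                   # corrupt the chosen inputs

    logits = model(corrupted)                                 # assumed shape (B, L, vocab_size)
    vocab_size = logits.size(-1)
    return F.cross_entropy(logits[predict_mask].view(-1, vocab_size),
                           tokens[predict_mask])

With prediction_rate=1.0 every output token is supervised, whereas standard masked language modeling would supervise only the roughly 15% of positions that were corrupted.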
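For the grounded vision-language framework, the two dual pretext tasks can be summarized as a segment-level contrastive term plus a captioning term over the event proposals produced by the parallel decoder. The sketch below is a hypothetical simplification: it assumes one-to-one pairs of event segments and sentences and omits the event-detection and decoding modules themselves.

import torch
import torch.nn.functional as F

def grounded_pretraining_loss(event_feats, sent_feats, caption_logits, caption_tokens,
                              temperature=0.07, pad_id=0):
    # event_feats:    (K, D) features of K event proposals from the parallel decoder
    # sent_feats:     (K, D) features of the K sentences paired with those events
    # caption_logits: (K, T, V) decoder logits for event-to-text generation
    # caption_tokens: (K, T) reference caption tokens
    e = F.normalize(event_feats, dim=-1)
    s = F.normalize(sent_feats, dim=-1)
    sim = s @ e.t() / temperature                             # (K, K) sentence-to-event similarities
    text_to_event = F.cross_entropy(sim, torch.arange(sim.size(0)))  # each sentence retrieves its own segment

    event_to_text = F.cross_entropy(caption_logits.flatten(0, 1),    # (K*T, V)
                                    caption_tokens.flatten(),        # (K*T,)
                                    ignore_index=pad_id)             # skip padded caption positions
    return text_to_event + event_to_text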

We conduct extensive experiments to validate our proposed methodologies. These efforts not only advance the frontiers of multimodal learning but also pave the way for more efficient and effective integration of vision and language in machine intelligence systems.

Keywords
Language
English
Training Category
Joint Training
Year of Enrollment
2020
Year of Degree Conferral
2024-08
Source Database
Manual Submission
Output Type
Thesis
Item Identifier
http://sustech.caswiz.com/handle/2SGJ60CL/804475
Collection
College of Engineering, Department of Computer Science and Engineering
Recommended Citation
GB/T 7714
Wang T. Learning Vision-Language Representation for Multimodal Understanding[D]. Hong Kong: The University of Hong Kong, 2024.
Files in This Item
File Name/Size: 12050030-王腾-计算机科学与工程 (6823 KB); Access Type: Restricted