SUSTech KC (南方科技大学知识苑): Leveraging per Image-Token Consistency for Vision-Language Pre-training

Title	Leveraging per Image-Token Consistency for Vision-Language Pre-training
Authors	Yunhao Gou 1; Tom Ko 2; Hansi Yang 3; James Kwok 3; Yu Zhang 1; Mingxuan Wang 2
DOI	10.1109/CVPR52729.2023.01836
Publication date	2023
ISSN	1063-6919
ISBN	979-8-3503-0130-4
Proceedings title	2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Pages	19155-19164
Conference dates	17-24 June 2023
Conference location	Vancouver, BC, Canada
Abstract	Most existing vision-language pre-training (VLP) approaches adopt cross-modal masked language modeling (CMLM) to learn vision-language associations. However, we find that CMLM is insufficient for this purpose according to our observations: (1) Modality bias: a considerable amount of masked tokens in CMLM can be recovered with only the language information, ignoring the visual inputs. (2) Underutilization of the unmasked tokens: CMLM primarily focuses on the masked token but it cannot simultaneously leverage other tokens to learn vision-language associations. To handle those limitations, we propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training). In EPIC, for each image-sentence pair, we mask tokens that are salient to the image (i.e., Saliency-based Masking Strategy) and replace them with alternatives sampled from a language model (i.e., Inconsistent Token Generation Procedure), and then the model is required to determine for each token in the sentence whether it is consistent with the image (i.e., Image-Token Consistency Task). The proposed EPIC method is easily combined with pre-training methods. Extensive experiments show that the combination of the EPIC method and state-of-the-art pre-training approaches, including ViLT, ALBEF, METER, and X-VLM, leads to significant improvements on downstream tasks. Our code is released at https://github.com/gyhdog99/epic
Keywords	Multi-modal learning
University affiliation rank	First
Related links	[IEEE record]
Indexed by	CPCI-S
WOS accession number	WOS:001062531303045
Source database	IEEE
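Full-text link	https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10205398
Citation statistics	Times cited [WOS]: 0
Output type	Conference paper
Item identifier	http://sustech.caswiz.com/handle/2SGJ60CL/559185
Collection	Southern University of Science and Technology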
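Author affiliations	1.Southern University of Science and Technology 2.ByteDance AI Lab 3.Hong Kong University of Science and Technology
First author affiliation	Southern University of Science and Technology
First author's primary affiliation	Southern University of Science and Technology
Recommended citation (GB/T 7714)	Yunhao Gou,Tom Ko,Hansi Yang,et al. Leveraging per Image-Token Consistency for Vision-Language Pre-training[C],2023:19155-19164.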
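
Illustrative note: the abstract describes three steps (saliency-based masking, inconsistent token generation, and a per-token image-token consistency task). The following is a minimal Python sketch of how training targets for such a task might be constructed. It is not the authors' implementation (see the repository linked in the abstract for that). The function name `build_consistency_example`, the `mask_ratio` parameter, the precomputed `saliency` scores, and the `sample_alternative` callback standing in for a language model are all hypothetical. Labeling a sampled token that happens to equal the original as consistent is likewise an assumption of this sketch.

```python
from typing import Callable, List, Sequence, Tuple


def build_consistency_example(
    tokens: Sequence[str],
    saliency: Sequence[float],
    sample_alternative: Callable[[Sequence[str], int], str],
    mask_ratio: float = 0.3,
) -> Tuple[List[str], List[int]]:
    """Return (corrupted_tokens, labels) for a per-token consistency task.

    labels[i] is 1 if token i was replaced by a different token
    (treated here as inconsistent with the image), else 0.
    """
    if len(tokens) != len(saliency):
        raise ValueError("tokens and saliency must have the same length")
    if not tokens:
        return [], []

    # Step 1 (saliency-based masking): pick the most image-salient positions.
    num_to_replace = max(1, round(mask_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)
    chosen = ranked[:num_to_replace]

    # Step 2 (inconsistent token generation): replace each chosen position
    # with an alternative proposed by the (hypothetical) sampler.
    corrupted = list(tokens)
    labels = [0] * len(tokens)
    for i in chosen:
        replacement = sample_alternative(tokens, i)
        corrupted[i] = replacement
        # Assumption: a replacement identical to the original stays consistent.
        labels[i] = int(replacement != tokens[i])

    # Step 3 (image-token consistency): a model would be trained to predict
    # `labels` for every position, given the image and `corrupted`.
    return corrupted, labels
```

In this sketch every position receives a label, not only the replaced ones, which mirrors the abstract's point that unmasked tokens should also contribute to learning vision-language associations.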