Title | Leveraging per Image-Token Consistency for Vision-Language Pre-training |
Authors | |
DOI | |
Publication date | 2023
|
ISSN | 1063-6919
|
ISBN | 979-8-3503-0130-4
|
Proceedings title | |
Pages | 19155-19164
|
Conference date | 17-24 June 2023
|
Conference location | Vancouver, BC, Canada
|
Abstract | Most existing vision-language pre-training (VLP) approaches adopt cross-modal masked language modeling (CMLM) to learn vision-language associations. However, we find that CMLM is insufficient for this purpose according to our observations: (1) Modality bias: a considerable number of masked tokens in CMLM can be recovered with only the language information, ignoring the visual inputs. (2) Underutilization of the unmasked tokens: CMLM primarily focuses on the masked tokens, but it cannot simultaneously leverage other tokens to learn vision-language associations. To handle these limitations, we propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training). In EPIC, for each image-sentence pair, we mask tokens that are salient to the image (i.e., Saliency-based Masking Strategy) and replace them with alternatives sampled from a language model (i.e., Inconsistent Token Generation Procedure), and then the model is required to determine for each token in the sentence whether it is consistent with the image (i.e., Image-Token Consistency Task). The proposed EPIC method is easily combined with pre-training methods. Extensive experiments show that the combination of the EPIC method and state-of-the-art pre-training approaches, including ViLT, ALBEF, METER, and X-VLM, leads to significant improvements on downstream tasks. Our code is released at https://github.com/gyhdog99/epic |
Keywords | |
Institutional authorship rank | First
|
Related links | [IEEE record] |
Indexed by | |
WOS accession number | WOS:001062531303045
|
Source database | IEEE
|
Full-text link | https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10205398 |
Citation statistics |
Times cited [WOS]: 0
|
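Output type | Conference paper |
Item identifier | http://sustech.caswiz.com/handle/2SGJ60CL/559185 |
Collection | Southern University of Science and Technology |
Author affiliations | 1. Southern University of Science and Technology 2. ByteDance AI Lab 3. Hong Kong University of Science and Technology |
First author's affiliation | Southern University of Science and Technology |
First author's primary affiliation | Southern University of Science and Technology |
Recommended citation (GB/T 7714) |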
Yunhao Gou,Tom Ko,Hansi Yang,et al. Leveraging per Image-Token Consistency for Vision-Language Pre-training[C],2023:19155-19164.
|
Files in this item | No files are associated with this item. |
|