Title

Leveraging per Image-Token Consistency for Vision-Language Pre-training

Authors
DOI
Publication date
2023
ISSN
1063-6919
ISBN
979-8-3503-0130-4
Proceedings title
Pages
19155-19164
Conference dates
17-24 June 2023
Conference location
Vancouver, BC, Canada
Abstract
Most existing vision-language pre-training (VLP) approaches adopt cross-modal masked language modeling (CMLM) to learn vision-language associations. However, we find that CMLM is insufficient for this purpose according to our observations: (1) Modality bias: a considerable amount of masked tokens in CMLM can be recovered with only the language information, ignoring the visual inputs. (2) Underutilization of the unmasked tokens: CMLM primarily focuses on the masked token but it cannot simultaneously leverage other tokens to learn vision-language associations. To handle those limitations, we propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training). In EPIC, for each image-sentence pair, we mask tokens that are salient to the image (i.e., Saliency-based Masking Strategy) and replace them with alternatives sampled from a language model (i.e., Inconsistent Token Generation Procedure), and then the model is required to determine for each token in the sentence whether it is consistent with the image (i.e., Image-Token Consistency Task). The proposed EPIC method is easily combined with pre-training methods. Extensive experiments show that the combination of the EPIC method and state-of-the-art pre-training approaches, including ViLT, ALBEF, METER, and X-VLM, leads to significant improvements on downstream tasks. Our code is released at https://github.com/gyhdog99/epic
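The three steps named in the abstract (saliency-based masking, inconsistent token generation, per-token consistency labels) can be illustrated with a toy sketch. This is not the authors' implementation: the function name `build_consistency_example`, the `saliency` scores, the `replacement_pool` dictionary (a stand-in for sampling from a language model), and the `mask_ratio` value are all assumptions made for illustration, as is the choice to label a token consistent when the sampled alternative happens to equal the original.

```python
import random


def build_consistency_example(tokens, saliency, replacement_pool,
                              mask_ratio=0.3, seed=0):
    """Build one toy training example for a per-token consistency task.

    tokens:           list of caption tokens
    saliency:         hypothetical per-token scores; higher means the token
                      is more strongly tied to the image
    replacement_pool: hypothetical dict mapping a token to candidate
                      alternatives (stand-in for a language model sampler)
    Returns (corrupted_tokens, labels), where label 1 means the token is
    consistent with the image and 0 means it was replaced.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(mask_ratio * len(tokens)))

    # Step 1: pick the most image-salient positions instead of random ones.
    salient = sorted(range(len(tokens)),
                     key=lambda i: saliency[i], reverse=True)[:n_mask]

    corrupted = list(tokens)
    labels = [1] * len(tokens)
    for i in salient:
        candidates = replacement_pool.get(tokens[i], [])
        if not candidates:
            continue  # nothing to sample, keep the original token
        # Step 2: replace the salient token with a sampled alternative.
        sampled = rng.choice(candidates)
        corrupted[i] = sampled
        # Step 3: every position gets a label, not only the masked ones.
        labels[i] = int(sampled == tokens[i])
    return corrupted, labels


tokens = ["a", "dog", "on", "grass"]
saliency = [0.1, 0.9, 0.05, 0.6]
pool = {"dog": ["cat"], "grass": ["sand"]}
corrupted, labels = build_consistency_example(tokens, saliency, pool,
                                              mask_ratio=0.5)
print(corrupted, labels)
```

In this sketch every token receives a label, which mirrors the abstract's point that unmasked tokens should also contribute to learning; a real model would predict these labels from the image and the corrupted sentence.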
Keywords
Institutional attribution
First
Related links [IEEE record]
Indexed in
WOS accession number
WOS:001062531303045
Source database
IEEE
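Full-text link: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10205398
Citation statistics
Times cited [WOS]: 0
Document type: Conference paper
Item identifier: http://sustech.caswiz.com/handle/2SGJ60CL/559185
Collection: Southern University of Science and Technology
Author affiliations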
1.Southern University of Science and Technology
2.ByteDance AI Lab
3.Hong Kong University of Science and Technology
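First author's affiliation: Southern University of Science and Technology
First author's primary affiliation: Southern University of Science and Technology
Recommended citation
GB/T 7714
Yunhao Gou, Tom Ko, Hansi Yang, et al. Leveraging per Image-Token Consistency for Vision-Language Pre-training[C], 2023: 19155-19164.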
