Title | Cross-Modal Concept Learning and Inference for Vision-Language Models |
Authors | Zhang, Yi; Zhang, Ce; Tang, Yushun; He, Zhihai |
Corresponding Author | He, Zhihai |
Publication Date | 2024-05-28 |
DOI | |
Journal | Neurocomputing |
ISSN | 0925-2312 |
EISSN | 1872-8286 |
Volume | 583 |
Abstract | Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the correlation between texts and images, achieving remarkable success on various downstream tasks with fine-tuning. In existing fine-tuning methods, the class-specific text description is matched against the whole image. We recognize that this image-scale matching is not effective since images from the same class often contain a set of different semantic objects, and an object further consists of a set of semantic parts or concepts. Individual semantic parts or concepts may appear in image samples from different classes. To address this issue, in this paper, we develop a new method called cross-modal concept learning and inference (CCLI). Using the powerful text-image correlation capability of CLIP, our method automatically learns a large set of distinctive visual concepts from images using a set of semantic text concepts. Based on these visual concepts, we construct a discriminative representation of images and learn a concept inference network to perform downstream image classification tasks, such as few-shot learning and domain generalization. Extensive experimental results demonstrate that our CCLI method improves on the current state-of-the-art methods by large margins, for example, by up to 8.0% on few-shot learning and up to 1.3% on domain generalization. |
Keywords | |
Related Links | [Scopus Record] |
Indexed By | |
Language | English |
University Authorship | Corresponding |
ESI Subject Category | COMPUTER SCIENCE |
Scopus Record ID | 2-s2.0-85188781803 |
Source Database | Scopus |
Citation Statistics | |
Publication Type | Journal Article |
Item Identifier | http://sustech.caswiz.com/handle/2SGJ60CL/741132 |
Collection | Southern University of Science and Technology |
Author Affiliations | 1. Harbin Institute of Technology, Harbin, 150001, China; 2. Southern University of Science and Technology, Shenzhen, 518055, China; 3. Pengcheng Laboratory, Shenzhen, 518000, China |
First Author Affiliation | Southern University of Science and Technology |
Corresponding Author Affiliation | Southern University of Science and Technology |
Recommended Citation (GB/T 7714) | Zhang, Yi, Zhang, Ce, Tang, Yushun, et al. Cross-Modal Concept Learning and Inference for Vision-Language Models[J]. Neurocomputing, 2024, 583. |
APA | Zhang, Yi, Zhang, Ce, Tang, Yushun, & He, Zhihai. (2024). Cross-Modal Concept Learning and Inference for Vision-Language Models. Neurocomputing, 583. |
MLA | Zhang, Yi, et al. "Cross-Modal Concept Learning and Inference for Vision-Language Models". Neurocomputing 583 (2024). |
Files in This Item | No files are associated with this item. |
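The abstract above describes using CLIP's text-image correlation to score a set of semantic text concepts against image features and feeding the resulting concept activations to a concept inference network. The snippet below is a minimal, hypothetical sketch of that general idea only; the concept count, feature dimension, random placeholder features, and the plain linear inference head are illustrative assumptions and do not reproduce the paper's actual CCLI implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch (not the authors' code): score images against text concepts.
# Assume image_feats and concept_text_feats would come from a CLIP image/text
# encoder; random tensors stand in for them here.
num_images, num_concepts, num_classes, dim = 8, 50, 10, 512
image_feats = F.normalize(torch.randn(num_images, dim), dim=-1)
concept_text_feats = F.normalize(torch.randn(num_concepts, dim), dim=-1)

# Cosine similarity between each image and each text concept yields a
# per-image concept-activation vector (a concept-based image representation).
concept_scores = image_feats @ concept_text_feats.t()  # [num_images, num_concepts]

# A plain linear head over concept activations stands in for a concept
# inference network; the real architecture is not specified in this record.
inference_head = torch.nn.Linear(num_concepts, num_classes)
logits = inference_head(concept_scores)
print(logits.argmax(dim=-1))  # predicted class index per image
```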
|