Title | Cross-Modal Concept Learning and Inference for Vision-Language Models |
Authors | Zhang, Yi; Zhang, Ce; Tang, Yushun; He, Zhihai |
Corresponding Author | He, Zhihai |
Publication Date | 2024-05-28 |
DOI | |
Journal | Neurocomputing |
ISSN | 0925-2312 |
EISSN | 1872-8286 |
Volume | 583 |
Abstract | Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the correlation between texts and images, achieving remarkable success on various downstream tasks with fine-tuning. In existing fine-tuning methods, the class-specific text description is matched against the whole image. We recognize that this image-scale matching is not effective since images from the same class often contain a set of different semantic objects, and an object further consists of a set of semantic parts or concepts. Individual semantic parts or concepts may appear in image samples from different classes. To address this issue, in this paper, we develop a new method called cross-modal concept learning and inference (CCLI). Using the powerful text-image correlation capability of CLIP, our method automatically learns a large set of distinctive visual concepts from images using a set of semantic text concepts. Based on these visual concepts, we construct a discriminative representation of images and learn a concept inference network to perform downstream image classification tasks, such as few-shot learning and domain generalization. Extensive experimental results demonstrate that our CCLI method improves on the current state-of-the-art methods by large margins, for example, by up to 8.0% on few-shot learning and up to 1.3% on domain generalization. |
Keywords | |
Related Links | [Scopus Record] |
Indexed By | |
Language | English |
University Authorship | Corresponding |
ESI Subject Category | COMPUTER SCIENCE |
Scopus Record ID | 2-s2.0-85188781803 |
Source Database | Scopus |
Citation Statistics | |
Publication Type | Journal Article |
Item Identifier | http://sustech.caswiz.com/handle/2SGJ60CL/741132 |
Collection | Southern University of Science and Technology |
Author Affiliations | 1. Harbin Institute of Technology, Harbin, 150001, China; 2. Southern University of Science and Technology, Shenzhen, 518055, China; 3. Pengcheng Laboratory, Shenzhen, 518000, China |
First Author Affiliation | Southern University of Science and Technology |
Corresponding Author Affiliation | Southern University of Science and Technology |
Recommended Citation (GB/T 7714) | Zhang, Yi, Zhang, Ce, Tang, Yushun, et al. Cross-Modal Concept Learning and Inference for Vision-Language Models[J]. Neurocomputing, 2024, 583. |
APA | Zhang, Yi, Zhang, Ce, Tang, Yushun, & He, Zhihai. (2024). Cross-Modal Concept Learning and Inference for Vision-Language Models. Neurocomputing, 583. |
MLA | Zhang, Yi, et al. "Cross-Modal Concept Learning and Inference for Vision-Language Models". Neurocomputing 583 (2024). |
Files in This Item | No files are associated with this item. |
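The abstract above describes using CLIP's text-image correlation to score a set of semantic text concepts against image features and feeding the resulting concept activations to a concept inference network. The snippet below is a minimal, hypothetical sketch of that general idea only; the concept count, feature dimension, random placeholder features, and the plain linear inference head are illustrative assumptions and do not reproduce the paper's actual CCLI implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch (not the authors' code): score images against text concepts.
# Assume image_feats and concept_text_feats would come from a CLIP image/text
# encoder; random tensors stand in for them here.
num_images, num_concepts, num_classes, dim = 8, 50, 10, 512
image_feats = F.normalize(torch.randn(num_images, dim), dim=-1)
concept_text_feats = F.normalize(torch.randn(num_concepts, dim), dim=-1)

# Cosine similarity between each image and each text concept yields a
# per-image concept-activation vector (a concept-based image representation).
concept_scores = image_feats @ concept_text_feats.t()  # [num_images, num_concepts]

# A plain linear head over concept activations stands in for a concept
# inference network; the real architecture is not specified in this record.
inference_head = torch.nn.Linear(num_concepts, num_classes)
logits = inference_head(concept_scores)
print(logits.argmax(dim=-1))  # predicted class index per image
```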
|