Title | Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples |
Authors | Guanghui Li; Mingqi Gao; Heng Liu; et al. |
DOI | |
Publication Date | 2023 |
Conference | IEEE/CVF International Conference on Computer Vision (ICCV) |
ISSN | 1550-5499 |
ISBN | 979-8-3503-0719-1 |
Proceedings Title | |
Pages | 2684-2693 |
Conference Dates | 1-6 Oct. 2023 |
Conference Location | Paris, France |
Place of Publication | 10662 Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720-1264, USA |
Publisher | IEEE Computer Society |
Abstract | Referring video object segmentation (RVOS), as a supervised learning task, relies on sufficient annotated data for a given scene. However, in more realistic scenarios, only minimal annotations are available for a new scene, which poses significant challenges to existing RVOS methods. With this in mind, we propose a simple yet effective model with a newly designed cross-modal affinity (CMA) module based on a Transformer architecture. The CMA module builds multimodal affinity with a few samples, thus quickly learning new semantic information and enabling the model to adapt to different scenarios. Since the proposed method targets limited samples for new scenes, we generalize the problem as few-shot referring video object segmentation (FS-RVOS). To foster research in this direction, we build a new FS-RVOS benchmark based on currently available datasets. The benchmark covers a wide range of situations and can thus closely simulate real-world scenarios. Extensive experiments show that our model adapts well to different scenarios with only a few samples, reaching state-of-the-art performance on the benchmark. On Mini-Ref-YouTube-VOS, our model achieves an average performance of 53.1 and 54.8, which is 10% better than the baselines. Furthermore, we show impressive results of 77.7 and 74.8 on Mini-Ref-SAIL-VOS, which are significantly better than the baselines. Code is publicly available at https://github.com/hengliusky/Few_shot_RVOS. |
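The abstract's core idea, relating visual tokens to language tokens through an affinity (attention) map, can be illustrated with a generic scaled dot-product cross-attention sketch. This is a hypothetical, simplified illustration of standard cross-attention, not the paper's actual CMA module; the function and variable names are mine:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_affinity(visual_feats, text_feats):
    """Toy cross-modal affinity between visual and language tokens.

    visual_feats: (Nv, d) array of visual token features.
    text_feats:   (Nt, d) array of language token features.
    Returns (Nv, d) language-aware visual features: each visual token is
    replaced by a weighted mix of text tokens, weighted by affinity.
    """
    d = visual_feats.shape[-1]
    # Affinity map: how strongly each visual token attends to each word.
    affinity = softmax(visual_feats @ text_feats.T / np.sqrt(d), axis=-1)
    return affinity @ text_feats
```

In a segmentation model, features fused this way would then be decoded into per-frame masks; here the sketch only shows the affinity computation itself.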
Keywords | |
University Attribution | Other |
Language | English |
Related Links | [IEEE Record] |
Indexed By | |
Funding | National Key R&D Program of China [2022YFF1202903]; National Natural Science Foundation of China [61971004, 62122035]; Natural Science Foundation of Anhui Province, China [2008085MF190]; Equipment Advanced Research Sharing Technology Project, China [80912020104] |
WOS Research Areas | Computer Science; Imaging Science & Photographic Technology |
WOS Categories | Computer Science, Artificial Intelligence; Computer Science, Theory & Methods; Imaging Science & Photographic Technology |
WOS Accession Number | WOS:001159644302087 |
Source Database | IEEE |
Full-text Link | https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10377326 |
Citation Statistics | |
Item Type | Conference Paper |
Identifier | http://sustech.caswiz.com/handle/2SGJ60CL/719105 |
Collection | Southern University of Science and Technology |
Author Affiliations | 1. Anhui University of Technology; 2. Southern University of Science and Technology; 3. United Imaging |
Recommended Citation (GB/T 7714) | Guanghui Li, Mingqi Gao, Heng Liu, et al. Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples[C]. Los Alamitos, CA, USA: IEEE Computer Society, 2023: 2684-2693. |
Files in This Item | No files associated with this item. |
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.