SUSTech KC: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring

Title	InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring
Authors	Yuan, Zhihao 1; Yan, Xu 1; Liao, Yinghong 1; Zhang, Ruimao 1; Wang, Sheng 2; Li, Zhen 1; Cui, Shuguang 1
Corresponding author	Li, Zhen
DOI	10.1109/ICCV48922.2021.00181
Publication date	2021
ISSN	1550-5499
ISBN	978-1-6654-2813-2
Proceedings title	Proceedings of the IEEE International Conference on Computer Vision
Pages	1771-1780
Conference date	10-17 Oct. 2021
Conference location	Montreal, QC, Canada
Abstract	Compared with visual grounding on 2D images, natural-language-guided 3D object localization on point clouds is more challenging. In this paper, we propose a new model, named InstanceRefer, to achieve superior 3D visual grounding through a grounding-by-matching strategy. In practice, our model first predicts the target category from the language description using a simple language classification model. Then, based on the category, our model sifts out a small number of instance candidates (usually fewer than 20) from the panoptic segmentation on point clouds. Thus, the non-trivial 3D visual grounding task is effectively re-formulated as a simplified instance-matching problem, considering that instance-level candidates are more rational than redundant 3D object proposals. Subsequently, for each candidate, we perform multi-level contextual inference, i.e., referring from instance attribute perception, instance-to-instance relation perception, and instance-to-background global localization perception, respectively. Eventually, the most relevant candidate is selected and localized by ranking confidence scores, which are obtained by cooperative holistic visual-language feature matching. Experiments confirm that our method outperforms previous state-of-the-art methods on the ScanRefer online benchmark and the Nr3D/Sr3D datasets.
Keywords	Vision + language; Detection and localization in 2D and 3D; Scene analysis and understanding; Visual reasoning and logical representation
SUSTech authorship	Other
Language	English
Related links	[Scopus record]
Indexed by	EI
Funding	National Key Research and Development Program of China [2018YFB1800800]
EI accession number	20221511951300
Scopus record ID	2-s2.0-85120974687
Source database	Scopus
Full-text link	https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9711198
Citation statistics	Times cited [WOS]: 26
Document type	Conference paper
Item identifier	http://sustech.caswiz.com/handle/2SGJ60CL/329678
Collection	Southern University of Science and Technology
Author affiliations	1. The Chinese University of Hong Kong (Shenzhen), Shenzhen Research Institute of Big Data, Hong Kong; 2. CryoEM Center, Southern University of Science and Technology, China
Recommended citation (GB/T 7714)	Yuan, Zhihao, Yan, Xu, Liao, Yinghong, et al. InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring[C], 2021: 1771-1780.
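
The grounding-by-matching strategy summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The names `Candidate`, `predict_category`, and `ground` are hypothetical, the keyword lookup is a toy stand-in for the language classification model, the three per-candidate scores are assumed to be precomputed placeholders rather than learned visual-language matching scores, and summing them is an assumed fusion rule.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass
class Candidate:
    """One segmented instance with placeholder matching scores (all hypothetical)."""
    instance_id: int
    category: str
    attribute_score: float  # stands in for instance attribute perception
    relation_score: float   # stands in for instance-to-instance relation perception
    global_score: float     # stands in for instance-to-background global localization


def predict_category(description: str, vocabulary: Set[str]) -> Optional[str]:
    """Toy stand-in for the language classifier: first vocabulary word in the text."""
    for word in description.lower().split():
        token = word.strip(".,")
        if token in vocabulary:
            return token
    return None


def ground(description: str, instances: Iterable[Candidate],
           vocabulary: Set[str]) -> Optional[Candidate]:
    """Filter instances by predicted category, then rank by combined score."""
    category = predict_category(description, vocabulary)
    candidates = [c for c in instances if c.category == category]
    if not candidates:
        return None
    # Assumed fusion rule: plain sum of the three per-level scores.
    return max(
        candidates,
        key=lambda c: c.attribute_score + c.relation_score + c.global_score,
    )


if __name__ == "__main__":
    scene = [
        Candidate(0, "chair", 0.2, 0.1, 0.3),
        Candidate(1, "chair", 0.6, 0.5, 0.4),
        Candidate(2, "table", 0.9, 0.9, 0.9),
    ]
    best = ground("The chair next to the table.", scene, {"chair", "table"})
    print(best.instance_id if best else None)
```

Filtering by category before ranking is what reduces the task to instance matching: only instances of the predicted class are scored, so the table is never considered even though its placeholder scores are higher.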

Files in this item
File name/size	Document type	Version type	Access type	License	Action
10.1109@ICCV48922.20 (7707KB)	--	--	Open access	--	View