SUSTech KC: Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

Title	Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention
Authors	Liu, Xubo 1; Huang, Qiushi 1,4; Mei, Xinhao 1; Liu, Haohe 1; Kong, Qiuqiang 2; Sun, Jianyuan 1; Li, Shengchen 3; Ko, Tom 2; Zhang, Yu 4; Tang, Lilian H. 1; Plumbley, Mark D. 1; Kılıç, Volkan 5; Wang, Wenwu 1
DOI	10.21437/Interspeech.2023-914
Publication date	2023
Conference name	24th Annual Conference of the International Speech Communication Association, Interspeech 2023
ISSN	2308-457X
EISSN	1990-9772
Proceedings title	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2023-August
Pages	2838-2842
Conference date	August 20, 2023 - August 24, 2023
Conference location	Dublin, Ireland
Proceedings editor / conference organizer	Amazon Science; Apple; Dataocean AI; et al.; Google Research; Meta AI
Place of publication	C/O EMMANUELLE FOXONET, 4 RUE DES FAUVETTES, LIEU DIT LOUS TOURILS, BAIXAS, F-66390, FRANCE
Publisher	International Speech Communication Association
Abstract	Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. How to accurately recognize ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects. Specifically, we introduce an off-the-shelf visual encoder to extract video features and incorporate the visual features into an audio captioning system. Furthermore, to better exploit complementary audio-visual contexts, we propose an audio-visual attention mechanism that adaptively integrates audio and visual context and removes the redundant information in the latent space. Experimental results on AudioCaps, the largest audio captioning dataset, show that our proposed method achieves state-of-the-art results on machine translation metrics. © 2023 International Speech Communication Association. All rights reserved.
Keywords	Audio captioning; audio-visual learning; attention mechanism; multimodal learning
University attribution	Other
Language	English
Related links	[Source record]
Indexed by	EI; CPCI-S
Funding	This work is partly supported by UK Engineering and Physical Sciences Research Council (EPSRC) Grant EP/T019751/1 "AI for Sound", a Newton Institutional Links Award from the British Council, titled "Automated Captioning of Image and Audio for Visually and Hearing Impaired" (Grant number 623805725), British Broadcasting Corporation Research and Development (BBC R&D), a PhD scholarship from the University of Surrey, and a Research Scholarship from the China Scholarship Council (CSC). For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising.
WOS research areas	Acoustics; Audiology & Speech-Language Pathology; Computer Science
WOS categories	Acoustics; Audiology & Speech-Language Pathology; Computer Science, Artificial Intelligence; Computer Science, Software Engineering
WOS accession number	WOS:001186650302204
EI accession number	20233814760715
EI main headings	Audio acoustics; Audio signal processing; Behavioral research
EI classification codes	Ergonomics and Human Factors Engineering:461.4; Information Theory and Signal Processing:716.1; Acoustic Waves:751.1; Speech:751.5; Social Sciences:971
Source database	EV Compendex
Document type	Conference paper
Item identifier	http://sustech.caswiz.com/handle/2SGJ60CL/673880
Collection	Southern University of Science and Technology
Author affiliations	1. University of Surrey, United Kingdom; 2. ByteDance, China; 3. Xi'an Jiaotong-Liverpool University, China; 4. Southern University of Science and Technology, China; 5. Izmir Katip Celebi University, Turkey
Recommended citation (GB/T 7714)	Liu, Xubo, Huang, Qiushi, Mei, Xinhao, et al. Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention[C]//Amazon Science; Apple; Dataocean AI; et al.; Google Research; Meta AI. C/O EMMANUELLE FOXONET, 4 RUE DES FAUVETTES, LIEU DIT LOUS TOURILS, BAIXAS, F-66390, FRANCE: International Speech Communication Association, 2023: 2838-2842.