Title | Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention |
Authors | |
DOI | |
Publication date | 2023
|
Conference name | 24th International Speech Communication Association, Interspeech 2023
|
ISSN | 2308-457X
|
EISSN | 1990-9772
|
Proceedings title | |
Volume | 2023-August
|
Pages | 2838-2842
|
Conference dates | August 20, 2023 - August 24, 2023
|
Conference location | Dublin, Ireland
|
Proceedings editor / conference sponsor | Amazon Science; Apple; Dataocean AI; et al.; Google Research; Meta AI
|
Place of publication | C/O EMMANUELLE FOXONET, 4 RUE DES FAUVETTES, LIEU DIT LOUS TOURILS, BAIXAS, F-66390, FRANCE
|
Publisher | |
Abstract | Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. How to accurately recognize ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects. Specifically, we introduce an off-the-shelf visual encoder to extract video features and incorporate the visual features into an audio captioning system. Furthermore, to better exploit complementary audio-visual contexts, we propose an audio-visual attention mechanism that adaptively integrates audio and visual context and removes the redundant information in the latent space. Experimental results on AudioCaps, the largest audio captioning dataset, show that our proposed method achieves state-of-the-art results on machine translation metrics. © 2023 International Speech Communication Association. All rights reserved. |
Keywords | |
University attribution | Other
|
Language | English
|
Related links | [Source record] |
Indexed by | |
Funding | This work is partly supported by UK Engineering and Physical Sciences Research Council (EPSRC) Grant EP/T019751/1 "AI for Sound", a Newton Institutional Links Award from the British Council, titled "Automated Captioning of Image and Audio for Visually and Hearing Impaired" (Grant number 623805725), British Broadcasting Corporation Research and Development (BBC R&D), a PhD scholarship from the University of Surrey, and a Research Scholarship from the China Scholarship Council (CSC). For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising.
|
WOS research areas | Acoustics
; Audiology & Speech-Language Pathology
; Computer Science
|
WOS categories | Acoustics
; Audiology & Speech-Language Pathology
; Computer Science, Artificial Intelligence
; Computer Science, Software Engineering
|
WOS accession number | WOS:001186650302204
|
EI accession number | 20233814760715
|
EI main headings | Audio acoustics
; Audio signal processing
; Behavioral research
|
EI classification codes | Ergonomics and Human Factors Engineering:461.4
; Information Theory and Signal Processing:716.1
; Acoustic Waves:751.1
; Speech:751.5
; Social Sciences:971
|
Source database | EV Compendex
|
Citation statistics | |
Output type | Conference paper |
Item identifier | http://sustech.caswiz.com/handle/2SGJ60CL/673880 |
Collection | Southern University of Science and Technology |
Author affiliations | 1. University of Surrey, United Kingdom; 2. ByteDance, China; 3. Xi'an Jiaotong-Liverpool University, China; 4. Southern University of Science and Technology, China; 5. Izmir Katip Celebi University, Turkey |
Recommended citation (GB/T 7714) |
Liu, Xubo,Huang, Qiushi,Mei, Xinhao,et al. Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention[C]//Amazon Science; Apple; Dataocean AI; et al.; Google Research; Meta AI. C/O EMMANUELLE FOXONET, 4 RUE DES FAUVETTES, LIEU DIT LOUS TOURILS, BAIXAS, F-66390, FRANCE:International Speech Communication Association,2023:2838-2842.
|
Files in this item | No files are associated with this item. |
|