Title

Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

Authors
DOI
Publication date
2023
Conference name
24th International Speech Communication Association, Interspeech 2023
ISSN
2308-457X
EISSN
1990-9772
Proceedings title
Volume
2023-August
Pages
2838-2842
Conference date
August 20, 2023 - August 24, 2023
Conference location
Dublin, Ireland
Proceedings editor / Conference sponsor
Amazon Science; Apple; Dataocean AI; et al.; Google Research; Meta AI
Place of publication
C/O EMMANUELLE FOXONET, 4 RUE DES FAUVETTES, LIEU DIT LOUS TOURILS, BAIXAS, F-66390, FRANCE
Publisher
Abstract
Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. How to accurately recognize ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects. Specifically, we introduce an off-the-shelf visual encoder to extract video features and incorporate the visual features into an audio captioning system. Furthermore, to better exploit complementary audio-visual contexts, we propose an audio-visual attention mechanism that adaptively integrates audio and visual context and removes the redundant information in the latent space. Experimental results on AudioCaps, the largest audio captioning dataset, show that our proposed method achieves state-of-the-art results on machine translation metrics.
© 2023 International Speech Communication Association. All rights reserved.
Keywords
Institutional authorship
Other
Language
English
Related links [Source record]
Indexed by
Funding
This work is partly supported by UK Engineering and Physical Sciences Research Council (EPSRC) Grant EP/T019751/1 "AI for Sound", a Newton Institutional Links Award from the British Council, titled "Automated Captioning of Image and Audio for Visually and Hearing Impaired" (Grant number 623805725), British Broadcasting Corporation Research and Development (BBC R&D), a PhD scholarship from the University of Surrey, and a Research Scholarship from the China Scholarship Council (CSC). For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising.
WOS research areas
Acoustics ; Audiology & Speech-Language Pathology ; Computer Science
WOS categories
Acoustics ; Audiology & Speech-Language Pathology ; Computer Science, Artificial Intelligence ; Computer Science, Software Engineering
WOS accession number
WOS:001186650302204
EI accession number
20233814760715
EI subject terms
Audio acoustics ; Audio signal processing ; Behavioral research
EI classification codes
Ergonomics and Human Factors Engineering:461.4 ; Information Theory and Signal Processing:716.1 ; Acoustic Waves:751.1 ; Speech:751.5 ; Social Sciences:971
Source database
EV Compendex
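Citation statistics
Output type
Conference paper
Item identifier
http://sustech.caswiz.com/handle/2SGJ60CL/673880
Collection
Southern University of Science and Technology
Author affiliations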
1. University of Surrey, United Kingdom
2. ByteDance, China
3. Xi'an Jiaotong-Liverpool University, China
4. Southern University of Science and Technology, China
5. Izmir Katip Celebi University, Turkey
Recommended citation
GB/T 7714
Liu, Xubo, Huang, Qiushi, Mei, Xinhao, et al. Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention[C]//Amazon Science; Apple; Dataocean AI; et al.; Google Research; Meta AI. C/O EMMANUELLE FOXONET, 4 RUE DES FAUVETTES, LIEU DIT LOUS TOURILS, BAIXAS, F-66390, FRANCE: International Speech Communication Association, 2023: 2838-2842.
Files in this item
There are no files associated with this item.
