A Cross-modal Audio Search Engine based on Joint Audio-Text Embeddings

Microsoft Research

Published September 7, 2018, 15:21

Ad-hoc audio clips, such as those from smart speakers, social media apps, security cameras and podcasts, are recorded and shared online every day. For a variety of applications, it is important to be able to search these recordings effectively. Web-based multimedia search engines, which independently index content or textual tags, are not suitable for ad-hoc audio recordings, because such recordings lack reliable human- or machine-generated tags and their audio content has low specificity.

In this work, we propose to connect the audio and text modalities through a joint-embedding framework that allows the two modalities to exchange semantic information within a shared latent space. This enables content-based and text-based features associated with ad-hoc audio recordings to be mapped together and compared directly for cross-modal search and retrieval. We also show that these jointly learnt embeddings outperform embeddings learnt from a single modality. Our results thus lay the groundwork for a cross-modal Audio Search Engine that permits searching through ad-hoc recordings with either text or audio queries.
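
To illustrate the retrieval step only, here is a minimal sketch of searching a shared latent space. It is not the method from the talk. It assumes that audio clips and queries have already been mapped into the same space by some trained encoders, which are not shown. The vectors, file names, and the `cosine` and `search` helpers are hypothetical, and cosine similarity is one common choice of comparison rather than one confirmed by the source.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search(query_vec, index, top_k=2):
    """Rank indexed clips by similarity to a query embedding.

    The query may come from text or from audio: once both modalities
    live in the same space, the comparison is identical.
    """
    scored = [(clip_id, cosine(query_vec, vec)) for clip_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical 3-dimensional joint embeddings of indexed clips (made-up values).
AUDIO_INDEX = {
    "dog_bark.wav": [0.9, 0.1, 0.0],
    "rain.wav": [0.0, 0.2, 0.9],
    "doorbell.wav": [0.1, 0.9, 0.1],
}

# Assumed outputs of a text encoder and an audio encoder, respectively.
text_query = [0.8, 0.2, 0.1]    # e.g. the words "dog barking"
audio_query = [0.1, 0.1, 0.95]  # e.g. a short recording of rainfall

text_hits = search(text_query, AUDIO_INDEX)
audio_hits = search(audio_query, AUDIO_INDEX)
```

With these made-up vectors, the text query ranks `dog_bark.wav` first and the audio query ranks `rain.wav` first, which is the sense in which a single index can serve both text and audio queries.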

See more at microsoft.com/en-us/research/v...
