A Cross-modal Audio Search Engine based on Joint Audio-Text Embeddings

Published September 7, 2018, 15:21
Ad-hoc audio clips, such as those from smart speakers, social media apps, security cameras, and podcasts, are recorded and shared online every day. For a variety of applications, it is important to be able to search effectively through these recordings. Web-based multimedia search engines, which index content or textual tags independently, are not suitable for ad-hoc audio recordings: such recordings lack reliable human- or machine-generated tags, and their audio content has low specificity.

In this work, we propose to connect the audio and text modalities through a joint-embedding framework that allows the two modalities to exchange semantic information within a shared latent space. This enables content-based and text-based features associated with ad-hoc audio recordings to be mapped together and compared directly for cross-modal search and retrieval. We also show that these jointly learnt embeddings outperform embeddings learnt from either modality alone. Our results thus lay the groundwork for a cross-modal Audio Search Engine that supports searching through ad-hoc recordings with either text or audio queries.
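
To make the approach concrete, below is a minimal two-tower sketch in PyTorch. The architecture, layer sizes, and names (JointEmbeddingModel, audio_dim, text_dim, shared_dim, text_to_audio_search) are illustrative assumptions rather than the model presented here; the sketch only shows the general pattern of projecting both modalities into one normalized latent space where a text query can be scored directly against indexed audio embeddings.

    # A sketch under assumed inputs: clip-level audio features (e.g. averaged
    # log-mel frames) and text features (e.g. averaged word vectors).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class JointEmbeddingModel(nn.Module):
        def __init__(self, audio_dim=128, text_dim=300, shared_dim=64):
            super().__init__()
            # Audio tower: projects audio features into the shared space.
            self.audio_encoder = nn.Sequential(
                nn.Linear(audio_dim, 256), nn.ReLU(), nn.Linear(256, shared_dim))
            # Text tower: projects text features into the same space.
            self.text_encoder = nn.Sequential(
                nn.Linear(text_dim, 256), nn.ReLU(), nn.Linear(256, shared_dim))

        def forward(self, audio_feats, text_feats):
            # L2-normalize so dot products equal cosine similarity.
            a = F.normalize(self.audio_encoder(audio_feats), dim=-1)
            t = F.normalize(self.text_encoder(text_feats), dim=-1)
            return a, t

    def text_to_audio_search(text_query_emb, audio_index):
        # Rank indexed audio clips by cosine similarity to a text query;
        # audio_index is an (N, shared_dim) matrix of normalized embeddings.
        scores = audio_index @ text_query_emb
        return torch.argsort(scores, descending=True)

    # Example: embed a toy batch, then query the audio index with a text embedding.
    model = JointEmbeddingModel()
    a, t = model(torch.randn(10, 128), torch.randn(10, 300))
    ranking = text_to_audio_search(t[0], a)

In practice, such towers are trained with a contrastive or ranking loss that pulls matching audio-text pairs together in the shared space, so that nearest-neighbor search over the audio index answers text queries (and vice versa).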

See more at microsoft.com/en-us/research/v...