A Cross-modal Audio Search Engine based on Joint Audio-Text Embeddings

Published September 7, 2018, 15:21
Ad-hoc audio clips, such as those from smart speakers, social media apps, security cameras and podcasts, are recorded and shared online every day. Many applications need to search these recordings effectively. Web-based multimedia search engines, which independently index content or textual tags, are not well suited to ad-hoc audio recordings, because such recordings lack reliable human- or machine-generated tags and their audio content has low specificity.

In this work, we propose to connect the audio and text modalities through a joint-embedding framework that lets the two modalities exchange semantic information within a shared latent space. This enables content-based and text-based features associated with ad-hoc audio recordings to be mapped into the same space and compared directly for cross-modal search and retrieval. We also show that these jointly learnt embeddings outperform embeddings learned from either modality alone. Our results thus lay the groundwork for a cross-modal Audio Search Engine that permits searching through ad-hoc recordings with either text or audio queries.
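
The idea of a shared latent space can be illustrated with a minimal sketch. It is not the method from the talk: the projection weights, feature vectors, and clip names below are hypothetical and hand-picked, whereas a real system would learn both projections jointly from paired audio and text. The sketch only shows that once both modalities land in the same space, a text query and an audio query can be ranked against the same index with one similarity function.

```python
import math

def project(features, weights):
    """Linear map into the shared space: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical projection weights, chosen by hand for illustration only.
# In a trained joint-embedding model these would be learned parameters.
AUDIO_TO_SHARED = [[1.0, 0.0, 0.0],
                   [0.0, 1.0, 1.0]]   # 3-D audio features -> 2-D shared space
TEXT_TO_SHARED = [[1.0, 0.0],
                  [0.0, 1.0]]         # 2-D text features  -> 2-D shared space

def search(query_embedding, index):
    """Rank indexed clips by similarity to the query in the shared space."""
    scored = [(cosine(query_embedding, emb), clip_id)
              for clip_id, emb in index.items()]
    return [clip_id for _, clip_id in sorted(scored, reverse=True)]

# Toy audio features for two made-up clips.
clips = {
    "dog_bark.wav": [0.9, 0.1, 0.0],
    "rain.wav": [0.1, 0.5, 0.4],
}
index = {clip_id: project(f, AUDIO_TO_SHARED) for clip_id, f in clips.items()}

# A text query and an audio query, both mapped into the same shared space.
text_query = project([1.0, 0.1], TEXT_TO_SHARED)
audio_query = project([0.0, 0.6, 0.5], AUDIO_TO_SHARED)

text_results = search(text_query, index)
audio_results = search(audio_query, index)
print(text_results)
print(audio_results)
```

Because the index stores only shared-space vectors, supporting a new query modality in this sketch would mean adding one more projection rather than building a separate index.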

See more at microsoft.com/en-us/research/v...