Synchronized Audio-Visual Generation with a Joint Generative Diffusion Model and Contrastive Loss

Published November 10, 2023, 22:43
Speakers: Ruihan Yang
Host: Sebastian Braun

The rapid development of deep learning techniques has led to significant advances in multimedia generation and synthesis. However, generating coherent, temporally aligned audio and video content remains challenging due to the complex relationships between visual and auditory information. In this work, we propose a joint generative diffusion model that addresses this challenge by generating video and audio content simultaneously, enabling better synchronization and temporal alignment. Our approach is based on guided sampling, which allows more flexibility in conditional generation and improves the overall quality of the generated content. Furthermore, we introduce a joint contrastive loss, inspired by prior work that has successfully employed contrastive losses in conditional diffusion models; incorporating this loss improves both quality and temporal alignment. Through extensive evaluations using both subjective and objective metrics, we demonstrate the effectiveness of the proposed joint generative diffusion model in generating high-quality, temporally aligned audio and video content.
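The talk abstract does not spell out the exact form of the joint contrastive loss; a common formulation for pairing two modalities is a symmetric InfoNCE objective over a batch of matched video/audio embeddings. The sketch below is a minimal PyTorch illustration under that assumption; the function name `joint_contrastive_loss`, the `temperature` value, and the embedding shapes are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def joint_contrastive_loss(video_emb: torch.Tensor,
                           audio_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss over paired video/audio embeddings.

    video_emb, audio_emb: (batch, dim) embeddings from the two branches.
    Pairs at the same batch index are positives; all other pairs in the
    batch serve as negatives.
    """
    # L2-normalize so the dot product is cosine similarity.
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds positive pairs.
    logits = v @ a.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)

    # Contrast in both directions (video -> audio and audio -> video).
    loss_v2a = F.cross_entropy(logits, targets)
    loss_a2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2a + loss_a2v)
```

In a joint diffusion setting, a term like this would typically be added to the denoising objective (or used to guide sampling) so that the audio and video branches are pushed toward temporally consistent representations.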

Learn more: microsoft.com/en-us/research/v...