Grounded Visual Generation

Published September 15, 2021, 18:44
Multi-modal data provides an exciting opportunity to train grounded generative models that synthesize images consistent with real-world phenomena. In this talk, I will share several of our recent efforts towards creating grounded visual generation models: (1) introducing user attention grounding for text-to-image synthesis, (2) improving text-to-image generation results with stronger language grounding, and (3) taking steps towards creating spatially grounded world models for embodied vision-and-language tasks.

Speaker: Jing Yu Koh, Google

MSR Deep Learning team: microsoft.com/en-us/research/g...