Microsoft Research · 333K
Published July 9, 2021, 0:37
𝗧𝗶𝘁𝗹𝗲: Recent Advances in Image Captioning, Image-Text Retrieval and Visual Question Answering using Scene Graph Parsing, What Next?
𝗦𝗽𝗲𝗮𝗸𝗲𝗿: Hamid Palangi
𝗗𝗮𝘁𝗲: July 9, 2019
Creating an appropriate representation of data has been key to many recent breakthroughs in both language and vision. In natural language processing, the trend has run from structured representations such as parse trees to BERT and Transformers pretrained on large-scale data. In computer vision, the trend has been slightly different: from the scale-invariant feature transform (SIFT), to CNNs pretrained on large-scale data, and back to more structured representations of images using scene graphs. Building models that parse a scene into a graph poses unique challenges, which have led to the task of Scene Graph Generation (SGG) and its subtasks, ranging from object detection to scene graph classification and scene graph detection. An orthogonal challenge is how effective the generated scene graphs are in downstream language and vision tasks that can benefit from these pretrained models. In this talk, we present our recent work on pretraining large-scale SGG models, along with two new models that exploit them, which has resulted in significant improvements on the downstream tasks of image captioning and image-text retrieval. We further present the challenges and opportunities ahead for SGG and for new downstream tasks such as visual question answering.
𝗦𝗹𝗶𝗱𝗲𝘀: microsoft.com/en-us/research/u...
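As a rough illustration of the structured image representation the abstract refers to, a scene graph can be sketched as a set of detected objects (nodes) connected by labeled relations (edges). The class names, fields, and example image content below are hypothetical; they are not taken from the talk or from any SGG library.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SceneObject:
    """A detected object: a class label plus a bounding box (x1, y1, x2, y2)."""
    label: str
    box: tuple[float, float, float, float]


@dataclass
class SceneGraph:
    """Objects are nodes; relations are (subject_index, predicate, object_index) edges."""
    objects: list[SceneObject] = field(default_factory=list)
    relations: list[tuple[int, str, int]] = field(default_factory=list)

    def add_object(self, label: str, box: tuple[float, float, float, float]) -> int:
        """Store an object and return its node index."""
        self.objects.append(SceneObject(label, box))
        return len(self.objects) - 1

    def add_relation(self, subj: int, predicate: str, obj: int) -> None:
        """Add a directed, labeled edge between two existing nodes."""
        self.relations.append((subj, predicate, obj))

    def triples(self) -> list[tuple[str, str, str]]:
        """Return relations as human-readable (subject, predicate, object) labels."""
        return [
            (self.objects[s].label, p, self.objects[o].label)
            for s, p, o in self.relations
        ]


# Hypothetical image content, invented for illustration only.
graph = SceneGraph()
man = graph.add_object("man", (10.0, 20.0, 110.0, 220.0))
horse = graph.add_object("horse", (60.0, 90.0, 260.0, 300.0))
hat = graph.add_object("hat", (40.0, 10.0, 80.0, 40.0))
graph.add_relation(man, "riding", horse)
graph.add_relation(man, "wearing", hat)

print(graph.triples())
```

Under this reading, object detection would correspond to producing the nodes, while the scene graph subtasks would also involve predicting the labeled edges; a downstream captioning or retrieval model could then consume the triples rather than raw pixels.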