Rating is available when the video has been rented.
This feature is not available right now. Please try again later.
Published on Jun 23, 2016
Tell, Explain, Locate, and Answer: Relating Natural Language and Visual Recognition
Abstract: Language is the most important channel for humans to communicate about what they see. To allow an intelligent system to effectively communicate with humans it is thus important to enable it to relate information in words and sentences with the visual world. For this a system should be compositional, so it is e.g. not surprised when it encounters a novel object and can still talk about it. It should be able to explain in natural language, why it recognized a given object in an image as certain class, to allow a human to trust and understand it. However, it should not only be able to generate natural language, but also understand it, and locate sentences and linguistic references in the visual world. In my talk, I will discuss how we approach these different fronts by looking at the tasks of language generation about images, visual grounding, and visual question answering. I will conclude with a discussion of the challenges ahead.