Beyond Nouns and Verbs: Learning Visually Grounded Stories of Images and Videos using Language and Vision

Prof. Larry Davis


Abstract

This talk describes recent work on combining language and vision to learn visually grounded contextual structures, focusing on simultaneously learning visual appearance and contextual models from richly annotated but only weakly labeled datasets.

The first part of the talk addresses how linguistic annotations can be used to constrain the learning of visually grounded models of nouns, prepositions and comparative adjectives from weakly labeled data, and how such visually grounded models can be utilized as contextual models for scene analysis.
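To make the idea concrete, here is a minimal sketch, not the talk's actual system, of how a grounded model of a preposition can act as a pairwise contextual constraint when assigning the nouns of a weak caption (e.g., "sky above water") to candidate image regions. The regions, appearance scores, and the simple centroid-based model of "above" are all hypothetical.

from itertools import permutations

# Hypothetical candidate regions: centroid (x, y; y grows downward) and
# appearance likelihoods for each caption noun.
regions = {
    "r1": {"centroid": (0.5, 0.2), "appearance": {"sky": 0.6, "water": 0.5}},
    "r2": {"centroid": (0.5, 0.8), "appearance": {"sky": 0.4, "water": 0.7}},
}

def above_score(region_a, region_b):
    # Grounded model of "A above B": larger when A's centroid is higher in the image.
    ya, yb = region_a["centroid"][1], region_b["centroid"][1]
    return max(0.0, yb - ya)

def score_assignment(assignment, relation):
    # Appearance likelihoods of the two noun-to-region choices times the relation score.
    noun_a, _, noun_b = relation
    ra, rb = regions[assignment[noun_a]], regions[assignment[noun_b]]
    appearance = ra["appearance"][noun_a] * rb["appearance"][noun_b]
    return appearance * above_score(ra, rb)

caption_relation = ("sky", "above", "water")
nouns = [caption_relation[0], caption_relation[2]]

# Enumerate assignments of the two nouns to distinct regions and keep the one
# that best satisfies both appearance and the prepositional constraint.
best = max(
    (dict(zip(nouns, perm)) for perm in permutations(regions, len(nouns))),
    key=lambda a: score_assignment(a, caption_relation),
)
print(best)  # {'sky': 'r1', 'water': 'r2'}

In this toy setting the prepositional relation resolves an assignment that appearance alone would leave ambiguous, which is the sense in which the grounded models serve as contextual models for scene analysis.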

The second part concerns the learning and use of "storyline models" for video interpretation. Storyline models go beyond pairwise contextual models and represent higher-order constraints on activities in space and time. Visual inference with a storyline model involves inferring the "plot" of the video, i.e., the spatial and temporal arrangement of its actions, and recognizing the individual activities within that plot. The approach is illustrated on baseball videos from the 2008 World Series.
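As a rough illustration of plot inference, the sketch below reduces a storyline model to a first-order graph of allowed action transitions and decodes the most likely action sequence with Viterbi-style dynamic programming. The actions, per-segment classifier scores, and transition weights are invented for the example; the talk's actual storyline models are richer than this.

import math

actions = ["pitch", "hit", "run", "catch"]

# Hypothetical storyline constraints: which action may follow which, with weights.
transitions = {
    "pitch": {"hit": 0.6, "catch": 0.4},
    "hit": {"run": 0.7, "catch": 0.3},
    "run": {"catch": 1.0},
    "catch": {"pitch": 1.0},
}

# Hypothetical per-segment action likelihoods from visual classifiers.
segment_scores = [
    {"pitch": 0.8, "hit": 0.1, "run": 0.05, "catch": 0.05},
    {"pitch": 0.2, "hit": 0.6, "run": 0.1, "catch": 0.1},
    {"pitch": 0.1, "hit": 0.2, "run": 0.6, "catch": 0.1},
]

def infer_plot(scores, trans):
    # Viterbi-style decoding: for each segment, track the best-scoring plot
    # (log-probability, action path) that ends in each action.
    best = {a: (math.log(scores[0][a]), [a]) for a in actions}
    for seg in scores[1:]:
        new_best = {}
        for a in actions:
            new_best[a] = max(
                (lp + math.log(trans[prev].get(a, 1e-9)) + math.log(seg[a]), path + [a])
                for prev, (lp, path) in best.items()
            )
        best = new_best
    return max(best.values())[1]

print(infer_plot(segment_scores, transitions))  # ['pitch', 'hit', 'run']

Even in this toy form, decoding against the transition graph yields both the overall plot (the action sequence) and the labels of the individual segments jointly, mirroring the joint inference described above.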

