Beyond Nouns and Verbs: Learning Visually Grounded Stories of Images and Videos using Language and Vision
Prof. Larry Davis
Abstract
This talk describes recent work on combining language and vision to
learn visually grounded contextual structures. The work focuses on
simultaneously learning visual appearance and contextual models from
richly annotated, weakly labeled datasets.
The first part of the talk addresses how linguistic annotations can be
used to constrain the learning of visually grounded models of nouns,
prepositions and comparative adjectives from weakly labeled data, and
how such visually grounded models can be utilized as contextual models
for scene analysis.
The second part concerns the learning and use of "storyline models"
for video interpretation. Storyline models go beyond pairwise
contextual models and represent higher-order constraints on activities
in space and time. Visual inference using storyline models involves
inferring the "plot" of the video (spatial/temporal plot of actions) and
recognizing individual activities in the plot. The approach is applied
and illustrated on baseball videos from the 2008 World Series.