PhD Thesis Defense

Somboon Hongeng


Automatic generation of descriptions of video data has been one of the most active research areas in recent years. Many current video analysis systems describe a video stream in terms of layers of background and foreground objects, together with their motion, providing a compact representation of videos. However, such a representation lacks knowledge of the video content, which is the most crucial information in many applications such as video annotation, semantic-based video summarization, video surveillance and advanced human-machine interfaces.

To understand the content of videos, a computer vision system must be capable of bridging the gap between the dynamic pixel-level information of image sequences and high-level event descriptions. First, objects in the scene must be detected and recognized from the video. Finding and recognizing objects amid cluttered image features with noise, shadows and occlusion is one of the most challenging problems in computer vision. Second, a description of actions involving individual objects and of the global situation must be produced in some representation scheme. Determining an event representation suitable for machine perception is one of the key issues: the representation should be generic enough to model a variety of event types, yet accurate enough to allow similar events to be discriminated. As in other pattern recognition tasks, interesting events (i.e., those that match the event models) must be segmented from continuous video streams. This is particularly difficult because of the uncertain nature of both the input data (i.e., detection and tracking noise) and the event models. While different events may be similar in appearance, the same event may appear different depending on how it is executed. The large variation in the time scales of some events makes them harder to segment than patterns in many other recognition tasks.

In this thesis, we show that automatic event understanding from video streams can be achieved for a large class of events based on the observation of the trajectories and shapes of objects. We propose a new formalism in which events are described by a hierarchical representation consisting of image features, mobile object properties and event scenarios. Taking image features of tracked moving regions in an image sequence as input, mobile object properties are first computed by dedicated methods while noise is suppressed by statistical methods. Events are viewed as consisting of single or multiple threads. In a single-thread event, the relevant actions occur along a linear time scale; they are recognized from mobile object properties using Bayesian networks and stochastic finite automata. In a multiple-thread event, several event threads are related by logical and temporal constraints; these constraints are verified in a probabilistic framework in which all possible durations are considered to obtain an optimal estimate. An Event Recognition Language is proposed for describing these events in a natural way. This design reflects our intention both to describe events at the symbolic level and to provide optimal recognition of these events from low-level visual facts, in the presence of the variations in event execution style and tracking noise inherent in the analysis of real image sequences.
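To make the single-thread case concrete, the sketch below shows how a stochastic finite automaton (here, a simple left-to-right model with a forward pass) can score how well a sequence of noisy per-frame observations matches an ordered sequence of sub-events. This is an illustrative toy, not the thesis's implementation: the state structure, transition probability, and observation likelihoods (which in the thesis would come from Bayesian networks over mobile object properties) are all assumptions.

```python
# Hypothetical sketch: single-thread event recognition with a
# left-to-right stochastic finite automaton. States, probabilities
# and observations are illustrative, not taken from the thesis.

def recognize(obs_likelihoods, stay_prob=0.6):
    """Forward pass over a left-to-right automaton.

    obs_likelihoods: one likelihood vector per frame, with one entry
    per sub-event state (e.g. output of a Bayesian network applied to
    mobile object properties such as speed and direction).
    Returns the (unnormalized) probability that the final sub-event
    state is active at the last frame, i.e. that the whole event
    has been observed in order.
    """
    n_states = len(obs_likelihoods[0])
    # The event starts in its first sub-event state.
    alpha = [0.0] * n_states
    alpha[0] = obs_likelihoods[0][0]
    for likes in obs_likelihoods[1:]:
        new = [0.0] * n_states
        for s in range(n_states):
            # Either remain in state s or advance from state s-1;
            # backward transitions are forbidden (left-to-right model).
            p = alpha[s] * stay_prob
            if s > 0:
                p += alpha[s - 1] * (1.0 - stay_prob)
            new[s] = p * likes[s]
        alpha = new
    return alpha[-1]
```

For example, a two-state event such as a hypothetical "approach, then stop" scenario scores higher on frames whose likelihoods follow the state order (`[[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.1, 0.9]]`) than on the same frames reversed, which is exactly the discrimination between similar-looking but differently ordered events that the representation is meant to provide.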
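For the multiple-thread case, the constraint checking can be pictured as classifying the temporal relation between the intervals over which each thread is recognized. The fragment below is a deterministic toy in that spirit: the interval endpoints, the reduced relation set (a few of Allen's interval relations), and the example scenario are illustrative assumptions, and it omits the probabilistic treatment of durations described in the abstract.

```python
# Hypothetical illustration of verifying a temporal constraint between
# two recognized event threads. Intervals, the relation names and the
# example scenario are illustrative, not from the thesis.

def relation(a, b):
    """Classify the temporal relation between intervals a = (start, end)
    and b = (start, end), using a small subset of Allen's relations."""
    a_start, a_end = a
    b_start, b_end = b
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_start >= b_start and a_end <= b_end:
        return "during"
    return "overlaps"

def threads_consistent(interval_a, interval_b, required):
    """True if the two threads satisfy the required temporal relation,
    e.g. a hypothetical 'car stops' thread occurring before a
    'person exits' thread."""
    return relation(interval_a, interval_b) == required
```

In a probabilistic setting such as the one the abstract describes, this boolean check would instead be evaluated over all candidate durations of each thread, weighting each by its recognition likelihood.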

Maintained by Philippos Mordohai