This paper presents our work on learning high-level structure from human motion sequences and its application to human figure tracking. We use a structured representation ("primitives" and their transitions) of complex motion and propose a two-step unsupervised learning approach to recover the natural "primitives" from unsegmented 3D motion-capture sequences of complex human motion. The structure recovery is done under the MDL (minimum description length) paradigm. The learnt dynamic model of human motion is then used in the CONDENSATION framework to successfully track human motion in a video sequence. Experimental results on ballet dancing sequences demonstrate that our approach works well. The learnt structure is also used to synthesize new video sequences.
Many kinds of complex motion are made up of "primitives". We try to recover this hidden structure from continuous motion-capture sequences with unsupervised learning. We assume neither homogeneity within the primitives nor a low-velocity point between two adjacent primitive instances.
The problem is analogous to the following one: you are given an article with all the white space and punctuation marks removed, and you are asked to recover the vocabulary.
The structure discovery/recovery problem is solved by finding the minimum description length (MDL) of the article/motion sequences.
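To make the text analogy concrete, here is a minimal sketch of a two-part description length for candidate segmentations of a string with its spaces removed. It is an illustration only, not the coding scheme used in this work: the cost model (a fixed number of bits per character for the lexicon, empirical-frequency codes for the tokens), the `alphabet_size` parameter, the function name `description_length`, and the toy string are all assumptions introduced here.

```python
import math
from collections import Counter


def description_length(tokens, alphabet_size=27):
    """Two-part code length in bits: lexicon cost plus token-sequence cost.

    Assumed cost model (hypothetical, for illustration only):
      - each distinct word is spelled out once, character by character,
        plus one end-of-word symbol, at log2(alphabet_size) bits each;
      - each token in the sequence costs -log2 of its empirical frequency.
    """
    counts = Counter(tokens)
    total = len(tokens)
    bits_per_char = math.log2(alphabet_size)
    lexicon_bits = sum((len(word) + 1) * bits_per_char for word in counts)
    data_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + data_bits


# Toy "article" with white space removed.
text = "thecatthedogthecatthedog"

# Hand-picked candidate segmentations of the same string.
candidates = {
    "chars": list(text),
    "words": ["the", "cat", "the", "dog"] * 2,
    "phrases": ["thecat", "thedog"] * 2,
    "whole": [text],
}

# Every candidate must reproduce the original string exactly.
assert all("".join(tokens) == text for tokens in candidates.values())

# The preferred vocabulary is the one with the shortest total description.
best = min(candidates, key=lambda name: description_length(candidates[name]))

for name, tokens in candidates.items():
    print(f"{name:8s} {description_length(tokens):7.2f} bits")
print("shortest description:", best)
```

Under this assumed cost model, splitting into single characters gives a cheap lexicon but an expensive token sequence, while keeping the whole string as one "word" does the opposite; a reusable vocabulary balances the two. The sketch only compares hand-picked candidates, whereas the actual problem requires searching over segmentations of unsegmented motion data.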
We apply the method to the arm motion of ballet dancing and obtain the following structure. Each ellipse is a "word", or motion primitive. The recovered structure corresponds very well to human knowledge of ballet (4 standard poses and the transitions among them).

Frontal sequence: AVI file is available here (386K).
45 degree side sequence: AVI file is available here (297K).
The results from the frontal sequence rendered from different viewpoints:
[Images: six renderings of the frontal-sequence tracking results from different viewpoints]
Synthesis result is here (410K).