Efficient Articulated Pose Estimation from a Single Image, a Stereo Pair, or a Sequence


We address the estimation of human poses in single images and in sequences. This is an important problem with applications in human-computer interaction, security and surveillance, image understanding, and motion capture.
Real-time performance and robustness are central concerns in the design of this system. To keep up with the demands of an interactive system, we combine efficient limb/forearm tracking modules with more detailed pose tracking modules.
Our 2D limb tracking scheme uses skin color and edge information. Multiple 2D limb models are used to enhance tracking of the underlying 3D structure; these include models for lateral forearm views (waving) as well as for pointing gestures.
In our single view pose tracking framework, we first find candidate 2D articulated model configurations that can be upgraded into 3D postures. Candidate 2D poses are found by searching for locally optimal configurations under a weak but computationally manageable fitness function. This is accomplished by first parameterizing 2D poses by their joint locations, organized in a tree structure. Candidate configurations can then be assembled efficiently and exhaustively in a bottom-up manner. Working from the leaves of the tree to its root, we maintain a list of locally optimal, yet sufficiently distinct candidate configurations for the body pose.
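The bottom-up assembly can be sketched as follows. This is a minimal illustration, not the actual system: the kinematic tree, the per-part fitness scores, the pairwise compatibility term, and the distinctness threshold are all stand-in assumptions, and joint locations are reduced to scalars for brevity.

```python
def assemble_candidates(tree, part_scores, pair_score, k=5, min_dist=2.0):
    """Walk a kinematic tree from leaves to root, keeping at each joint
    a short list of locally optimal, mutually distinct configurations.

    tree:        dict mapping each part to a list of child parts
    part_scores: dict mapping part -> list of (location, score) candidates
    pair_score:  function(parent_loc, child_loc) -> compatibility score
    """
    def solve(part):
        # Score each candidate location of this part together with the
        # best compatible configuration of each child subtree.
        configs = []
        for loc, s in part_scores[part]:
            total = s
            assignment = {part: loc}
            for child in tree.get(part, []):
                best = max(
                    ((cs + pair_score(loc, ca[child]), ca)
                     for cs, ca in solve(child)),
                    key=lambda t: t[0],
                )
                total += best[0]
                assignment.update(best[1])
            configs.append((total, assignment))
        # Keep the top-k configurations whose locations for this part
        # are sufficiently distinct from already-kept candidates.
        configs.sort(key=lambda t: -t[0])
        kept = []
        for score, assign in configs:
            if all(abs(assign[part] - a[part]) >= min_dist for _, a in kept):
                kept.append((score, assign))
            if len(kept) == k:
                break
        return kept

    root = next(iter(tree))
    return solve(root)
```

For example, a two-part "torso plus arm" tree with a distance-penalty compatibility term yields a ranked list of distinct full-body candidates rather than a single best pose.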
We then adapt this algorithm for use on sequences of images by considering only configurations that are either near their position in the previous frame or overlap areas of interest in subsequent frames. This way, the number of partial configurations generated and evaluated is significantly reduced, and both smooth and abrupt motions can be accommodated. The approach is validated on our own test sequences and on standard datasets.
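The temporal pruning step amounts to a simple filter on candidate joint locations before assembly. The sketch below is illustrative only: the search radius, the rectangular "changed region" representation, and the function names are assumptions, not the system's actual parameters.

```python
def prune_candidates(candidates, prev_loc, changed_regions, radius=15.0):
    """Keep a candidate (x, y) location only if it lies within `radius`
    of the joint's position in the previous frame, or inside a region
    (x0, y0, x1, y1) flagged as changed (e.g. by frame differencing).
    The first test handles smooth motion; the second, abrupt motion."""
    def near_previous(p):
        dx, dy = p[0] - prev_loc[0], p[1] - prev_loc[1]
        return dx * dx + dy * dy <= radius * radius

    def in_changed_region(p):
        return any(x0 <= p[0] <= x1 and y0 <= p[1] <= y1
                   for x0, y0, x1, y1 in changed_regions)

    return [p for p in candidates if near_previous(p) or in_changed_region(p)]
```

Only the surviving candidates are fed into the bottom-up assembly, which is what cuts the number of partial configurations evaluated per frame.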

We then develop a method to automatically construct this fitness function from annotated image data. We propose a set of generic features and use real-valued AdaBoost to construct a strong detector that generalizes over multiple people with different clothing and appearances.
Finally, we consider using depth data from a stereo camera, such as the Bumblebee stereo camera from Point Grey. We track the movement of a user by parameterizing an articulated upper body model using limb lengths and joint angles. We then define an objective function that evaluates the saliency of this upper body model against a stereo depth image, and track the arms of a user by numerically maintaining the optimum using an annealed particle filter.
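One frame of an annealed particle filter can be sketched as below. The depth-based objective is abstracted as a callable, and the annealing schedule (the `betas`) and diffusion noise scales are illustrative assumptions, not the tuned values used in this work.

```python
import math
import random

def annealed_pf_step(particles, objective, betas=(0.2, 0.5, 1.0),
                     noise=(0.3, 0.15, 0.05), rng=random):
    """particles: list of parameter vectors (e.g. lists of joint angles).
    objective:   function(particle) -> log-fitness (here, the stereo
                 depth saliency would play this role).
    Runs one annealing layer per beta: weight by the sharpened
    objective, resample, then diffuse with shrinking noise."""
    for beta, sigma in zip(betas, noise):
        # Sharpened weights: small beta keeps the distribution broad,
        # beta = 1 focuses particles on the peaks of the objective.
        weights = [math.exp(beta * objective(p)) for p in particles]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Multinomial resampling.
        particles = rng.choices(particles, weights=weights, k=len(particles))
        # Diffuse each surviving particle with layer-specific noise.
        particles = [[a + rng.gauss(0.0, sigma) for a in p] for p in particles]
    return particles
```

The broad-to-sharp schedule is what lets the filter escape the shallow local optima of an articulated-body objective before committing to a peak.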
Future directions include robust candidate selection, 2D-to-3D pose inference, and feature selection in the construction of top-down image saliency metrics. In analyzing sequences, we also plan to adapt our generic models to the specific subject being tracked, in terms of both appearance and geometry. We will also extensively evaluate our tracking system using standard and test data sets, and study the effects of different depth sources, including time-of-flight sensors (e.g., Canesta).

Maintained by Qian Yu