We address the estimation of human poses in single images and
in sequences. This is an important problem that has a range of applications in
human-computer interaction, security and surveillance monitoring, image
understanding, and motion capture.
Real-time performance and robustness receive special consideration in the
design of this system. To keep up with the demands of an interactive system, we
combine efficient limb/forearm tracking modules with more detailed pose tracking
modules.
Our 2D limb tracking scheme uses skin color and edge information. Multiple 2D
limb models are used to enhance tracking of the underlying 3D structure. This
includes models for lateral forearm views (waving) as well as for pointing
gestures.
In our single view pose tracking framework, we first find candidate 2D
articulated model configurations that can be upgraded into 3D postures.
Candidate 2D poses are found by searching for locally optimal
configurations under a weak but computationally manageable fitness
function. This is accomplished by first parameterizing each 2D pose by
its joint locations organized in a tree structure. Candidate configurations can
then be assembled efficiently and exhaustively in a bottom-up manner. Working
from the leaves of the tree to its root, we maintain a list of locally optimal,
yet sufficiently distinct candidate configurations for the body pose.
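
The bottom-up assembly described above can be illustrated with a minimal
sketch. It is not the authors' implementation: the function name `assemble`,
the `unary` and `pair` scoring callbacks, the list size `k`, and the
distinctness threshold `min_dist` are all hypothetical and chosen only for
illustration.

```python
from itertools import product
import math


def assemble(tree, root, locations, unary, pair, k=3, min_dist=5.0):
    """Return up to k distinct configurations for the subtree under `root`.

    tree: dict mapping a joint to the list of its child joints.
    locations: dict mapping a joint to its candidate (x, y) positions.
    unary(joint, loc) and pair(parent, p_loc, child, c_loc) return scores
    (assumed fitness terms; higher is better).
    Each result is a (score, {joint: (x, y)}) pair.
    """
    def distinct(cfg, kept):
        # Distinct means at least one joint differs by min_dist or more
        # from every configuration already kept.
        return all(
            max(math.dist(cfg[j], other[j]) for j in cfg) >= min_dist
            for _, other in kept
        )

    def solve(node):
        children = tree.get(node, [])
        # Leaves are solved first, so work proceeds from leaves to root.
        child_lists = [solve(c) for c in children]
        candidates = []
        for loc in locations[node]:
            # Exhaustively combine this location with the kept child lists.
            for combo in product(*child_lists):
                score = unary(node, loc)
                cfg = {node: loc}
                for child, (c_score, c_cfg) in zip(children, combo):
                    score += c_score + pair(node, loc, child, c_cfg[child])
                    cfg.update(c_cfg)
                candidates.append((score, cfg))
        candidates.sort(key=lambda sc: sc[0], reverse=True)
        kept = []
        for score, cfg in candidates:
            if distinct(cfg, kept):
                kept.append((score, cfg))
            if len(kept) == k:
                break
        return kept

    return solve(root)
```

Because only `k` partial configurations survive at each joint, the number of
combinations examined at a parent is bounded by its number of candidate
locations times `k` raised to the number of children.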
We then adapt this algorithm for use on sequences of images by considering
configurations that are either near their position in the previous frame, or
overlap areas of interest in subsequent frames. This way, the number of partial
configurations generated and evaluated is significantly reduced, and both smooth
and abrupt motions can be accommodated. This work is validated on test and
standard datasets.
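
As a rough illustration of the pruning used for sequences, the sketch below
keeps a candidate joint location only if it lies near the joint's previous
position or inside an area of interest. Representing areas of interest as
axis-aligned boxes, the `radius` value, and the name `prune_candidates` are
assumptions made for this example.

```python
import math


def prune_candidates(candidates, previous, regions, radius=15.0):
    """Keep candidate locations near `previous` or inside a region.

    candidates: list of (x, y) points for one joint in the current frame.
    previous: (x, y) location of that joint in the previous frame, or None.
    regions: list of (x_min, y_min, x_max, y_max) boxes of interest
             (an assumed representation).
    """
    def in_region(p):
        return any(x0 <= p[0] <= x1 and y0 <= p[1] <= y1
                   for x0, y0, x1, y1 in regions)

    def near_previous(p):
        return previous is not None and math.dist(p, previous) <= radius

    # Smooth motion is covered by the proximity test, abrupt motion by the
    # region test.
    return [p for p in candidates if near_previous(p) or in_region(p)]
```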
We then develop a method to automatically construct this fitness function from
annotated image data. We propose a set of generic features and use real-valued
AdaBoost to construct a strong detector that generalizes over multiple people
with different clothing and appearances.
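
For readers unfamiliar with real-valued AdaBoost, a minimal sketch follows.
It uses two-bin threshold partitions of single features as weak learners,
which is an assumption made here for brevity; the generic features proposed
in this work are not reproduced, and the names `train_real_adaboost` and
`score` are hypothetical.

```python
import numpy as np


def train_real_adaboost(X, y, rounds=10, eps=1e-6):
    """Train on X of shape (n, d) with labels y in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    model = []
    for _ in range(rounds):
        best = None
        for f in range(d):
            for thr in np.unique(X[:, f]):
                side = X[:, f] > thr  # two-bin partition of feature f
                out = np.zeros(2)
                z = 0.0
                for b in (0, 1):
                    in_bin = side == bool(b)
                    w_pos = w[in_bin & (y == 1)].sum()
                    w_neg = w[in_bin & (y == -1)].sum()
                    # Real-valued output: half the smoothed log-odds of the bin.
                    out[b] = 0.5 * np.log((w_pos + eps) / (w_neg + eps))
                    z += 2.0 * np.sqrt(w_pos * w_neg)
                # Smaller z means a purer partition under current weights.
                if best is None or z < best[0]:
                    best = (z, f, thr, out)
        _, f, thr, out = best
        h = out[(X[:, f] > thr).astype(int)]
        # Increase the weight of examples the weak learner gets wrong.
        w = w * np.exp(-y * h)
        w /= w.sum()
        model.append((f, thr, out))
    return model


def score(model, X):
    """Sum the real-valued weak outputs; the sign is the predicted label."""
    total = np.zeros(X.shape[0])
    for f, thr, out in model:
        total += out[(X[:, f] > thr).astype(int)]
    return total
```

The magnitude of `score` can serve as a confidence value, which is what makes
the real-valued variant suitable for building a fitness function rather than
only a binary decision.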
Finally, we consider using depth data from a stereo camera such as
the Bumblebee stereo camera from Point Grey. We track the movement of a user by
parameterizing an articulated upper body model using limb lengths and joint
angles. We then define an objective function that evaluates the saliency of this
upper body model against a stereo depth image and track the arms of a user by
numerically maintaining the optimum using an annealed particle filter.
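
The annealing idea can be sketched as follows for a single time step. This is
a simplified illustration under assumed settings: the schedule `betas`, the
diffusion parameters `noise` and `shrink`, and the generic `objective`
callback are hypothetical, and the depth-based objective itself is not
modeled.

```python
import numpy as np


def annealed_particle_step(particles, objective, rng,
                           betas=(0.2, 0.5, 1.0, 2.0),
                           noise=0.3, shrink=0.5):
    """Run one time step of annealing over a set of pose particles.

    particles: (n, d) array of pose parameters (e.g. joint angles).
    objective: maps an (n, d) array to n scores; higher is better.
    rng: a numpy.random.Generator.
    Returns the updated particles and their mean as the pose estimate.
    """
    n = particles.shape[0]
    sigma = noise
    for beta in betas:
        scores = objective(particles)
        # Small beta keeps the weighting broad; later layers sharpen it.
        weights = np.exp(beta * (scores - scores.max()))
        weights /= weights.sum()
        # Resample in proportion to weight, then diffuse with shrinking noise.
        idx = rng.choice(n, size=n, p=weights)
        particles = particles[idx] + rng.normal(0.0, sigma, particles.shape)
        sigma *= shrink
    return particles, particles.mean(axis=0)
```

In a tracker, the particles returned for one frame would seed the next frame,
so the optimum is followed over time rather than searched for from scratch.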
Future directions include robust candidate selection, 2D to 3D pose
inference, and feature selection in the construction of top-down image saliency
metrics. In analyzing sequences, we also plan to adapt our generic models to the
specific subject being tracked in terms of both appearance and geometry. We will
also extensively evaluate our tracking system using standard and test data sets
and study the effects of different depth sources, including time-of-flight
sensors (Canesta).