Movies to Geometric 3D Models: the Structure-from-Motion Problem
Abstract
I describe some of my recent results on the Structure-from-Motion problem
(SFM). Given a sequence of photographic images of a fixed 3D scene, taken by
a camera at several unknown positions and orientations, the problem is to
recover 1) a 3D geometric model of the scene (structure), and 2) the camera's
position and orientation for each image (motion).
One seeks estimates that optimally explain the image data: thus, SFM is an
optimization problem. Formally, the goal is to find the estimate of the
scene and motion minimizing the ``error'' between the data predicted by the
estimate and the actual image data. To understand the SFM problem---and to
ensure that algorithms avoid false reconstructions---one must understand the
shape of the ``error surface,'' i.e., how the error depends on the estimate.
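As a concrete illustration of the ``error'' described above, the following is a minimal sketch of one common choice, the sum of squared reprojection errors. It assumes a calibrated pinhole camera with identity intrinsics and point features already matched across images; the function names `project` and `reprojection_error` and the data layout are hypothetical, chosen only for this example, and the abstract does not specify which error measure is actually used.

```python
def project(point, rotation, translation):
    """Project a 3D scene point into a calibrated camera (identity intrinsics assumed)."""
    # Scene frame -> camera frame: X_cam = R X + t
    cam = [sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
           for i in range(3)]
    # Perspective division onto the image plane z = 1
    return (cam[0] / cam[2], cam[1] / cam[2])


def reprojection_error(points, motions, observations):
    """Sum of squared distances between predicted and observed image points.

    points:       list of 3D points (the structure estimate)
    motions:      list of (R, t) pairs, one per image (the motion estimate)
    observations: observations[k][i] is the observed 2D position of point i in image k
    """
    total = 0.0
    for (rotation, translation), observed in zip(motions, observations):
        for point, (u, v) in zip(points, observed):
            pu, pv = project(point, rotation, translation)
            total += (pu - u) ** 2 + (pv - v) ** 2
    return total
```

Viewed as a function of `points` and `motions` with `observations` held fixed, this quantity is one example of an error surface in the sense used here.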
My recent results include:
- For sequences of two images, a simple, exact expression for the
error that depends only on the camera motion. This gives a fast
optimal algorithm, since one can estimate the motion by minimizing
over the motion alone, avoiding a time-consuming minimization over
the many unknowns needed to describe the scene. I also present a
solution to the stereo, or triangulation, problem: a simple, exact
expression for the optimal estimate of the structure
given known camera motion. Finally, I demonstrate a new ambiguity in
recovering the structure by triangulation.
- An analytic model of the error surface, giving a fairly complete
understanding of the SFM problem. The model applies to planar and
nonplanar scenes, which is crucial since most 3D scenes are in effect
nearly planar. Using this model, one can show that the error surface
has no false local minima under some conditions. This analysis may be
useful in practice for checking whether a computed reconstruction is
correct.
- Multi-image algorithms that compute directly from the photographic
image data, without needing to iterate from an initial guess at the
unknowns as in previous approaches. When available, data in the form
of 3D points or lines pre-tracked over the sequence, or measurements
of the affine deformations of image patches over time, can be used
simultaneously. The approach is designed for sequences
where the camera makes small movements, e.g., hand-held video
sequences. It is simple to implement and gives results superior to
those of the Sturm/Triggs algorithm.
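
To make the triangulation problem in the first result concrete (recovering structure when the camera motion is known), here is a minimal sketch of the textbook midpoint method for two views. This is a swapped-in standard technique, not the exact optimal expression referred to above. It assumes calibrated image coordinates and the convention X_cam = R X + t; the name `triangulate_midpoint` and its helpers are hypothetical.

```python
def _rt_mul(rotation, vec):
    """Return R^T v for a 3x3 rotation given as nested lists."""
    return [sum(rotation[j][i] * vec[j] for j in range(3)) for i in range(3)]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(x1, x2, motion1, motion2):
    """Midpoint triangulation of one scene point from two views with known motion.

    x1, x2:           observed image points (u, v) in calibrated coordinates
    motion1, motion2: (R, t) pairs with X_cam = R X + t
    """
    rays = []
    for (u, v), (rotation, translation) in ((x1, motion1), (x2, motion2)):
        center = [-c for c in _rt_mul(rotation, translation)]  # camera center in scene frame
        direction = _rt_mul(rotation, [u, v, 1.0])              # viewing-ray direction
        rays.append((center, direction))
    (c1, d1), (c2, d2) = rays

    # Find ray parameters s1, s2 minimizing |(c1 + s1 d1) - (c2 + s2 d2)|^2
    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b  # vanishes when the two rays are parallel
    if abs(denom) < 1e-12:
        raise ValueError("viewing rays are (nearly) parallel; depth is undetermined")
    s1 = (b * e - c * d) / denom
    s2 = (a * e - b * d) / denom

    # Closest point on each ray, then their midpoint
    p1 = [ci + s1 * di for ci, di in zip(c1, d1)]
    p2 = [ci + s2 * di for ci, di in zip(c2, d2)]
    return [(p + q) / 2.0 for p, q in zip(p1, p2)]
```

With noise-free observations the two viewing rays intersect and the midpoint coincides with the true scene point. With noisy data the midpoint is only a heuristic estimate, which is one reason an exact expression for the optimal estimate is of interest.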