This paper summarizes the USC Image Understanding research projects and provides references to more detailed sources of information. Our work has focused on the topics of 3-D vision (including range data processing, stereo, shape from contour and object recognition), aerial image analysis, motion analysis (including 3-D motion and structure estimation, visual guidance for mobile robots, and an integrated motion system), and parallel processing (including mapping algorithms onto specific or flexible architectures, and processor-time trade-offs).
This paper summarizes our research projects over the last year. Much of this work is described in detail in the other papers in this proceedings, with this overview giving only a brief description of the detailed efforts. Work that is less complete will be described in somewhat more detail in this overview since there is no corresponding paper in the proceedings.
Three-dimensional vision is needed for many tasks including those of manufacturing, robotics and outdoor object recognition. To achieve these goals, we must develop a representation formalism that is rich enough to describe complex objects in terms of both volumes and surfaces and which allows us to convert between them. We must also develop techniques to compute these descriptions from real data which includes shadows and noise and we must recognize such objects from a large database of objects. Three-dimensional vision is the largest research area funded under this contract. Within that area, we have a variety of projects:
· Description of 3-D objects: We are studying the problem of generating 3-D surface descriptions from range data using the concept of deformable models. This work is described in [Liao & Medioni 1993]. A second project using range data is exploring the integration of surface descriptions from different viewpoints and is discussed in detail in [Chen & Medioni 1993]. A third effort addresses the problem of recovering segmented, hierarchical volumetric descriptions from range data.
· Perceptual Grouping: Most high level vision algorithms require perfect data as input, but it is impossible to generate such features with low level algorithms such as edge detectors. We are working on bridging this gap by transforming an edge image into a saliency map. This approach uses a non-iterative method based on a field associated with each edge. This field encodes the notions of simplicity, curvature constancy and co-curvilinearity. A detailed report on this effort is given in [Guy & Medioni 1993].
· 3-D Shape from Monocular Images: In this project, we are developing techniques for inferring 3-d shape descriptions given only a single image of the scene. We have developed techniques to use our earlier theoretical results on real images where contours are likely to be fragmented and distracting contours such as markings and shadows are present. We report on these efforts in [Zerroug & Nevatia 1993a and b].
Institute for Robotics and Intelligent Systems
University of Southern California
Los Angeles, California 90089-0273
We have a number of projects in the area of motion analysis, with autonomous navigation providing the context for most of the work, though these techniques have a much broader utility.
· Integrated system for Motion: We have developed an integrated system that includes hierarchical feature extraction and matching and feedback of 3-D motion estimation to the feature matching process. The system is able to tolerate errors and differences in the feature extraction and matching process by removing these inconsistent feature points from the later analysis. This is described in [Kim & Price 1993].
· Mobile Platform: We use the domain of autonomous navigation to unify our motion work. To this end we have a small project in vision based navigation with a trinocular stereo system for reliable 3-D descriptions of the environment. The recent results for this effort are briefly given later in Section 3. More details can be found in [Kim & Nevatia 1993].
Our work in aerial image analysis consists of two major components. First is the transfer of technology funded by DARPA to the RADIUS program. We also continue on our long range effort of analyzing complex cultural domains. Our recent work in extraction of buildings is given in [Huertas et al. 1993] and [Chung & Nevatia 1992b]. In the domain of large commercial airports we have shown good results on the detection of runways and taxiways [Huertas et al. l990]. Recently we have ported this system from the older Symbolics version to the Sun platform and assisted another DARPA funded group at USC in the reimplementation of these algorithms on a Prolog machine. This airport analysis work also forms the basis for a new effort in using the Loom knowledge representation system for vision tasks.
We have begun a project in using a standard knowledge representation system (the Loom system developed at USC-ISI). This system will be applied to our earlier airport analysis system. This work is briefly described in Section 5.
We are investigating parallel implementations of various vision algorithms developed in our group and elsewhere. We have studied algorithms for stereo and image matching, graph algorithms. This involves implementing such algorithms on existing architectures. The recent work on implementing scalable data parallel geometric hashing form matching is discussed in more detail in [Khokhar & Prasanna 1993]. In previous work, we have also explored the advantages of using flexible architectures [Reinhart 1991].
We have developed systems for building models from unregistered multiple range images [Parvin & Medioni1992, Chen & Medioni 1992]. The latter system integrates views at the triangulated surface level rather than at the pixel level. A triangulated surface model can represent a variety of solid objects, and theoretically to any kind of resolution. They are not ideal representations for high level vision tasks, such as recognition, because, first, the representation is still low level, second, it is sensitive to many parameters, and therefore unstable. However, we think it is a good intermediate representation for integration and for building high level description through surface interpolation from triangulation. Figure 1 shows the multi-view integration result for a complex object.
A second project in range analysis involves the use of deformable surfaces to generate a 3-D approximation of range data. This work, performed by C. Liao and G. Medioni, builds on our earlier work in "B-Snakes." The user provides an initial simple surface, such as a cube, which is subject to internal forces (describing implicit continuity properties such as tension and bending) and external forces which attract it toward the data points. The problem is cast in terms of energy minimization. We solve this non-convex optimization problem by using the well known Powell algorithm which guarantees convergence to a (possibly local) extremum and does not require gradient information. The variables are the positions of the control points. The number of control points is adaptively controlled. This methodology leads to a reasonable complexity and good numerical stability. We also provide a novel solution to the problem of subdividing a patch when the fit is bad. We show results on real range images to illustrate the applicability of our approach. The advantages of this approach are that it provides a compact representation of the approximated data, and lends itself to applications such as non-rigid motion tracking and object recognition. Currently, our algorithm gives only a C0 continuous analytical description of the data, but due to the flexibility of our adaptive approach it should be upgraded to C1 or C2 easily. Figure 2 shows an example of the extraction of the surface description from the range data.
We address the problem of recovering segmented hierarchical volumetric descriptions of three dimensional shapes. In an earlier work [Rom & Medioni 1992][Rom & Medioni 1993], we have suggested a method (using SLS) for obtaining hierarchical axial descriptions of planar shapes, together with a decomposition of the shapes into their parts. Unfortunately, it is not straightforward to extend these methods to handle three dimensional shapes. This is because in the three dimensional space the SAT and SLS axes are, in general, not curves, but surfaces, leading to unnatural descriptions [Nackman 1985].
Given the 3-D surface data, either from a CAD model, or from registered range images [Chen & Medioni 1992], or from a single range image, we first recover the parabolic curves on the surface. This requires the evaluation of the sign of the Gaussian curvature of the surface patches. It has been shown that this process is stable and reliable [Besl & Jain 1986] [Ponce & Brady 1987] [Fan et al. 1989]. The parabolic curves could be either on the surface of the individual parts, or on the border of the "glue" between parts. Note, that due to the transversality principle [Guillemin & Pollack 1974], there is almost always an anticlastic (negative Gaussian curvature) region between convex parts when they are joined. The parabolic curves on the parts we consider could be either meridians or cross sections of the SHGC and PRCGC parts (this has been shown for SHGCs [Ponce et al. 1989] and we have proven it for PRCGCs). Using simple tests we can hypothesize (or in many cases determine) the role of each parabolic curve. We can therefore segment the object into parts, and based on the properties of the specific parts, we can recover the axis of the parts from the meridians and cross sections.
One problem which remains is that some parts cannot be found until some other parts are removed. As in [Rom & Medioni 1992] and [Rom & Medioni 1993], we take an hierarchical strategy, in which, at each step, well defined parts are described and removed. Once these parts are removed, the next level parts can now be described. This process is efficient and produces a decomposition of the shape into its intuitive parts with a stable axial description of these parts.
2.2 Perceptual Grouping
We have started to develop a system for perceptual grouping based on the properties of collinearity and co-curvilinearity of partial contours [Guy & Medioni 1993]. The implementation is obtained by initiating an oriented vector field at each site detected by a low level detector. Structures which verify our constraints reinforce each other, and dominate other arrangements. Figure 3 shows a typical input, and the salient structures detected
In continuation of this work, we are also studying the class of curved generalized cylinders with circular but changing cross-sections. In this case, we are unable to find invariants for the visible boundaries. However, we are able to find good quasi-invariants that show that commonly used ribbon descriptions are in fact, stable for such objects. Our future work will focus on compound objects that combine a number of primitives that we have analyzed in the past.
3 MOTION ANALYSIS
We have advocated the use of multi-frame analysis for several years. In previous work, we developed a motion and structure estimation system based on so called chronogeneous motion, which includes uniform acceleration and constant angular velocity rotation and translation as special cases [Franzen 1991, Franzen 1992].
In recent work using the motion estimation system, we have been building an integrated system that includes hierarchical feature extraction and matching and feedback of the Franzen 3-D motion estimation results to the feature matching process [Kim & Price 1993]. This enables the system to tolerate errors and differences in the feature extraction and matching processes by removing these inconsistent feature points from the later analysis. Figure 5 shows the first and last frames of an indoor image sequence (acquired from the University of Massachusetts) along with the reconstructed image-plane trajectories based on the 3-D analysis and the reconstructed top-down view of the scene.
We have continued with our robot project using trinocular imagery for guidance. We are investigating robot navigation for situations where only generic maps are available, with one of the tasks being the generation of more complete maps. The visual navigation uses three views to improve the performance of the stereo system, both in speed and accuracy of the matching. Rather than producing a complete depth map we are concerned only with producing a "squeezed 3-D map" that shows corridor walls and obstacles in the hallway. Figure 6 shows the input three images for a hallway scene along with the resulting 3-D map and the planned path.
Most of our effort in this area is in support of the RADIUS program and is largely funded under a different contract though much of the basic technology has derived from our basic IU research effort. We have focused on the problem of 3-D object detection and description, particularly the buildings. Various RADIUS experiments with image analysts clearly illustrate the central role of buildings in a site.
In work by M. Bejanin and G. Medioni, we have considered the problem of finding the geometric transformation between two perspective views of the same scene and the determination of epipolar lines, given a set of matched control points. Many methods exist and divide in two groups: linear methods and non-linear methods [Horn 1991]. Linear methods work with matrix operations, using the essential matrix that was introduced by Longuet-Higgins in 1981 [Hartley 1992] [Longuet-Higgins 1981]. We have evaluated the performance of both types of methods using aerial images. We have also produced extensive comparisons between the two types of methods by testing them on synthetic random data. The main result of this comparative study is that the non-linear method seems to be more robust and reliable in the presence of noise. Also, the linear method has many degenerate cases where the transformation cannot be computed; among these is the "perfect" stereo case where both optical axis are parallel and perpendicular to the baseline. Finally, the result obtained on aerial scenes is not very accurate, as these scenes are nearly planar.
4.2 Stereo for Buildings
The problem of building detection and description is difficult for a number of reasons. Aerial images tend to be highly complex with even simple buildings having many architectural details, surrounding trees and vehicles, nearby and aligned roads etc. These cause the low-level segmentation to produce highly fragmented results. Also, 3-D information is not explicit in 2-D images, but must be inferred somehow. We have developed two systems, one using stereo analysis, the other using shadow analysis to overcome both problems (of fragmentation and 3-D description).
4.3 Use of Groupings and Shadows
Another system we have developed uses only a single image, but uses shadows to verify presence of a building and estimate its height. This system is currently designed for overhead views; we are in the process of extending it for oblique views. Basically, the approach is to use perceptual grouping to hypothesize likely building-like structures. We assume that the buildings are rectangular or composition thereof. Thus, the hypothesis generation phase consists of forming rectangular hypotheses from fragmented line segments. We select among these hypotheses based on some geometrical analysis of overlap, containment and strength of the evidence forming the hypotheses. The selected hypotheses are then verified by examining whether they cast shadows in the appropriate ways. Figure 9 shows an example. Note that this building contains a number of parallel structures on top of the roof, making the hypotheses formation and selection process particularly difficult. Nonetheless, our system produces good results. More examples and a more complete description is given in another paper in these proceedings [Huertas et al. 1993].
We have begun a new project in using standard knowledge representation technology for Image Understanding. Loom, a knowledge representation system developed at USC-ISI, provides a high level programming interface for Lisp and an environment for knowledge based system construction [MacGregor & Burstein 1991]. Since it is built on Lisp, it can be easily incorporated into our Lisp based environment and should be compatible with the Image Understanding Environment developments (which did not address the knowledge base aspects of image analysis). In past years we developed a system that uses domain knowledge to simplify the task of describing a scene [Huertas et al. 1990]. This original system encoded the knowledge about airports in the programs.
Our first goal is to evaluate the capabilities of Loom for high level Image Understanding and then to use Loom for generic high level descriptions of complex objects, and to use these descriptions for building other analysis systems.
6 PARALLEL PROCESSING
We have studied parallel implementations of several high-level algorithms, such as relaxation labelling and graph matching. Our recent work has looked at the problem of geometric hashing, which is used for a variety of matching problems [Stein & Medioni 1992]. In earlier parallel implementations the number of processors was independent of the size of the scene but depended on the size of the model database. In this work we have designed new parallel algorithms for both the MasPar and Connection Machine architectures which improve on the number of processors and improve the overall performance. Details of this work are given in [Khokhar & Prasanna 1993].
This paper represents the work of many different current and former students, visitors, staff members and associated faculty. Y. Chen is working on the integration from multiple views. C. Liao is working on deformable models. H. Rom is working on segmented volumetric 3-D descriptions. G. Guy is working on perceptual grouping. M. Zerroug is working on shape analysis from monocular contours. R. Chung completed a thesis on hierarchical features in stereo including detecting buildings in stereo images. F. Stein completed a thesis on matching using structural indexing. Y. Kim is working on the integrated motion system. D. Kim is working on the mobile platform and trinocular vision. M. Bejanin is working on the stereo calibration problem. A. Huertas and C. Lin are working on the aerial image analysis. A. Khokhar and V. Prasanna are working on the analysis of parallel algorithms.
[Besl & Jain 1986] P. J. Besl and R. C. Jain, "Invariant Surface Characteristics for 3D Object Recognition in Range Images," Computer Vision Graphics and Image Processing, Vol. 33, pages 33-80, 1986.
[Chen & Medioni 1992] Y. Chen and G. Medioni, "Object Modeling by Registration of Multiple Range Images," Image and Vision Computing, Vol. 10, No. 3, pages 145--155, April 1992.
[Chen & Medioni 1993] Y. Chen and G. Medioni, "Surface Level Integration of Multiple Range Images," in Proc. DARPA Image Understanding Workshop, Washington, DC, April 1993
[Chung & Nevatia 1992] C.-K. R. Chung and R. Nevatia. Recovering LSHGCs and SHGCs from stereo. In Proceedings of the DARPA Image Understanding Workshop, pages 401--407, San Diego, CA, January 1992.
[Chung & Nevatia 1992b] C.-K. R. Chung and R. Nevatia, "Recovering Building Structures from Stereo," In Proceedings of the IEEE Workshop on Applications of Computer Vision, pages 64-73, Palm Springs, CA, December 1992.
[Chung 1992] C.-K. R. Chung, Deriving 3-D Shape Descriptions from Stereo Using Hierarchical Features, Ph.D. Thesis, University of Southern California, November 1992.
[Fan et al. 1989] T. J. Fan, G. Medioni, and R. Nevatia, "Recognizing 3-D Objects Using Surface Descriptions," IEEE Trans. on PAMI, Vol. 11, No. 11, Nov. 1989, pages 1140-1157.
[Franzen 1991] W. Franzen, "Structure and Motion from Uniform 3D Acceleration," In Proceedings of the Workshop on Visual Motion, pages 14--20. IEEE Computer Society, Oct 7--9, 1991.
[Franzen 1991] W. Franzen, Structure from Chronogeneous Motion, Ph.D. thesis, University of Southern California, May 1991, IRIS Technical Report 266.
[Franzen 1992] W. Franzen "Structure from Chronogeneous Motion: A Summary," In Proceedings of the DARPA Image Understanding Workshop, San Diego, CA, January 1992.
[Guillemin & Pollack 1974] V. Guillemin and A. Pollack, Differential Topology, Prentice Hall, 1974.
[Guy & Medioni 1992] G. Guy and G. Medioni, "Perceptual Grouping Using Global Saliency Enhancing Operators," In Proceedings of the International Conference on Pattern Recognition, pages 99--104, The Hague, Holland, August 1992.
[Guy & Medioni 1993] G. Guy and G. Medioni, "Inferring Global Perceptual Contours from Local Features," in Proceedings of the DARPA Image Understanding Workshop, Washington, DC, April 1993.
[Hartley 1992] R. I. Hartley, "Calibration of Cameras Using the Essential Matrix," in Proceedings of the DARPA Image Understanding Workshop, pages 911-915, San Diego, CA, January 1992.
[Horn 1991] B. K. P. Horn, "Relative orientation revisited," in Journal of the Optical Society of America A., Vol. 8, NO 10, pages 1630-1638, October 1991.
[Huertas et al. 1990] A. Huertas, W. Cole, and R. Nevatia, "Detecting Runways in Complex Airport Scenes," Computer Vision, Graphics, and Image Processing, Vol. 51, No. 2, pages 107--145, August 1990.
[Huertas et al. 1993] A. Huertas, C. Lin and R. Nevatia, "Detection of Buildings from Monocular Views of Aerial Scenes Using Perceptual Grouping and Shadows," in Proceedings of the DARPA Image Understanding Workshop, Washington, DC, April 1993.
[Kim & Nevatia 1993] D. Kim and R. Nevatia, "Indoor Navigation without a Specific Map," in Proceedings Workshop on Intelligent Systems, Pittsburgh, PA, 1993
[Kim & Price 1993] Y. C. Kim and K. Price, "Improved Correspondence in Multiple Frame Motion Analysis," In Proceedings of the DARPA Image Understanding Workshop, Washington, DC, April 1993.
[Khokhar & Prasanna 1993] A. Khokhar and V. Prasanna, "Scalable Data Parallel Geometric hashing: Experiments on MasPar MP-1 and on Connection Machine CM-5," In Proceedings of the DARPA Image Understanding Workshop, Washington, DC, April 1993.
[Koenderink 1990] J.J. Koenderink, Solid Shape, M.I.T. Press, Cambridge, MA, 1990.
[Liao & Medioni 1993] C. Liao and G. Medioni, "Adaptive Surface Approximation of a Cloud of 3D Points," USC report, submitted for publication.
[Longuet-Higgins 1981] H.C. Longuet Higgins, "A computer algorithm for reconstructing a scene from two projections," in Nature, Vol. 293, pages 133-135, September 1981.
[MacGregor & Burstein 1991] R. MacGregor and M. Burstein, "Using a Description Classifier to Enhance Knowledge Representation," IEEE Expert, Vol 6, No. 3, pages 41-46, June 1991
[Nackman 1985] L. Nackman and S. Pizer, "Three-Dimensional Shape Description Using the Symmetric Axis Transform,", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 7, No. 2, March 1985, pages 187-202.
[Parvin & Medioni 1992] B. Parvin and G. Medioni, "B-rep from Unregistered Multiple Range Images," In Proceedings of the IEEE Conference on Robotics and Automation, Nice, France, May 1992.
[Ponce &Brady 1987] J. Ponce and M. Brady, "Towards a Surface Primal Sketch," in Three Dimensional Machine Vision, T. Kanade (ed.), Kluwer Academic, New York, pages 195-239, 1987.
[Ponce et al. 1989] J. Ponce, D. Chelberg and W.B. Mann, "Invariant Properties of Straight Homogeneous Generalized Cylinders and Their Contours," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol 11, No. 9, pages 951-966, 1989.
[Reinhart 1991] C. Reinhart, Specifying Parallel Processor Architectures for High-Level Computer Vision Algorithms, Ph.D Thesis, University of Southern California, December 1991.
[Rom & Medioni 1992] H. Rom and G. Medioni, "Hierarchical Decomposition and Axial Shape Description," In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 49--55, Champaign, IL, June 1992.
[Rom & Medioni 1993] H. Rom and G. Medioni, "Hierarchical Decomposition and Axial Shape Description," IEEE Transactions on Pattern Analysis and Machine Intelligence, to appear 1993.
[Shafer 1983] S.A. Shafer, "Shadow Geometry and Occluding Contours of Generalized Cylinders," Technical Report CS-83-131, Carnegie Mellon University, May 1983.
[Stein & Medioni 1992] F. Stein and G. Medioni, "Structural Indexing: Efficient Three Dimensional Object Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol 14, No. 2, pages 125--146, February 1992.
[Ulupinar & Nevatia 1990] F. Ulupinar and R. Nevatia, "Recovering Shape from Contour for SHGCs and CGCs," In Proceedings of the DARPA Image Understanding Workshop, pages 544--556, Pittsburgh, PA, September 1990.
[Ulupinar & Nevatia 1992] F. Ulupinar and R. Nevatia, "Recovery of 3-D Objects with Multiple Curved Surfaces from 2-D Contours," In Proceedings of the DARPA Image Understanding Workshop, pages 627-633, San Diego, CA, January 1992.
[Ulupinar 1991] F. Ulupinar, Perception of 3-D Shape from 2-D Image of Contours, Ph.D. Thesis, University of Southern California, August 1991.
[Zerroug & Nevatia 1993] M. Zerroug and R. Nevatia, "Scene Segmentation and Volumetric Descriptions of SHGCs from a Single Intensity Image," In Proceedings of the DARPA Image Understanding Workshop, Washington, DC, April 1993.
[Zerroug & Nevatia 1993b] M. Zerroug and R. Nevatia, "Quasi-Invariant Properties and 3-D Shape Recovery of Non-Straight, Non-Constant Generalized Cylinders," In Proceedings of the DARPA Image Understanding Workshop, Washington, DC, April 1993.