For single image analysis, we address the inference of descriptions in terms of integrated step edges, ridges, endpoints and junctions that can be useful for higher-level processes. Standard feature detectors do not produce results of sufficient quality, mainly because of noise and because edges and junctions are not integrated. We convolve the images with a set of oriented filters and then use tensor voting to infer the most salient features based on their consistency with their neighbors. Fragmented curves are connected and junctions are formed during the final completion stage. We first evaluated our approach on synthetic benchmark datasets and then proceeded to difficult natural images that exhibit phenomena such as occlusion and texture. We also address the issues associated with figure completion, a perceptual grouping task. Endpoints and junctions play a critical role in completion by the human visual system and should be an integral part of a computational process that attempts to emulate human perception. We propose a computational framework that implements both modal and amodal completion and provides a fully automatic decision-making mechanism for selecting between them. The addition of first-order information to the original framework is crucial, since it makes the inference of endpoints and the labeling of junctions possible. We illustrate the approach on several classical inputs, producing interpretations consistent with those of the human visual system.
For stereovision, we propose an approach that addresses these difficulties within a perceptual organization framework, taking into account both binocular and monocular sources of information. Initially, matching candidates for all pixels are generated by a combination of matching techniques. These are then reconstructed in disparity space. Perceptual organization takes place in 3-D neighborhoods and, thus, does not suffer from problems associated with scanline or image neighborhoods. The assumption is that correct matches form salient coherent surfaces, while wrong matching candidates do not align to form salient structures. Surface saliency, therefore, is used as the criterion to disambiguate matches. The matching candidates that are kept are grouped into smooth layers, whose projections on both images can be used to obtain estimates of the color properties of the scene surfaces and reject inconsistent matches. Disparity hypotheses for pixels that remain unmatched are generated based on the color information of nearby layers and validated by ensuring the good continuation of the surfaces via tensor voting. Thus, information is propagated from more reliable to less reliable pixels, considering both geometric and color information. The use of segmentation based on geometric cues to infer the color distributions of scene surfaces is arguably the most significant contribution of our research. We have achieved very good results on widely used benchmark stereo pairs.
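As a rough illustration of the disambiguation step described above, the following Python sketch keeps, for every pixel, the candidate disparity that receives the most support from candidates of neighboring pixels in (x, y, d) space. It is not the tensor voting computation itself: the Gaussian-weighted support count is a simplified stand-in for surface saliency, and the function names `support` and `disambiguate` and the scale parameter `sigma` are hypothetical choices made only for this example.

```python
import math


def support(candidate, others, sigma=2.0):
    """Gaussian-weighted count of nearby candidates in (x, y, d) space.

    Assumption: this is a crude proxy for surface saliency, not the
    tensor voting saliency used in the approach described above.
    """
    x, y, d = candidate
    total = 0.0
    for ox, oy, od in others:
        if (ox, oy) == (x, y):
            # Candidates of the same pixel compete; they do not support each other.
            continue
        dist2 = (ox - x) ** 2 + (oy - y) ** 2 + (od - d) ** 2
        total += math.exp(-dist2 / sigma ** 2)
    return total


def disambiguate(candidates, sigma=2.0):
    """Keep, for each pixel (x, y), the disparity with the highest support."""
    best = {}
    for c in candidates:
        s = support(c, candidates, sigma)
        pixel = (c[0], c[1])
        if pixel not in best or s > best[pixel][1]:
            best[pixel] = (c[2], s)
    return {pixel: d for pixel, (d, _) in best.items()}


# Hypothetical data: a fronto-parallel surface at disparity 5 over a 5x5 patch,
# plus one isolated wrong candidate for the center pixel.
candidates = [(x, y, 5) for x in range(5) for y in range(5)]
candidates.append((2, 2, 12))
result = disambiguate(candidates)
```

In this toy input the wrong candidate at disparity 12 has almost no neighbors in disparity space, so the center pixel keeps the disparity of the surrounding surface.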
We also propose an approach for dense, multiple-view stereo based on the same assumption: that correct matches are consistent with each other and form the scene surfaces. Thus far, research on dense multiple-view stereo has evolved along three axes: computation of scene approximations in the form of visual hulls; merging of depth maps derived from simple configurations, such as binocular or trinocular; and multiple-view stereo with restricted camera placement. These approaches are either sub-optimal, since they do not maximize the use of available information, or cannot be applied to general camera configurations. Our approach does not involve binocular processing other than the detection of tentative pixel correspondences and does not require foreground and background segmentation. We were able to reconstruct scenes, such as the ones captured at the CMU dome, that could not have been processed by state-of-the-art, dense, true multiple-view stereo algorithms.
Finally, we propose a new implementation of the tensor voting process that, unlike the original algorithm, can be generalized to spaces with hundreds of dimensions. The advantages of the proposed approach include speed and efficiency and its applicability to a far wider range of datasets than the current state of the art. We are able to process non-flat manifolds, such as hyper-spheres, and even non-manifolds, such as datasets of varying dimensionality or with intersecting manifolds. To the best of our knowledge, this is impossible with any other method. We have obtained very good results in dimensionality estimation, local orientation estimation, geodesic distance measurement, nonlinear interpolation and function approximation. The capability to perform these tasks opens the door to a wide range of applications, such as unsupervised classification or the analysis and synthesis of motion data.
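To make the notion of dimensionality estimation more concrete, the sketch below shows one way such an estimate can be read off the eigenvalues of a tensor accumulated at a point. It rests on assumptions not stated in the text above: that the tensor encodes the local normal space, so that d dominant eigenvalues indicate d normal directions, and that the largest gap between successive eigenvalues marks d. The function name `estimate_dimensionality` is hypothetical, and the eigenvalues are taken as given rather than computed by voting.

```python
def estimate_dimensionality(eigenvalues):
    """Estimate local intrinsic dimensionality from tensor eigenvalues.

    Assumption: the tensor encodes normal space, so if the largest gap
    follows the d-th eigenvalue, the point has d normals and lies on a
    structure of dimension N - d, where N is the number of eigenvalues.
    """
    lam = sorted(eigenvalues, reverse=True)
    n = len(lam)
    # gaps[i] = lam[i] - lam[i + 1]; the last eigenvalue is compared with zero.
    gaps = [lam[i] - (lam[i + 1] if i + 1 < n else 0.0) for i in range(n)]
    d = max(range(n), key=lambda i: gaps[i]) + 1  # number of dominant normals
    return n - d


# Hypothetical eigenvalues in a 3-D space.
surface_dim = estimate_dimensionality([10.0, 0.4, 0.3])  # one dominant normal
curve_dim = estimate_dimensionality([9.0, 8.5, 0.2])     # two dominant normals
```

Under these assumptions, a single dominant eigenvalue in 3-D corresponds to a surface (dimension 2) and two dominant eigenvalues correspond to a curve (dimension 1). Because the estimate is made per point, it can differ across a dataset of varying dimensionality.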