We propose methods for generating tokens from images and show how they can be naturally incorporated as modules in the framework. If two or more images are available, the tokens correspond to pixel matches which can be reconstructed in a three-dimensional metric or projective space. Scene surfaces generate salient surfaces in this space, and thus can be inferred as salient, coherent groupings of tokens. We have achieved results in stereo that compare favorably with a large class of methodologies on standardized datasets.
The fact that such a large number of diverse algorithms achieve very similar performance confirms our belief that an inherent limit on the amount of information that can be derived from binocular cues only has been reached. To advance the state of the art in image interpretation, one should investigate the monocular case more thoroughly. The final part of this research addresses analysis of single images leading to the inference of descriptions that are a lot richer than before. To facilitate the inference of structure terminations, such as the endpoints of curves, and the labeling of junctions, first order information has been integrated to the tensor voting framework.