Seeing the forest for the trees
08 May 2010
Object recognition is one of the core topics in computer vision research: After all, a computer that can see isn't much use if it has no idea what it's looking at. Researchers at MIT, working with colleagues at the University of California, Los Angeles, have developed new techniques that should make object recognition systems much easier to build and should enable them to use computer memory more efficiently.
Graphic: Christine Daniloff
A conventional object recognition system, when trying to discern a particular type of object in a digital image, will generally begin by looking for the object's salient features. A system built to recognise faces, for instance, might look for things resembling eyes, noses and mouths and then determine whether they have the right spatial relationships with each other.
The design of such systems, however, usually requires human intuition: A programmer decides which parts of the objects are the right ones to key in on. That means that for each new object added to the system's repertoire, the programmer has to start from scratch, determining which of the object's parts are the most important.
It also means that a system designed to recognise millions of different types of objects would become unmanageably large. Each object would have its own, unique set of three or four parts, but the parts would look different from different perspectives, and cataloguing all those perspectives would take an enormous amount of computer memory.
In a paper to be presented at the Institute of Electrical and Electronics Engineers' Conference on Computer Vision and Pattern Recognition in June, postdoc Long (Leo) Zhu and professors Bill Freeman and Antonio Torralba, all of MIT's Computer Science and Artificial Intelligence Laboratory, and Yuanhao Chen and Alan Yuille of UCLA describe an approach that solves both of these problems at once.
Like most object-recognition systems, their system learns to recognise new objects by being "trained" with digital images of labelled objects. But it doesn't need to know in advance which of the objects' features it should look for. For each labelled object, it first identifies the smallest features it can - often just short line segments.
Then it looks for instances in which these low-level features are connected to each other, forming slightly more sophisticated shapes. Then it looks for instances in which these more sophisticated shapes are connected to each other, and so on, until it's assembled a hierarchical catalogue of increasingly complex parts whose top layer is a model of the whole object.
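The bottom-up composition described above can be illustrated with a toy sketch. This is not the authors' actual algorithm (the paper's method is far more sophisticated); it is a minimal illustration of the general idea, assuming leaf features are represented as 2-D points standing in for short line segments, and that "connected" simply means "nearest in the image":

```python
import math

def merge_closest(parts):
    """Merge the two nearest parts into one composite part centred between them.

    Each part is a tuple (centre, children); leaves have an empty children tuple.
    """
    best = None
    for i in range(len(parts)):
        for j in range(i + 1, len(parts)):
            (xi, yi), (xj, yj) = parts[i][0], parts[j][0]
            d = math.hypot(xi - xj, yi - yj)
            if best is None or d < best[0]:
                best = (d, i, j)
    _, i, j = best
    (xi, yi), (xj, yj) = parts[i][0], parts[j][0]
    merged = (((xi + xj) / 2, (yi + yj) / 2), (parts[i], parts[j]))
    return [p for k, p in enumerate(parts) if k not in (i, j)] + [merged]

def build_hierarchy(segments):
    """Repeatedly compose low-level features into larger parts.

    Returns the root of the hierarchy: a single part whose descendants
    form increasingly simple sub-parts, down to the leaf features.
    """
    parts = [(seg, ()) for seg in segments]  # leaves: the smallest features
    while len(parts) > 1:
        parts = merge_closest(parts)
    return parts[0]

def depth(part):
    """Number of layers in the hierarchical catalogue below this part."""
    _, children = part
    return 1 + max((depth(c) for c in children), default=0)
```

For example, four hypothetical segment locations `[(0, 0), (1, 0), (10, 0), (11, 0)]` first pair off into two mid-level parts, which then combine into a single top-level model of depth 3, mirroring the layered catalogue the article describes.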

