MATILDA: Interactive Vision System

Overview: Today’s computer vision systems mostly tell the robot what to see based on what we humans can see well in the environment. In other words, we expect the robot to pay attention to the objects that are most visually appealing to us. An interesting question is whether we can develop a visual attention system in which the robot tells us what features it can see in its environment. Another interesting question is what happens if we provide the user and the robot with a shared semantic context, so that the robot can report its findings to the human in human terms. To explore these questions, we use a track-driven mobile robot we call Rocky.


Objective: To design an interactive vision system in which the robot learns a large set of features, each clearly identifiable by a semantic meaning, and examines the training set using Top-Down Induction of Decision Trees (TDIDT). Another main goal of this project is to enable the robot to produce a graphical depiction of the resulting tree along with one or more text reports summarizing the important information in the tree. In addition, the human and the robot should hold a dialog about the features and objects in the training data before the robot is exposed to new images, and a second dialog once results on the new images are available.


Project Description: This project is composed of several individual parts. Feature selection is the first step. The developer chooses a large set of features to be extracted from regions in an image. The main constraint on these features is that they must have clearly identifiable semantic meaning. That is, a human must be able to associate a clear human-language term with each feature, such as “dark blue”, “very rough”, or “straight line”. This is necessary for providing the user and the robot with a shared semantic context for discussing the features.
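As a rough illustration of what such semantically grounded features could look like, here is a minimal Python sketch (the project itself plans to use MATLAB and later .NET). The thresholds, color vocabulary, and function names (`semantic_color`, `semantic_roughness`) are hypothetical, not taken from the project:

```python
import numpy as np

# Hypothetical vocabulary: every numeric measurement is mapped to a
# human-language term so the robot and the user share the same words.
HUE_NAMES = ["red", "orange", "yellow", "green", "cyan", "blue", "magenta"]

def semantic_color(rgb_region):
    """Map the mean color of an RGB region (H x W x 3, values in [0, 1]) to a term."""
    r, g, b = rgb_region.reshape(-1, 3).mean(axis=0)
    v = max(r, g, b)                 # brightness (HSV value)
    c = v - min(r, g, b)             # chroma
    if c < 0.1:                      # nearly achromatic region
        return "dark gray" if v < 0.4 else "light gray"
    if v == r:                       # standard RGB-to-hue conversion
        h = ((g - b) / c) % 6
    elif v == g:
        h = (b - r) / c + 2
    else:
        h = (r - g) / c + 4
    name = HUE_NAMES[int(h / 6 * len(HUE_NAMES)) % len(HUE_NAMES)]
    return ("dark " if v < 0.4 else "") + name

def semantic_roughness(gray_region):
    """Map local intensity variation in a gray-level region to a roughness term."""
    var = float(np.var(np.diff(gray_region, axis=1)))
    if var < 0.002:
        return "very smooth"
    return "very rough" if var > 0.02 else "somewhat rough"
```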


The next step is training, in which the user selects a region in the image and labels it according to the object it contains, e.g., tree, door, etc. The user should also be able to establish basic hierarchical associations between labels.
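One possible representation of a labeled training sample and of such a label hierarchy is sketched below; the field names, the `TrainingSample` class, and the example hierarchy are illustrative assumptions rather than the project's actual data structures:

```python
from dataclasses import dataclass, field

# Hypothetical training record: the user outlines a region, names the object,
# and the semantic feature terms for that region are stored alongside the label.
@dataclass
class TrainingSample:
    image_id: str
    bbox: tuple                                   # (x, y, width, height) of the region
    label: str                                    # e.g. "tree", "door"
    features: dict = field(default_factory=dict)  # semantic term per feature name

# A basic hierarchy can be a simple child -> parent map.
LABEL_HIERARCHY = {"oak": "tree", "pine": "tree", "door": "building", "window": "building"}

def ancestors(label):
    """Return the chain of increasingly general labels for a given label."""
    chain = [label]
    while label in LABEL_HIERARCHY:
        label = LABEL_HIERARCHY[label]
        chain.append(label)
    return chain
```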


After the training, the training set is examined using TDIDT. The most likely candidate algorithm is C4.5. The result is a tree structure whose branches are created by the different values of the features. Since the features have semantic meaning, traversing the tree produces a description of the features that are important for identifying different objects. The features producing branches near the top of the tree have the greatest discriminatory power and are expected to be appropriate for attention purposes.
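The following toy sketch shows the TDIDT idea on categorical semantic feature values. It splits on information gain rather than C4.5's gain ratio and omits pruning, so it only illustrates the mechanism, not the algorithm the project would actually use:

```python
import math
from collections import Counter

# Each training example is (feature_dict, label), where feature values are
# semantic terms such as "dark blue" or "very rough".

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_feature(examples, features):
    """Pick the feature whose split yields the largest information gain."""
    base = entropy([lab for _, lab in examples])
    def gain(f):
        split = {}
        for feats, lab in examples:
            split.setdefault(feats[f], []).append(lab)
        remainder = sum(len(ls) / len(examples) * entropy(ls) for ls in split.values())
        return base - remainder
    return max(features, key=gain)

def tdidt(examples, features):
    labels = [lab for _, lab in examples]
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]        # leaf: majority label
    f = best_feature(examples, features)
    node = {"feature": f, "children": {}}
    for value in {feats[f] for feats, _ in examples}:
        subset = [(feats, lab) for feats, lab in examples if feats[f] == value]
        node["children"][value] = tdidt(subset, [g for g in features if g != f])
    return node
```

For example, `tdidt([({"color": "dark green", "texture": "very rough"}, "tree"), ...], ["color", "texture"])` returns a nested dictionary whose top-level split is the most discriminative feature, which is exactly the feature one would expect to be useful for attention.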


Reporting follows the top-down induction. At this point, the robot should also produce a graphical depiction of the tree along with one or more text reports summarizing the important information in the tree. The tree itself will likely be difficult for the user to evaluate, so the text summaries are needed to point out the most important information.
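A small sketch of both report types, assuming the nested-dictionary tree from the TDIDT sketch above; the rule-sentence format and the use of Graphviz DOT for the graphical depiction are assumptions, not the project's specified output formats:

```python
def report_rules(node, path=None):
    """Turn each root-to-leaf path into a human-readable sentence."""
    path = path or []
    if not isinstance(node, dict):                 # leaf: a predicted object label
        conditions = " and ".join(f"{f} is '{v}'" for f, v in path)
        print(f"An object is likely a {node} when {conditions}.")
        return
    for value, child in node["children"].items():
        report_rules(child, path + [(node["feature"], value)])

def to_dot(tree):
    """Emit a Graphviz DOT description of the tree for the graphical report."""
    lines, counter = ["digraph tree {"], [0]
    def walk(n):
        nid = counter[0]
        counter[0] += 1
        if isinstance(n, dict):
            lines.append(f'  n{nid} [label="{n["feature"]}"];')
            for value, child in n["children"].items():
                cid = walk(child)
                lines.append(f'  n{nid} -> n{cid} [label="{value}"];')
        else:
            lines.append(f'  n{nid} [label="{n}", shape=box];')
        return nid
    walk(tree)
    lines.append("}")
    return "\n".join(lines)
```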


The next step is a dialog between the robot and the user about the features and objects in the training data. This dialog may result in changes to the tree structure, such as combining branches, pruning branches, or other modifications.
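Two plausible outcomes of such a dialog are sketched below, reusing the `tdidt` function and example list from the sketch above: the user rejects a feature as unreliable and the tree is rebuilt without it, or the user declares several semantic values equivalent and they are merged before retraining. Both functions are illustrative assumptions:

```python
def revise_tree(examples, features, rejected):
    """Rebuild the tree after the user rejects one or more features in the dialog."""
    kept = [f for f in features if f not in rejected]
    return tdidt(examples, kept)

def merge_values(examples, feature, values, merged_name):
    """User-directed merge: treat several semantic values of a feature as one."""
    for feats, _ in examples:
        if feats.get(feature) in values:
            feats[feature] = merged_name
    return examples
```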


Thus, after the training, the robot is exposed to new images. The robot should be able to compare the features in a new image with the features in the training set and suggest candidate locations for objects.
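Continuing the same sketch, classifying a new region amounts to walking the tree with that region's semantic feature values; the `suggest_candidates` helper and its region format are hypothetical:

```python
def classify(tree, feats):
    """Walk the decision tree using a region's semantic features; None if no branch matches."""
    node = tree
    while isinstance(node, dict):
        node = node["children"].get(feats.get(node["feature"]))
    return node

def suggest_candidates(tree, regions):
    """Scan candidate regions of a new image, given as (bbox, semantic-feature dict)
    pairs, and return those the tree maps onto a known object label."""
    suggestions = []
    for bbox, feats in regions:
        label = classify(tree, feats)
        if label is not None:
            suggestions.append((bbox, feats, label))
    return suggestions
```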


Finally, a new dialog between the user and the robot can take place based on the new results. The robot may report the objects it believes it has found, and the user can confirm or correct the robot's choices.
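One simple way to close this loop, again assuming the sketches above, is to fold the confirmed and corrected suggestions back into the training set and retrain; the function and its argument format are illustrative:

```python
def incorporate_feedback(examples, suggestions, corrections, features):
    """Fold the user's confirmations and corrections back into the training set
    and rebuild the tree. `corrections` maps a bbox to the label the user supplies;
    suggestions the user merely confirms keep the robot's own label."""
    for bbox, feats, robot_label in suggestions:
        examples.append((feats, corrections.get(bbox, robot_label)))
    return tdidt(examples, features)
```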


During the initial phase of this project, MATLAB will be used since it makes it easy to try out and implement new ideas. Ultimately, the project is intended to be implemented using the Microsoft .NET Framework.


Robots used: In this project, Rocky, a MATILDA Robotic Platform from Mesa Associates, Inc., is used. The robot is equipped with an omni-directional vision system that gives a 360-degree view of the robot's environment. The omni-directional system consists of a parabolic mirror mounted on top of a Sony digital video camera. The images from the camera are captured through the IEEE 1394 (FireWire) port.
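The mirror produces a ring-shaped image around the optical center, which is typically resampled into a conventional panorama before feature extraction. The source does not describe this step, so the sketch below, including the `unwarp` name and its parameters, is only an assumed illustration of how such unwarping could work:

```python
import numpy as np

def unwarp(omni, center, r_min, r_max, width=720, height=120):
    """Resample the ring between radii r_min and r_max (pixels) around `center`
    into a panorama, using nearest-neighbour sampling; assumes the ring lies
    fully inside the captured omni-directional image."""
    cx, cy = center
    theta = np.linspace(0.0, 2.0 * np.pi, width, endpoint=False)
    radius = np.linspace(r_min, r_max, height)
    xs = np.rint(cx + np.outer(radius, np.cos(theta))).astype(int)
    ys = np.rint(cy + np.outer(radius, np.sin(theta))).astype(int)
    return omni[ys, xs]          # shape: (height, width) or (height, width, 3)
```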


Previous Work: This work is an extension of the studies on egocentric navigation [1] and visual perception correction [2].


References:

[1] K. Kawamura, A. B. Koku, D. M. Wilkes, R. A. Peters II, and A. Sekmen, “Toward Egocentric Navigation,” International Journal of Robotics and Automation, vol. 17, no. 4, November 2002.

[2] T. Keskinpala, M. Wilkes, K. Kawamura, and A. B. Koku, “Knowledge Sharing Techniques for Egocentric Navigation,” International Conference on Systems, Man and Cybernetics, Washington, DC, October 5-8, 2003.