Creating Intelligent Systems
Using SOFM
Lior Elazary, elazary@usc.edu
This paper describes the implementation of an intelligent system that uses Self-Organizing Feature Maps (SOFM) as its basis. The SOFM allows the system to adapt to its environment and addresses some traditional problems, such as when and how to learn. This is done by having the SOFM act as a generalized lookup table for mappings between input space and output space. Two changes are made to the original SOFM: the first lets the system know when it needs to learn; the second lets the system know which mapping between input space and output space helped to achieve its goal, so that it can try to speed up that mapping. The end result is a system that performs the mapping without any feedback. To test this claim, two experiments are performed. The first uses an eye that tries to foveate on an object, and the second uses the eye and a 6 DOF robotic arm to perform visuo-motor coordination mappings to point at the object. Both experiments are evaluated by changing the environment in which these systems operate and finding their tolerance to changes.
The purpose of this project is to develop an intelligent system. In order to evaluate whether a system is intelligent, intelligence first needs to be defined. In this context, an intelligent system is defined as any system that is able to adapt in order to achieve its goal. Furthermore, a system consists of anything that has the ability to sense its environment (whether internally, as with muscle tension or encoders, or externally, as with vision, smell, or touch) and to do work on that environment (through motor or muscle actions). There are many other definitions of intelligence, but this one is the basis for creating the system in this project. A tick that bites everything that has a temperature of +37 °C and smells of butyric acid [Lorenz, 1977] would not be considered intelligent by itself. However, if, for instance, dogs started smelling of something other than butyric acid and the tick were able to recognize that and change its behavior accordingly (to achieve its goal of getting food from dogs), then it would be considered intelligent. Note that if evolution caused the tick to change, then the intelligent system is the tick and nature taken together.
Creating an adaptable system would enable that system to operate on its own and achieve its goal without interruption. Furthermore, if we attempt to create more complex robots for a constantly changing environment, making them intelligent is almost a must. For example, suppose a robot arm were given the task of performing surgery to make a connection between the liver and the small intestine [Conger, 2004]. If one of the motors failed, then, given redundant movements, the robot should be able to adapt to the change and continue pursuing its goal. Many biological organisms seem to possess intelligence (especially some humans) that is still highly superior to that of present-day systems. However, these biological systems can inspire many control methods to further improve artificially made systems.
One way of creating an intelligent system is to provide the system with every possible action that it would need to perform. A system of this kind would not be considered intelligent, but if it is able to achieve its goal without interruption then that would not matter. The ultimate system of this kind would be one with no external sensors and a single internal clock sensor. Systems like these have been implemented successfully in the automotive industry. The first robotic arm was able to successfully retrieve bolts from hot metal tanks as the bolts were being formed. Nature also has biological systems that operate on internal sensors or very few external sensors. These biological systems are often very simple, like the tick or single-celled organisms. However, the environments these systems are placed in are very static and rarely change.
When the systems become more complex and their environments start to change around them, they often fail. If a person walks in front of a robotic arm that is performing spot welding, then both the person and the arm would probably be damaged. It is also often the case that the arm would never be able to achieve its goal of spot welding again without some kind of outside intervention. One remedy that has been proposed is to add more sensors and more rules to the system. However, since this requires a great deal of predicting the future, it becomes very difficult to implement such a system in a dynamic environment. It is also the case that a complex system might need an infinite number of rules in order to function in a changing environment.
This has led to the problem known as the “frame problem”, which was first proposed by John McCarthy and Pat Hayes in their seminal essay “Some Philosophical Problems from the Standpoint of Artificial Intelligence” in 1969 [Dennett, 1999]. In its simplest form, the frame problem asks how to specify what remains the same in a changing environment. In other words, how do we know what to program the system with if the environment keeps changing?
Nature has solved some of these problems through learning. Many biological systems possess the ability to learn in order to adapt to their environment, and those that do not are able to change through evolution (a form of learning). Learning in the context of this paper is taken to be the act of acquiring the rules to perform the mapping between the input (sensor) space and the output (motor) space. However, as the frame problem suggests, there could be an infinite number of rules to learn. Nature has solved this problem as well, by creating systems that are able to recognize patterns. The systems therefore need to acquire only a subset of the rules in order to function properly. Furthermore, because we live in a fairly static environment (the world does not change locally every second), the systems are able to generalize well enough to function.
The ability to recognize patterns has two consequences. First, the systems are able to acquire a smaller subset of rules and generalize over the rest, which reduces the computational overhead of acquiring the rules as well as of acting upon them. Second, the systems are able to predict an outcome based on current conditions, so they can predict the consequences of a specific action and determine ahead of time whether this action would help achieve the goal.
Once the specific mapping has been learned, the system can use this as a saccade map to produce fast, smooth and coordinated actions from the input space without any complex computations or feedback control. This is very similar to the lookup tables used previously in AI. When a pianist plays a fast arpeggio, the visual system cannot be used to guide the hand: visual feedback takes about 200 ms [Kawato, 1995], which is too slow for properly moving the fingers. So when the pianist first starts to learn the arpeggio, visual feedback is used to guide the hand into the proper movement. However, once the movement is learned, the pianist can move his fingers very quickly without any visual feedback. This is all possible because the keyboard does not change. If the keyboard changed, the pianist would need to retrain to adapt to the change.
All of the features described above create a system that is very robust to changes in its environment. Since the system is able to adapt in a very efficient manner, achieving its goal remains feasible despite the changes. This paper will argue that using SOFM as a basis for the mapping between the input space and output space provides the means of creating such systems.
Map-like structures have been known to exist in the brain for quite some time. For instance, there is some evidence that an abstract map forms when a rat learns its location in a maze. Certain cells in the hippocampus of the rat’s brain respond only when the rat is at specific locations in the maze [Olton, 1997]. Other maps have been found in the cortex, thalamus, hippocampus and the motor cortex.
To model these maps in the brain, Kohonen proposed the use of self-organizing feature maps [Kohonen, 1998]. The objective of the network is to map similar feature vectors topologically. This means that similar inputs lie within the same region of the map, which gives the network its ability to generalize. If a feature vector that the network has never been trained on is presented to it, the network chooses the closest feature vector that it knows of. This allows the network to be used as a generalized lookup table.
In Kohonen’s model, the network consists of an input layer and an output layer, which is fully connected to the input layer. This output layer is called the Kohonen layer. When the network is first initialized, the weights of the Kohonen layer (each node’s weight vector has the same dimension as the input layer) are set randomly. When the input layer receives an input vector, the network tries to find the node that best matches it. The similarity between the vectors can be measured using either the inner product or the Euclidean distance [Kohonen, 1998]. In this project the Euclidean distance was used.
Once the similarity between the input vector and each node has been found, the node with the smallest distance (the most similar) is chosen and labeled the winner in a “winner takes all” scheme. The weights of this node are then adjusted to be more similar to the input. This change propagates throughout the network. However, as the nodes get farther from the winner node, their change becomes less and less significant, often according to a Gaussian distribution. This behavior gives the network the ability to place similar feature vectors close together topologically.
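The winner-takes-all update described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in this project: the names `find_winner` and `train_step`, the list-of-lists grid layout, and the use of grid distance inside the Gaussian are all choices made for the example.

```python
import math

def find_winner(weights, x):
    """Return (row, col) of the node whose weight vector is closest to x."""
    best, best_d = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = math.dist(w, x)  # Euclidean distance
            if d < best_d:
                best, best_d = (r, c), d
    return best

def train_step(weights, x, alpha, sigma):
    """Pull every node toward x, scaled by a Gaussian of its grid distance to the winner."""
    wr, wc = find_winner(weights, x)
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            grid_d2 = (r - wr) ** 2 + (c - wc) ** 2
            h = alpha * math.exp(-grid_d2 / (2.0 * sigma ** 2))
            for k in range(len(w)):
                w[k] += h * (x[k] - w[k])  # in-place update of the node's weights
    return wr, wc

# Usage: a 2x2 map with 2-D weights, one training step toward (1, 1).
grid = [[[0.0, 0.0], [10.0, 0.0]],
        [[0.0, 10.0], [10.0, 10.0]]]
winner = train_step(grid, [1.0, 1.0], alpha=0.5, sigma=1.0)
```

The winner moves the most and the diagonal neighbor the least, which is what lets nearby nodes end up representing similar inputs.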
The initial size of the neighborhood function is defined to be the entire network. It then drops exponentially with the number of training steps the network has completed. This allows the network to converge faster. As a result the network becomes able to generalize over a large subset of inputs. However, if the network becomes too specific (each node is in charge of one distinct feature vector), the network will lose its ability to generalize effectively. This often happens when the network size is too small for its sample base. The desired outcome is for the network to blend the feature vectors in such a way that it can generalize effectively for any input vector.
SOFM can be used to solve some of the problems outlined above. One learning algorithm is called the “extended self-organizing feature map algorithm” [Schulten & Ritter 1989], in which the maps can be used as a generalized lookup table to produce outputs. In its simplest form, the output from the output layer in the SOFM can be tied to the motors directly. This is very similar to the SOFM proposed by Kohonen, except that additional output weights are connected from the nodes to some motor control device. The process of learning makes the incoming weight vectors more similar to the input pattern, while the output vectors control some movements to reduce a predetermined error value. The extension of the original algorithm is that both input weights and motor output weights benefit from the neighborhood function and are able to generalize. This also makes the system converge at a much faster rate. A variation of this scheme was used for this project.
Once a winner is chosen, its output weights can be used to control the motors. For instance, suppose a two-dimensional object position (which can be obtained by recognizing color blobs) is presented to the network and a winner is chosen. The output of the winner can be adjusted so that it controls two servos (pan/tilt) to have a camera foveate on the object (figure 1).
Figure 1. Typical network connections. The network inputs are fully connected to the Kohonen layer, which is fully connected to the outputs. W1 and W2 are the weights between the input layer and the Kohonen layer, and Wo1 and Wo2 are the output weights from the Kohonen layer to the output layer. The outputs are copied directly to the motor movements M1 and M2. Note: only 4 weights are shown for clarity.
As the network learns the mapping, the map can soon be used as a lookup table to saccade the servos to foveate on the object. Note that the map does not need to be the same size as the image; in fact, it can be much smaller. Nor does the network need to be trained with every single mapping. The network (if built and trained properly) should be able to generalize to other locations. As a consequence, even if the object falls on a portion of the eye that the system has never seen before, it would still be able to successfully foveate on the object. Furthermore, the object might be in the same place, but the color blob detection algorithm (as is often the case) might not produce the same value for the location of the object. The system would still be able to foveate on the object because of its ability to generalize.
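The lookup-table behavior described above can be illustrated with a small sketch. It assumes a trained map stored as a list of (input weights, output weights) pairs; the function name `lookup` and the numeric values are hypothetical and chosen only to show that an input the map was never trained on still resolves to the nearest node's motor output.

```python
import math

def lookup(nodes, x):
    """Return the output weights of the node whose input weights are closest to x.

    nodes: list of (input_weights, output_weights) pairs (assumed layout).
    """
    _, out = min(nodes, key=lambda n: math.dist(n[0], x))
    return out

# Hypothetical three-node map: blob position -> (pan, tilt) correction.
trained = [
    ([0.0, 0.0], [-10.0, -10.0]),
    ([160.0, 120.0], [0.0, 0.0]),
    ([320.0, 240.0], [10.0, 10.0]),
]

# A position the map never saw, and a slightly noisy reading of it,
# both resolve to the same node and therefore the same motor command.
pan_tilt = lookup(trained, [150.0, 130.0])
pan_tilt_noisy = lookup(trained, [153.0, 127.0])
```

This is why noise in the blob detector does not have to change the resulting motor command, as long as the noisy position stays closest to the same node.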
This behavior solves the problem of learning an infinite number of rules, since the system only learns a small subset of the rules and can use the map to generalize over the rest. However, a method of learning the rules still needs to be devised. Using the Kohonen algorithm, the map is able to categorize the input space in an unsupervised manner. What remains is that, once a winner is chosen, the output weights connected to the motors need to be changed to produce the correct actions.
A few systems for controlling a
robotic arm to point at an object using vision have been attempted before both
in simulation and in real working robots [Walter & Schulten, 1993; Ritter
et al., 1989; Kuperstein, 1988; Chen, 1997; Gaskett & Cheng 2003; Zeller
& Wallace & Schulten 1995]. Most of these systems use a variation of
the “extended self-organizing feature map algorithm”. However, some of them
differ in the way the neighborhood function updates the weights and others have
either used different methods or modified the original SOFM to organize the
map. Most of the works discussed below and those mentioned above had great
success in achieving an adaptable system using SOFM.
One of the systems, developed by [Walter & Schulten, 1993], used a SOFM for visual-motor control of an industrial robot (Puma 562). They tried two neighborhood updating algorithms. The first was the “neural-gas” algorithm proposed by Martinetz and Schulten [Schulten and Martinetz, 1991] and the other was the original algorithm proposed by Kohonen [Kohonen, 1998]. Both methods had great success, achieving a final position accuracy of 1.3 mm or 0.1% of the linear dimension of the robot’s workspace.
The hardware used to test the system was a Puma 562 6-degree-of-freedom robotic arm connected to a Unimation controller. Each servo (one for each revolute joint) was controlled separately by an LSI-11/73 CPU. The original VAL II robot language on the controller was replaced by a Sun Sparcsystem 4/370 Unix workstation running a software package called RCCL/RCI (Robot Control C Library and Real-time Control Interface). Two monochrome CCD cameras with 560x480 resolution were fixed toward the robot’s workspace at an angle of about 50°. Using a miniature lamp on the robot’s end effector, the cameras were able to provide two input vectors (one from each camera) to the network. A 4-dimensional target location (from the two cameras) was provided, and the network’s goal was to position the end effector to match the target vector.
Their network consisted of between 100 and 400 nodes, which provided the mapping between the input vector of the cameras and the output vectors of the motors. Two different algorithms were used as the learning rule. The first was the “extended self-organizing feature map algorithm” described above. The second, the “Neural Gas” algorithm, was similar but differed in the neighborhood function. The main difference is that the nodes around the winner are no longer updated in topological order, but by their “closeness” to the input. That is, the node that is most similar to the input (the winner) learns the most. The next node, which ranks second in similarity, learns a little less. Subsequent nodes learn less and less based on their rank in the ordered sequence of similarity. The same Gaussian function that was used for the Kohonen layer was also used for determining the amount of learning. This type of learning makes the network more flexible in its topology mapping if the spatial relationship is not homogeneous.
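The rank-based update can be sketched as follows. This is an illustrative reading of the description above, not code from [Walter & Schulten, 1993]: the function name `neural_gas_step` is hypothetical, the nodes are kept in a flat list because grid position no longer matters, and the Gaussian is applied to the rank as the text states (other formulations of neural gas use a different decay over rank).

```python
import math

def neural_gas_step(weights, x, alpha, sigma):
    """Update every node toward x, with strength set by its similarity rank.

    weights: flat list of weight vectors. Returns node indices ordered
    from most to least similar to x (rank 0 is the winner).
    """
    order = sorted(range(len(weights)), key=lambda i: math.dist(weights[i], x))
    for rank, i in enumerate(order):
        # Gaussian of the rank, not of the grid distance to the winner.
        h = alpha * math.exp(-(rank ** 2) / (2.0 * sigma ** 2))
        weights[i] = [w + h * (xk - w) for w, xk in zip(weights[i], x)]
    return order

# Usage: three 2-D nodes, one step toward (1, 0).
gas = [[0.0, 0.0], [4.0, 0.0], [10.0, 0.0]]
ranking = neural_gas_step(gas, [1.0, 0.0], alpha=1.0, sigma=1.0)
```

Because the update depends only on the ordering by distance to the input, nodes that are far apart on any lattice can still learn together when they respond to similar inputs.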
Their system performed well, achieving an accuracy of 0.1% of the linear dimension of the robot’s workspace. The system was able to achieve this accuracy within 3000 learning steps. Furthermore, after 3000 learning steps they elongated the last arm segment by 10 mm. After this dramatic change, only 300 iterations were needed to bring the robot back to its previous accuracy of 0.1%. This experiment helped to show the adaptability of the system to a changing environment.
Both learning algorithms performed well and learned the mapping in under 3000 learning steps, positioning the end effector within 0.1% of the linear dimension of the robot’s workspace. Their conclusion was that the extended self-organizing feature map algorithm is slightly more dependent on proper tuning of the learning parameters than the “neural gas” algorithm. Therefore, the amount of time required to untangle the network and spread it over the input space relied heavily upon the choice of these parameters for the extended self-organizing feature map algorithm. However, a closer look at their graphs comparing the two algorithms’ performance revealed that the extended self-organizing feature map algorithm was able to recover from the dramatic change faster than the “Neural Gas” algorithm.
Another approach to controlling a robotic arm using SOFM was taken by [Zeller & Wallace & Schulten 1995]. Their approach was more similar to the way biological systems adapt to changes. They adopted a task-related strategy in which the traditional absolute endpoint mapping was coupled with an additional component providing information about relative movement for task-related mapping.
Their system was implemented on a
pneumatically driven robot arm (SoftArm). Air pressure was used to control the
limb of the robot instead of traditional servo drives. This gave the robot a
“muscle control” more similar to a biological system. The control of each limb
was made by means of agonist-antagonist pairs of rubbertuators that were
mounted on opposite sides of rotating joints. Varying the air pressure to the
rubbertuators could now also control the stiffness of the movement of each
limb. A more detailed description of the system could be found in [Hesselroth
et al., 1994].
The extension to the algorithm they
proposed was to add another layer, which consisted of a set of neurons. This
layer was then connected onto the traditional Kohonen layer that contained the
‘projections’ to the motor cells. These connections in effect would be
organized by task related ‘zones’. Therefore, distinct tasks are reflected in
distinct patterns of the projections from the proposed layer to the Kohonen
layer. This would result in different patterns of activations of the motor
cells.
The performance attained by this system was impressive. An error of 1% of the workspace for a set of target points in the workspace could be obtained within 100 iterations. They also performed successful simulations in which the algorithm controlled a 3-segment limb moving in 3 dimensions with 4 degrees of freedom.
Kuperstein created a more complex model using SOFM, in which both the
cameras and the arm were allowed to move [Kuperstein, 1988]. Therefore, the
input to the network consisted of both the image location and the eye position.
The network was trained by first moving the arm to a random location. Images
from the eye were then used to produce motor maps and gaze maps. The visual
maps and gaze maps were then combined to produce the motor signals. This
experiment showed an average position error of 4%.
However, all of these systems used an error function to determine the desired movement. Placing a non-adaptable function in a system makes the system less adaptable. If the system undergoes a major change, it is feasible that the function would no longer hold true. Moreover, learning in these systems always proceeds in the same way, whereas it should also explore the space and learn other mappings (even wrong mappings). This is because the system could sometimes encounter a situation where it thinks it is improving when in fact it is not, or vice versa; the system should then be able to look at the global improvement as opposed to local improvements. Furthermore, most of these systems included two modes, a training mode and a live mode, and an outside intervention was required to choose between the two.
The system proposed here attempts to solve the learning problems described above when using the SOFM. It does this by having a conscious and an unconscious behavior, in which the unconscious behavior does not need extensive feedback. Furthermore, the network is able to indicate to the system when it has learned the mapping, without outside intervention. The system is also able to explore the space and learn about different moves and their consequences. This should create a more robust system that is able to handle a greater range of environmental changes.
The network used was similar to the “extended self-organizing feature map algorithm” outlined above. However, two components were added. The first is a score for each node, which is the error value for the object being controlled; in this experiment it was simply the Euclidean distance from the object to the center of the image. This defines the goal of the system. However, more work is needed on how to represent the goal in a more abstract way. The second is a step size for each output dimension. This attaches a tuning vector to the output vector and “pulls” it in the direction that minimizes the score. The score was used to keep track of how the node performed the previous time it was the winner. If the winner node’s score was below the required precision, then the node would simply perform the motor commands derived from its output weights. However, if the score was above the required precision, then it would use the step size to try to minimize the score. This places the network in one of two modes: the first is the conscious mode, in which learning is facilitated by monitoring the movement, and the second is the unconscious mode, in which the movement is performed with no monitoring or adjustment to the network until the end.
As described above the network used was similar to the “extended self-organizing feature map algorithm”. The network consisted of an input layer, a Kohonen layer and an output layer. All the layers were fully connected to each other. The input layer consisted of two nodes which described the x,y position of an object in an image. The Kohonen layer consisted of 50x50 nodes arranged in a rectangular fashion. The output layer consisted of 2 nodes fully connected to the Kohonen layer (figure 1).
Each node in the Kohonen layer included a score value and a step-size value for each output node. The score was used to keep track of how well the node performed, and the step-size values were used to adjust the output weights to minimize the score. The step sizes are constant in magnitude; only their direction (plus or minus) changes. This can be visualized as the step-size values pulling the output vector in a direction that minimizes the score (figure 2). In this way, both the magnitude and the direction of the output are learned.
Figure 2: A single node is depicted using its output vector and step sizes. The output vector is visualized as a vector that the step-size vector “pulls” in the correct direction.
The network starts with the Kohonen weights initialized to values between 0 and 320 in both the vertical and horizontal directions. The purpose of this is to speed up learning of the input space. The score of every node is initialized to 0, and the step sizes for both the x and y directions are initialized to –20.
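The initial state described above can be sketched as a small data structure. Several details are assumptions made for this illustration and are not stated in the text: the Kohonen weights are drawn uniformly at random from the range 0 to 320, the output weights start at zero, and the names `Node` and `make_network` are hypothetical.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    kohonen: list                 # input-side weights (x, y)
    output: list                  # motor output weights (one per motor)
    score: float = 0.0            # 0 marks a node that has not been scored yet
    step: list = field(default_factory=lambda: [-20.0, -20.0])

def make_network(rows, cols, rng, lo=0.0, hi=320.0):
    """Build a rows x cols lattice of freshly initialized nodes."""
    return [
        [
            Node(kohonen=[rng.uniform(lo, hi), rng.uniform(lo, hi)],
                 output=[0.0, 0.0])   # assumed starting value
            for _ in range(cols)
        ]
        for _ in range(rows)
    ]

# Usage: a small lattice for experimentation (the project used 50x50).
net = make_network(3, 4, random.Random(0))
```

Using `default_factory` gives every node its own step-size list, so changing the direction for one node does not affect the others.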
Once an input is presented to the network, every node in the Kohonen layer is compared to the input vector using the Euclidean distance formula:
$$ d(I, K) = \sqrt{\sum_{i=1}^{n} (I_i - K_i)^2} \qquad (1) $$
where I is the input vector and I_i is one component of the vector, and K is the weight vector in the Kohonen layer for each node and K_i is one component of that vector. The sum is taken over n dimensions, which in this project was 2 for the x and y coordinates of an object.
The unit in the Kohonen layer that is the closest to the input is then chosen according to the formula:
$$ \text{winner} = \arg\min_{i} \lVert I - K_i \rVert \qquad (2) $$
where K_i here denotes the weight vector of one node in the Kohonen layer. The node with the smallest Euclidean distance between I and K_i over all nodes in the Kohonen layer is chosen. This node is then labeled the winner. The weights between the winner unit and the output layer are then used as the motor commands.
Once a winner is chosen, the score of that winner is compared with the required precision. The network then enters one of the two modes described above based on the result. If the unit’s score is zero or above the precision, the network enters the conscious mode. If, however, the score is not zero and is below the precision, the network enters the unconscious mode and the motor commands are performed. Since all the nodes start with a score of zero, the network enters the conscious mode the first time each node wins.
$$ \text{Network mode} = \begin{cases} \text{conscious} & \text{if } score = 0 \text{ or } score > precision \\ \text{unconscious} & \text{if } score \neq 0 \text{ and } score < precision \end{cases} \qquad (3) $$
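The mode rule can be written as a short function. This is a sketch with hypothetical names (`select_mode`, `precision`); the text does not say what happens when the score equals the precision exactly, so the sketch assumes that case falls into the conscious (learning) mode.

```python
def select_mode(score, precision):
    """Return which mode the network enters for the winner's stored score."""
    # A score of exactly 0 marks a node that has never been trained.
    # Assumption: score == precision is treated as "not yet precise enough".
    if score == 0 or score >= precision:
        return "conscious"
    return "unconscious"

# Usage with a hypothetical precision of 5 pixels.
modes = [select_mode(s, 5.0) for s in (0.0, 3.0, 7.0)]
# modes == ["conscious", "unconscious", "conscious"]
```

One consequence of using zero as the "untrained" marker is that a node whose measured score is exactly zero is indistinguishable from a fresh node and will be trained again.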
The conscious mode is where learning is performed. The learning algorithm follows the one proposed by Kohonen for training feature maps [Kohonen, 1998].
The target motor commands are then taken to be the output weights from the winner node plus the step size:
$$ \text{MotorOutput}_j = O_{ij} + \text{StepSize}_{ij} \qquad (4) $$
where i is the winning node and j is one component of the output vector (one for each motor). The movement is then performed using these values. After the move is done, the score is computed. The score is simply an error that we are trying to minimize. In this project the score was the distance of the object from the center of the image, computed using the Euclidean distance. The score is then compared against the winner’s stored score. If the winner was just initialized then its stored score is zero; otherwise it is the score from the last movement. If the stored score is zero, then the step sizes for the winning node are left alone and the network learns the input, motor output and score. If the new score is larger than the previous score, then different step sizes are tried randomly.
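One conscious-mode trial, as described above, might look like the following sketch. The function name `conscious_trial` and the callback `move_and_score` (which stands in for moving the servos and measuring the resulting score) are hypothetical. "Different step sizes are tried randomly" is interpreted here as redrawing the sign of each step while keeping its magnitude, which is an assumption based on the earlier statement that only the direction changes.

```python
import random

def conscious_trial(output, step, prev_score, move_and_score, rng):
    """Run one trial: move by output + step, then adapt the step direction.

    Returns (motor_command, new_score, step_for_next_trial).
    """
    motor = [o + s for o, s in zip(output, step)]   # equation (4)
    new_score = move_and_score(motor)
    if prev_score != 0 and new_score > prev_score:
        # The move made things worse: try a random combination of directions.
        step = [rng.choice((-1.0, 1.0)) * abs(s) for s in step]
    return motor, new_score, step

# Usage with a stand-in scorer (sum of absolute motor values).
rng = random.Random(1)
scorer = lambda m: abs(m[0]) + abs(m[1])
motor, score, step = conscious_trial([10.0, 10.0], [-20.0, -20.0], 0.0, scorer, rng)
```

With a stored score of zero the node is treated as new, so its step sizes are left alone on that first trial.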
Training is performed by making the winning node more similar to the input and to the new score. The motor outputs are assigned to the winner. This information is then propagated through all the nodes according to a Gaussian neighborhood kernel function [Kohonen, 1998].
$$ h_{ci} = \alpha \exp\left( -\frac{d_{ci}^{2}}{2\sigma^{2}} \right) \qquad (5) $$
where d_ci is the distance between the winning node and a neighboring node, calculated using the Euclidean distance, α is the “learning-rate factor”, and the parameter σ is the size of the kernel. Both α and σ are adjusted after each iteration to shrink the neighborhood function over time and decrease the change in learning. Both values were chosen to drop exponentially over 5000 steps. However, σ was never allowed to drop to zero, so the neighbors within a radius of two would always keep changing.
(6)
(7)
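The decay of the learning-rate factor and the kernel size can be illustrated with a sketch. The exact formulas of equations (6) and (7) are not reproduced here, so the code assumes one common exponential form, value(t) = start * (end / start)^(t / T) with T = 5000; the start and end values, the function names, and reading "a radius of two" as a floor of 2 on the kernel size are all assumptions.

```python
def exp_decay(start, end, t, total_steps=5000):
    """Exponential interpolation from start (t = 0) to end (t >= total_steps)."""
    frac = min(t, total_steps) / total_steps
    return start * (end / start) ** frac

def kernel_size_at(t, start=25.0, end=2.0, floor=2.0):
    """Kernel size over time, never allowed to fall below the floor."""
    return max(floor, exp_decay(start, end, t))

def learning_rate_at(t, start=1.0, end=0.01):
    """Learning-rate factor over time (hypothetical start and end values)."""
    return exp_decay(start, end, t)

# Usage: sample the schedule at the start, midpoint and end of training.
schedule = [(t, learning_rate_at(t), kernel_size_at(t)) for t in (0, 2500, 5000)]
```

Keeping a floor on the kernel size means that nearby nodes continue to share what the winner learns even late in training.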
The weights between the winning node and the output nodes were also adjusted according to the equations above. However, the learning-rate factor for the motor output weights was held at a constant 1 to produce faster results. The score was also adjusted for neighboring nodes according to the equations above. However, the score increased instead of decreasing. It was found that the score often increased too much and hindered the learning process. The update rule was therefore changed to update only nodes whose score was zero. Further work is needed to update the score properly. At present the step sizes are simply copied to any node with an h_ci value greater than 0.1.
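The propagation rules in this paragraph can be combined into one sketch. It is an interpretation under stated assumptions rather than the project's code: the function name `propagate` and the dictionary layout are hypothetical, the winner is assumed to store the measured score directly, and a zero-score neighbor is assumed to move toward the new score by the factor h_ci.

```python
import math

def propagate(nodes, winner, x, motor, score, step, alpha, sigma):
    """Spread what the winner learned to the rest of the lattice.

    nodes: dict mapping (row, col) -> {"k": [...], "o": [...], "score": float, "step": [...]}
    """
    for pos, n in nodes.items():
        d2 = (pos[0] - winner[0]) ** 2 + (pos[1] - winner[1]) ** 2
        g = math.exp(-d2 / (2.0 * sigma ** 2))   # Gaussian part, 1.0 at the winner
        h = alpha * g                            # neighborhood value h_ci
        # Kohonen (input) weights use the decaying learning-rate factor.
        n["k"] = [k + h * (xi - k) for k, xi in zip(n["k"], x)]
        # Motor output weights use a constant learning-rate factor of 1.
        n["o"] = [o + g * (m - o) for o, m in zip(n["o"], motor)]
        if pos == winner:
            n["score"] = score                   # assumption: winner stores the new score
        elif n["score"] == 0:
            n["score"] = h * score               # only unscored neighbors are updated
        if h > 0.1:
            n["step"] = list(step)               # copy step sizes to close neighbors

# Usage on a tiny three-node lattice.
lattice = {
    (0, 0): {"k": [0.0, 0.0], "o": [0.0, 0.0], "score": 0.0, "step": [-20.0, -20.0]},
    (0, 1): {"k": [0.0, 0.0], "o": [0.0, 0.0], "score": 0.0, "step": [-20.0, -20.0]},
    (0, 5): {"k": [0.0, 0.0], "o": [0.0, 0.0], "score": 3.0, "step": [-20.0, -20.0]},
}
propagate(lattice, (0, 0), [10.0, 10.0], [5.0, 5.0], 8.0, [20.0, -20.0], alpha=0.5, sigma=1.0)
```

Because the Gaussian part equals 1 at the winner, the winner's output weights become exactly the motor command that was tried, matching the statement that the motor outputs are assigned to the winner.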
The software to simulate and control the system was developed using C++ and GTK. The networks were displayed in a GUI together with various parameters in order to evaluate and tune them. A control panel was created in which various parameters could be chosen in order to manipulate the system. Furthermore, the SOFM were created as objects to give the most flexibility in using them for other applications.
The first component that was developed was the SOFM object. Some of the code was derived from Karsten Kutza’s pole-balancing solution using SOFM. However, a few modifications were made, such as converting the code to be object oriented and implementing the score and step-size systems. The constructor of the object created the SOFM with a given name, number of inputs, Kohonen lattice size, number of outputs, and initializing parameters such as the learning-rate factor and the kernel size. After the object was created, several methods could be called to operate the system.
Initialize and random methods were also developed to initialize the network to predetermined values and to randomize the weights, respectively. Methods to set the input, propagate the input and retrieve the output were also created. All the nodes were processed serially, one by one; however, the nodes could potentially be processed in parallel for better performance. Furthermore, methods to write the network to a file and read it back were created so the network could be stored and retrieved at a later time.
A way to visualize the network was created using GTK and the Gnome tool set (figure 3). The display was organized into 4 quadrants. The top left quadrant displayed information about the network. This included the learning-rate factor and kernel-size values, the h_ci (the neighborhood function) and the average score. The top right quadrant showed information about each node, including the input and output weights, the score and step sizes, and the Euclidean distance to the input. The bottom quadrants were used to display the input, the output and the Kohonen layer. Each node was represented as a square, which could change its color and carried a value underneath. The value and color depended on the view type that was selected. Scroll bars were also added to the Kohonen layer so that a large network could be viewed on a screen by scrolling through the nodes.
To examine the network further, it could be displayed at run time using various display parameters, which could also be chosen at run time. These parameters included the Euclidean distance, the magnitude of the input and output vectors, the neighborhood function, a two-color view of the input and output weights, and the score at each node. In the Euclidean distance view, the final distance value was shown underneath each node and the node was drawn in green with an intensity that reflected the value. The color value was scaled according to the maximum and minimum values over all the nodes, so that nodes with low values appeared black and nodes with larger values appeared bright green. The winner was displayed in white. The magnitude of the input and output weights could be displayed in the same manner as the Euclidean distance. The input and output weights could also be displayed with two colors (green and red), where the intensity of green and red reflected the first and second weight values respectively. For example, input weights of [0 0] would be shown as black and [maxvalue maxvalue] would be shown as orange. This gave a way to check whether the network was ordering correctly. Finally, the score and the neighborhood function could be displayed in a similar fashion to the Euclidean distance.
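The min/max color scaling described above can be sketched in a few lines. This is only an illustration, assuming 8-bit color channels; the function name `green_intensities` and the sample values are hypothetical and do not come from the original software.

```python
def green_intensities(values, winner_index=None):
    """Map node values to (R, G, B) colors scaled between the min and max value."""
    lo, hi = min(values), max(values)
    span = hi - lo
    colors = []
    for i, v in enumerate(values):
        if i == winner_index:
            colors.append((255, 255, 255))  # the winner is drawn in white
            continue
        # Lowest value -> black, highest value -> bright green.
        level = 0 if span == 0 else round(255 * (v - lo) / span)
        colors.append((0, level, 0))
    return colors


# Hypothetical Euclidean distances for three nodes; node 0 is the winner.
distances = [0.0, 1.0, 5.0]
print(green_intensities(distances, winner_index=0))
```

Scaling against the current minimum and maximum keeps the display readable whatever the absolute range of the values is, at the cost of the colors not being comparable between two different frames.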
A few more controls were built to manipulate the eye, the arm, and the system as a whole. The eye control panel provided scroll bars to set the color thresholds for the finger and the object, as well as camera settings such as brightness and hue. It also allowed the pan and tilt of the eye to be moved manually. An arm control panel controlled the movement of the arm and provided a way to test the postural primitive movement. Other controls included stepping and running through the system, as well as the training parameters.
All output from the system was captured in a log file. Perl scripts were then used to extract the data from the log file and the network file and to save it as various data files. These data files were graphed using Gnuplot to visualize and evaluate the system.
Figure 3: Screenshot of the software
A camera with a pan and tilt mechanism was built to test the system. The camera was a Logitech 4000 USB camera running at 320x240 resolution, with standard RC servos controlling the pan and tilt. The software program written in C++ and GTK, as described above, was used to simulate the network and control the system. Color blob detection was used to find objects in the visual field and report their centroids. The whole system ran on a Linux machine with 256 MB of RAM and a 600 MHz CPU. The goal of the system was to foveate on a given object, that is, to move the pan and tilt servos in such a way that the object would end up in the center of the visual field.
Figure 4: The eye
The pan and tilt mechanism that controls the camera was built from simple RC servos available at any hobby shop. These are proportional servos that use an internal feedback loop with a potentiometer to hold a given position. The servos are fairly accurate, but their accuracy varies under different power conditions. The system outlined above should compensate for any such inconsistencies. An IsoPod DSP microprocessor was used to control the servos over a serial connection. Furthermore, the servos were not calibrated or positioned in any specific way. It was up to the network to find the control signals required to foveate on the object.
In order to get the x and y position of an object in an image, the CMVision method developed by James Bruce was used [Bruce, 2000]. This method decomposes the multidimensional threshold into one Boolean lookup table per color dimension instead of comparing each value against its bounds separately. To find whether a given pixel is in a particular color class, a simple bitwise AND is taken across the table entries for the pixel's values. This technique speeds up the search for blobs of color. For example, a given “orange” class in a 10 levels YUV color space can be given as:
Yclass[] = {0,1,1,1,1,1,1,1,1,1,1}
Uclass[] = {0,0,0,0,0,0,0,0,1,1,1}
Vclass[] = {0,0,0,0,0,0,0,0,1,1,1}
Then to find if a pixel with color value of (1,8,9) is in the class “orange” all we need to do is evaluate the expression Yclass[1] AND Uclass[8] AND Vclass[9], which in this case would be true indicating that the color is in the class. By combining several bits together and taking advantage of parallel bit operations in the CPU, we can determine if a given pixel belongs to several classes at once.
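The idea of testing several classes at once can be illustrated with a small sketch in which each table entry is a bit mask with one bit per color class. The “orange” ranges follow the tables printed above (11 entries each, as printed); the second class, “blue”, and the helper `add_class` are hypothetical and exist only to show the combined test.

```python
ORANGE = 0b01
BLUE = 0b10  # hypothetical second class, for illustration only

LEVELS = 11  # the tables printed in the text have 11 entries
y_table = [0] * LEVELS
u_table = [0] * LEVELS
v_table = [0] * LEVELS


def add_class(bit, y_range, u_range, v_range):
    """Set the class bit for every level inside the inclusive ranges."""
    for table, (lo, hi) in zip((y_table, u_table, v_table),
                               (y_range, u_range, v_range)):
        for level in range(lo, hi + 1):
            table[level] |= bit


add_class(ORANGE, (1, 10), (8, 10), (8, 10))  # ranges from the text
add_class(BLUE, (1, 10), (0, 3), (6, 10))     # assumed ranges


def classify(y, u, v):
    """Return a bit mask of every class the pixel belongs to (0 if none)."""
    return y_table[y] & u_table[u] & v_table[v]


print(classify(1, 8, 9) == ORANGE)  # the pixel used in the text
print(classify(5, 2, 7) == BLUE)
print(classify(0, 8, 9))            # Y level 0 is outside both classes
```

With a 32-bit integer per table entry, the same two AND operations would test a pixel against up to 32 classes.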
After the color samples have been classified, connected regions are formed using run-length encoding (RLE). Then, using a tree-based union, the runs are merged to form the regions. The last step extracts region information, such as the bounding box, centroid, and size of each region, from the merged RLE map.
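The three stages (run-length encoding, tree-based merging, and region extraction) can be sketched as follows. This is a simplified illustration and not the CMVision code: it compares every run with every run in the previous row, whereas an optimized implementation would walk the two rows in a single pass. All function names are hypothetical.

```python
def row_runs(row, y):
    """Run-length encode one row of class labels (0 means background)."""
    runs, x = [], 0
    while x < len(row):
        start, label = x, row[x]
        while x < len(row) and row[x] == label:
            x += 1
        if label:
            runs.append((y, start, x - 1, label))
    return runs


def find(parent, i):
    """Return the root of run i, halving the path on the way up."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def extract_regions(image):
    """Return regions (label, size, centroid, box), largest first."""
    runs, bounds = [], []
    for y, row in enumerate(image):
        first = len(runs)
        runs.extend(row_runs(row, y))
        bounds.append((first, len(runs)))

    # Merge runs of the same label that overlap with a run in the row above.
    parent = list(range(len(runs)))
    for y in range(1, len(image)):
        for i in range(*bounds[y]):
            for j in range(*bounds[y - 1]):
                _, xs, xe, label = runs[i]
                _, pxs, pxe, plabel = runs[j]
                if label == plabel and xs <= pxe and pxs <= xe:
                    parent[find(parent, i)] = find(parent, j)

    # Accumulate size, coordinate sums, and bounding box per root.
    regions = {}
    for i, (y, xs, xe, label) in enumerate(runs):
        n = xe - xs + 1
        r = regions.setdefault(find(parent, i), {
            "label": label, "size": 0, "sx": 0.0, "sy": 0.0,
            "box": [xs, y, xe, y]})
        r["size"] += n
        r["sx"] += n * (xs + xe) / 2
        r["sy"] += n * y
        b = r["box"]
        r["box"] = [min(b[0], xs), min(b[1], y), max(b[2], xe), max(b[3], y)]
    return sorted(
        ({"label": r["label"], "size": r["size"],
          "centroid": (r["sx"] / r["size"], r["sy"] / r["size"]),
          "box": tuple(r["box"])} for r in regions.values()),
        key=lambda r: -r["size"])


# A tiny hypothetical label image: two regions of class 1, one of class 2.
image = [[1, 1, 0, 0, 2],
         [0, 1, 0, 0, 2],
         [0, 0, 0, 1, 0]]
for region in extract_regions(image):
    print(region)
```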
OpenCV, a software library created by Intel, was used to extract the image from the camera. This library made some image manipulations easy, from drawing boxes and circles on the visual frame to more complex operations such as ellipse recognition. However, it used RGB, whereas CMVision operated on YUV, so some modification to the software had to be made. Since the camera keeps its internal image in YUV, it should be possible to skip the image libraries and feed the camera output to CMVision directly. This should speed things up if necessary.
The map used contained two input nodes (for the x and y coordinates of the object), a Kohonen layer of 50x50 nodes, and two output nodes. The structure of the network followed the one described above. The network started with the Kohonen weights initialized to be between 0 and 320 in both the vertical and horizontal directions, in order to speed up learning of the input space. The scores of all nodes were initialized to 0 and the step sizes for both the x and y directions to -20. Instead of moving the object manually to train the eye, the eye made random movements and the network was used to try to foveate on the object. After the network had been trained for an extensive amount of time, it was tested by manually moving the object and checking whether the system could consistently keep the object centered.
The learning process followed these steps. Once the x,y position of the object was found using the CMVision software, it was fed into the network. A score was computed as the distance from the x,y position of the object to the center of the image, using the Euclidean distance formula.
score = \sqrt{(W/2 - O_x)^2 + (H/2 - O_y)^2}    (8)
Where W and H are the image width and height respectively, and Ox,Oy are the centroid of the object found.
Once the score was found it was compared against a predetermined precision. In this experiment 20 was found to be a sufficient value. This value allowed the object to be within 20 units from the center and still be considered centered. This is because under different lighting conditions, shadows, and reflections of light the image processing did not return the same x,y position of the object at the same location. When the score is greater than the precision, that means that the object is away from the center and the pan/tilt motors need to be adjusted to bring the object back to the center.
All movements in the pan and tilt directions were relative to the current servo position. This allowed the network to stay small, because each movement is just a correction value to bring the object back to the center. However, this relies on the movements being linear across the servo range. For example, a movement from 0 to 50 units in the servo would have to cause the same movement on the image as a movement from 50 to 100. This might not be the case, because the pan and tilt mechanisms might not be perpendicular to each other or the servos might not operate in a linear fashion. To remedy this, absolute movements would need to be used and the network size would need to be increased appropriately. However, this was not found to be a particular problem in this project. The one problem that relative movements did cause was that once the eye reached its maximum movement in any direction, it learned the wrong movement. Additional rules were needed so that the network was not made to learn when this happened.
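One possible form of such a rule is sketched below: the relative move is clamped to the servo range, and a flag tells the caller whether the full move was actually carried out, so that learning can be skipped otherwise. The servo limits and the function name are assumptions made for this example; the original rules are not described in more detail.

```python
SERVO_MIN, SERVO_MAX = 0, 255  # hypothetical servo command range


def relative_move(position, delta):
    """Apply a relative move; return (new_position, ok_to_learn)."""
    target = position + delta
    reached = max(SERVO_MIN, min(SERVO_MAX, target))
    # If the servo hit a limit, the observed image shift no longer matches
    # the commanded delta, so the network should not learn from this move.
    return reached, reached == target


print(relative_move(100, 50))  # (150, True)
print(relative_move(240, 50))  # (255, False)
```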
In order to get the appropriate motor adjustment to center the object, the network was consulted using the current x and y position of the object as the input. Once a winning node was established, its score was examined to determine whether it had sufficient training to perform the movement unconsciously (i.e. with no training or feedback). If the score was below the precision (20) and was not 0 (not untrained), the movement was performed using the output weights of the winning node. The score was then recomputed after the move to check that the network had not learned any bad behavior. If the recomputed score was not below the precision, the node's score was simply updated to the current score. This made the node learn the next time it was activated.
If the winner's score was zero or greater than the precision, further training needed to be performed. This was done simply by adding the individual step sizes to the output weights (see equation 4). The move was then tried with the new values. If the move was better, that is, the new score was less than both the winner's score and the previous score, then the network learned the new output weights as described above. If there was no improvement, the output weights were returned to their original values and the score of the winning node was increased. Increasing the score of the winning node after a bad move had the effect of punishing the network for making bad moves. This allowed it to escape the situation where the score is very small but no improvement is made (previous score less than current score). New step sizes for all dimensions were then chosen randomly, either -20 or +20.
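The decision logic of the last two paragraphs can be summarized in a short sketch. It is a simplified reading of the text, not the original C++ code: only the winning node is updated (the neighborhood update is omitted), the size of the penalty is an assumption, an untrained node (score 0) is assumed to accept any move that improves on the previous score, and the names `Node`, `train_step`, `move`, and `PENALTY` are hypothetical.

```python
import math
import random
from dataclasses import dataclass, field

PRECISION = 20.0
STEP = 20.0
PENALTY = 5.0  # assumed amount by which a bad move raises the score


@dataclass
class Node:
    out: list                      # output weights [pan, tilt]
    step: list = field(default_factory=lambda: [-STEP, -STEP])
    score: float = 0.0             # 0 means untrained


def score_of(pos, width=320, height=240):
    """Euclidean distance from the object position to the image center."""
    return math.hypot(width / 2 - pos[0], height / 2 - pos[1])


def train_step(node, pos, move, rng=random):
    """One step for the winning node; move(delta) returns the new position."""
    before = score_of(pos)
    if 0 < node.score < PRECISION:
        # Trained node: act without exploring, then check the result.
        new_pos = move(node.out)
        after = score_of(new_pos)
        if after >= PRECISION:
            node.score = after     # the node will learn on its next activation
        return new_pos
    # Untrained or imprecise node: try the output weights plus the step sizes.
    trial = [o + s for o, s in zip(node.out, node.step)]
    new_pos = move(trial)
    after = score_of(new_pos)
    if after < before and (node.score == 0 or after < node.score):
        node.out, node.score = trial, after    # keep the better weights
    else:
        node.score += PENALTY                  # punish; weights stay unchanged
        node.step = [rng.choice((-STEP, STEP)) for _ in node.step]
    return new_pos


# Toy plant (an assumption): a motor command shifts the image by the same amount.
state = {"pos": (200.0, 160.0)}

def move(delta):
    x, y = state["pos"]
    state["pos"] = (x - delta[0], y - delta[1])
    return state["pos"]

node = Node(out=[0.0, 0.0], step=[STEP, STEP])
print(train_step(node, state["pos"], move), node.out, round(node.score, 2))
```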
The eye was trained by placing the object somewhere in its visual field of view. The network was then allowed to take over and try to foveate on the object as described above. Once the object was centered, a random movement was generated for both the pan and tilt servos to move the eye away from the center. The network was then used again to foveate on the object. This process was repeated until the eye was sufficiently trained. Ideally, the scores of all the nodes should end up below the precision. This did not happen, but the network came close. Further work would need to be done to figure out how to get the network to converge. This process moved the eye instead of the object to train the eye. For better results, the object should have been mounted on a moving apparatus.
To evaluate the performance of the eye and the network, a few measurements were taken at each iteration. After each move, the average distance to the center was calculated and written to a log file. This measurement should ideally drop below twenty as the eye becomes better at moving its servos to foveate on the object. The average score was computed over all the nodes (50x50) to find out whether they improved. Ideally, the average should be at twenty or below, which would indicate that the network is learning. The number of nodes with a score below twenty and the number of nodes with a score of 0 were also recorded to determine whether the network was improving. Lastly, the object position and the number of movements it took the eye to move to the center were recorded. These measurements indicated, respectively, the coverage that the eye had been trained for and how well it was able to foveate on the object.
To evaluate the network more closely, different components of the network were graphed. The Kohonen weights (the weights attached to the input) were graphed to find the coverage that each node is responsible for. Ideally, this should be a two-dimensional linear mapping between the image space and the node space. However, since the eye is not positioned directly above the object but at an angle, the graphs should be a little skewed. In order to examine the output of the network, the input to output mapping in both dimensions was graphed, as well as the output weights drawn as vectors across the input space (the two-dimensional output weights of each node were treated as a vector starting at the x,y position given by that node's input weights).
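The per-network measurements listed above amount to a few counts over the grid of node scores. The sketch below assumes that the "below twenty" count excludes untrained nodes (score 0), which the text does not state explicitly; the function name and the sample grid are hypothetical.

```python
def network_metrics(scores, precision=20.0):
    """Summarize a 2-D grid of node scores."""
    flat = [s for row in scores for s in row]
    return {
        "average_score": sum(flat) / len(flat),
        # Assumption: untrained nodes (score 0) are counted separately.
        "below_precision": sum(1 for s in flat if 0 < s < precision),
        "untrained": sum(1 for s in flat if s == 0),
    }


# A hypothetical 2x2 grid instead of the full 50x50 network.
print(network_metrics([[0, 10], [25, 15]]))
```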
Figure 6: The average distance to center after a given move.
After training for 10.5 hours (32216 iterations), the eye performed very well, as can be seen from figures 5 and 6. The average score of all the nodes was decreasing, which indicated that the network was learning. Furthermore, the average distance to center was decreasing, indicating that the eye had learned to foveate on the object. The network learned the most during the first 5000 iterations. After that it learned much more slowly, but it still continued to improve. The system was shut down and restarted the following day. After training for about 10000 more steps, the improvement tapered off, as can be seen in figures 7 and 8. Subsequent iterations did not seem to improve the system. The average score of all the nodes settled around 22.5, and the eye needed on average 2 steps to foveate on the object.
Figure 7: Training for an additional 9 hours (27796 iterations). Looking at the average distance to center, the network seemed to stop improving after 10000 steps.
Figure 8: Training for an additional 9 hours (27796 iterations). Looking at the number of steps to center (in red) and the average score across all nodes.
Examining the network revealed that the output vectors were pointing in different directions in the four quadrants, as expected (figure 9). The mappings between the input and output in both directions were also fairly linear. The input weights of the network indicated that the network was also generalizing quite well. That is, the weights were evenly spaced over the whole input space (which is the image width and height).
To further test the system, the eye was moved to a different location from where it was trained and tested by trying to follow a yellow ball. The eye was able to foveate on the ball relatively fast (in fewer than two steps at full servo speed). To test the eye's ability to track moving objects, the ball was kept moving in a circular fashion. The eye was able to keep the ball fairly centered. Furthermore, as a consequence of the movement of the ball, the eye started adjusting its output weights to get better at keeping the ball centered.
Figure 9: The output weights are plotted as vectors with an input weights origin. It can be seen that the magnitude increases (greater motor signals) as the object is away from the center.
The experiment above showed that using SOFM as a basis for intelligent machines is a possible solution. After training, the eye was able to foveate on an object at any time and from any position without calibration. Furthermore, the eye was able to determine its own kinematics to successfully foveate on the object. The eye also showed possible usage in tracking moving objects, although more work would need to be done in this area. In the past, a formula would have had to be determined experimentally to do the mapping between the input space and the output space. This was prone to errors because the servos controlling the eye would have had to be calibrated, and any deviation from the calibration would have resulted in bad performance. Furthermore, any inconsistencies in the image or in the eye's servos (either from the start or developing over time) could only have been accounted for by manually adding exceptions to the formula or by recreating the formula.
The Self Organizing Feature Map has proven to be a useful tool in fixing the problems outlined above, by letting the network, in a sense, determine the mapping between input and output experimentally. However, further work still needs to be done on the learning aspect of the system. The eye's learning was seen to taper off. Ideally, the system would have foveated on the object within one step at full servo speed. The eye did come close to achieving this goal.
The learning might have tapered off because of the network size. The network was chosen to be 50x50 nodes to map the complete input space of 320x240 pixels. This means that each node was responsible for an area of about 30 pixels. As a result, any position between these nodes (assuming the nodes covered the image area evenly) would have had to be generalized, which placed the eye only in the vicinity of the center. Increasing the network size could improve performance, since each node would have to generalize less. However, this would come with two problems. The first is that any network beyond 100x100 nodes requires greater computing power (100x100 nodes took several seconds to compute on a 600 MHz machine), although the network could potentially be processed in parallel, which would speed up the computations enormously. The second problem would be with learning to generalize. Each node would need to be updated very carefully in order to stop it from “bleeding” wrong information to its neighbors.
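The coverage figure follows from simple arithmetic, shown here as a small sketch that assumes the nodes are spread evenly over the image. The function name is hypothetical.

```python
def pixels_per_node(width, height, rows, cols):
    """Return (area in pixels, side length in pixels) covered by one node."""
    area = width * height / (rows * cols)
    return area, area ** 0.5


for size in (50, 100):
    area, side = pixels_per_node(320, 240, size, size)
    print(f"{size}x{size} nodes: {area:.2f} pixels per node, "
          f"about {side:.1f} pixels on a side")
```

For the 50x50 network this gives 30.72 pixels per node, i.e. a square roughly 5.5 pixels on a side, while a 100x100 network would bring this down to 7.68 pixels per node.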
To further test the idea of using Self Organizing Feature Maps for developing intelligent machines, a more complex system involving a robotic arm and the eye was used. The goal of the system was now to foveate on an object and attempt to point at it. The system operated by first foveating on the object and then moving the tip of the finger above the object. However, before the system was able to achieve this, it was first trained by moving the tip of the finger to the center of the image.
Pointing at an object requires coordination between the eye and the arm. This often requires a great number of equations to model the kinematics of the arm and the eye, and previous attempts required a great deal of resources to find these equations. Furthermore, in a complex system such as this, there are many more anomalies that need to be taken care of. For example, when using encoders to find the arm position, the small errors in the servos accumulate over time and produce unexpected results. This often comes with the cost of calibrating the system, which requires time and is never perfect. Using SOFM to control a robotic arm that points at an object using vision should help alleviate some of these problems. Again, by using a system that can generalize, we should be able to have the system achieve its goal without learning all possible rules.
The arm used in this experiment is a 6 DOF arm developed by Lynxmotion. The structure of the arm is fairly simple, using the same kind of RC servos as were used for the eye. The same computer system used for the eye was also used for the arm. The camera image was sent to the computer via USB, and the eye's servos were driven by the same IsoPod microcontroller as the arm. The IsoPod contained the code to manipulate the servos and was controlled over a serial connection at 9600 bps. The fingers and wrist were not used in these experiments and were not controlled. However, further research could be done on controlling redundant degrees of freedom as per [Marinetz, Ritter and Schulten 1990].
It was found that if the arm moved too fast it would shake when stopped, which would cause the finger to be all over the image, returning inconsistent results. To fix this, the code in the IsoPod (the microcontroller for the servos) had to be modified to incorporate acceleration and deceleration to produce smoother movements. This made the arm come to a silent stop after a move with no vibrations.
Being a cheap and simple mechanism, the arm is more prone to errors. For instance, there is so much friction at the base that it takes more force than the rest of the joints to move. Furthermore, this move is fairly inconsistent. These problems make it a good test bed for implementing and evaluating the system. Ideally, with the current training algorithm the system should be able to compensate for all these problems without prior calibrations or knowledge of the arm’s kinematics.
Figure 10: The arm
To limit the number of dimensions required to move the arm, postural primitives were implemented. Experiments with frogs and rats have revealed that an underlying system of control is used to combine movements in their limbs. These control mechanisms reduce the number of signals and motor commands that the brain needs to be aware of. Bizzi and others [Bizzi, Giszter, and Mussa-Ivaldi 1991] performed experiments with frogs and rats in which they electrically stimulated the spinal cord and measured the force field in leg-motion space. They found that all of the force fields were convergent with a single equilibrium point. Furthermore, they found that only 4 general fields existed for different postures of the leg in space. When combined, these four primitives were able to reach the whole leg space. In addition, it was also found that children begin a reach from a rest position in front of their bodies. If they were not able to reach the target after a movement, they would bring the arm back to the rest position [Diamound, 1990]. These postural primitives have been applied successfully in robotics before [Williamson, 1996].
A simplified version of the postural primitive was used for the arm in this project. Since the arm could point to an object without moving its wrist and fingers, these servos were eliminated from the control system. Furthermore, since the system only had one eye, it did not have depth perception. This means that the arm could not tell whether it was running into the ground. In order to fix this, a single motion primitive was implemented that combined the shoulder and the elbow. To derive it, the arm was moved by manually inputting commands for the shoulder and elbow in such a way that the fingers were always about 10 cm off the ground (figure 11).
Figure 11: The arm and the postural primitive. The finger would move toward and away from the base approximately 10 cm off the ground using the postural primitive function.
To find the angles required for this, samples of the elbow and shoulder angles were collected at regular intervals and graphed. A polynomial trendline of degree 2 was used to approximate the movement, as shown in equation (9).
(9)
Where E is the elbow position and S is the shoulder position. This provided the system with a single dimension to manipulate in order to move the arm in space. The base was used as the second dimension, so that the arm moved its fingers on a plane 10 cm above the ground. Thus, the system would only need to learn the mapping between the finger's position in the image and these two motor dimensions. Note that the arm movement is very similar to moving in polar coordinates.
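The trendline step can be illustrated with a plain least-squares fit of a degree 2 polynomial. Everything specific in this sketch is an assumption: the sample values are invented, the elbow is treated as a function of the shoulder, and the names `fit_quadratic` and `make_primitive` are hypothetical. The coefficients actually used in the project are not reproduced here.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x**2 + b*x + c; returns (a, b, c)."""
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Normal equations of the least-squares problem.
    m = [[s[4], s[3], s[2]],
         [s[3], s[2], s[1]],
         [s[2], s[1], s[0]]]
    rhs = [t[2], t[1], t[0]]
    d = det3(m)
    coeffs = []
    for col in range(3):            # Cramer's rule, one column at a time
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = rhs[r]
        coeffs.append(det3(mc) / d)
    return tuple(coeffs)


def make_primitive(coeffs):
    """Return a function mapping one shoulder command to (shoulder, elbow)."""
    a, b, c = coeffs
    def primitive(shoulder):
        return shoulder, a * shoulder ** 2 + b * shoulder + c
    return primitive


# Invented samples: shoulder and elbow positions that kept the finger level.
shoulder_samples = [0, 10, 20, 30, 40]
elbow_samples = [20, 26, 34, 44, 56]
primitive = make_primitive(fit_quadratic(shoulder_samples, elbow_samples))
print(primitive(25))
```

The point of the primitive is that the network then controls a single number for the shoulder and elbow together, instead of two independent joint commands.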
To find the finger position in the image field, the finger was marked with a red marker. The coordinates of the finger were found using the same CMVision software that was used for the eye. Once found, the coordinates were fed into the network, which produced the motor output required to place the finger a little off center. The system was allowed to train by moving the eye at random (while keeping the finger in the field of view), after which the arm tried to move its finger to 10 units above center.
A new SOFM was created to control the arm. A 50x50 node SOFM with two input and two output units was used to learn the mapping between the image input space and the motor output space. The input to the network was the x and y position of the finger in the image. One of the outputs from the network controlled the movement of the base, which rotated the arm from side to side. The second output controlled the primitive function, which moved the shoulder and elbow in a harmonious fashion to bring the arm in and out. The network had the same parameters as indicated by equations 6 and 7. The network started with the Kohonen weights initialized to be between 0 and 320 in both the vertical and horizontal directions.
The goal of the system was the same as for the eye: to minimize the score. The score was now calculated as:
(10)
Where W and H are the width and height of the image, and Fx, Fy are the position of the finger in the image. It was found that because the eye was tilted at an angle to the object, when the finger was above the object it was about 10 pixel units above the center. This had to be incorporated in the score, as the goal of the system was now shifted 10 pixel units above the center.
However, the training had to be slightly modified from the training algorithm used for the eye. The visual feedback used to determine the score came from only one camera, so the system suffered from a lack of depth perception. As a consequence, using random movements to check whether improvements were made resulted in errors. To the system it looked like the finger was moving in the wrong direction when in fact it was moving in the right direction. The reason for this was that even if the finger was moving toward the object, the score increased for the first few steps. This confused the system and made it try a different move (because the score was increasing, and the aim is to minimize the score). To fix the problem, the step sizes were calculated using the following formulas:
(11)
(12)
if
(13) if
(14)
One of the consequences of doing this is that the system becomes a little less adaptable. For instance, if the eye were flipped, the formulas would need to be changed. A better solution would have been to integrate the step sizes over subsequent error movements. For example, if the system controlled only one movement and tried to move in both directions and the score still increased, the system would try greater movements. A greater movement should free the system from the local minimum. This algorithm was tried, but resulted in the system learning very slowly. For the sake of time, the equations above were used.
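The alternative of integrating the step sizes is only described in words, so the following is one possible reading of it for a single movement dimension, not the algorithm that was actually tried: the step alternates direction after each failure, and once both directions have failed at the current magnitude, the magnitude grows by one base step. The class name and the growth rule are assumptions.

```python
class GrowingStep:
    """Step size for one dimension that grows after repeated failures."""

    def __init__(self, base=20.0):
        self.base = base
        self.magnitude = base
        self.sign = -1
        self.failures = 0

    def current(self):
        return self.sign * self.magnitude

    def report(self, improved):
        """Update the step after a move; improved means the score decreased."""
        if improved:
            self.magnitude = self.base   # back to fine steps
            self.failures = 0
            return
        self.failures += 1
        self.sign = -self.sign           # try the other direction next
        if self.failures % 2 == 0:       # both directions failed: move further
            self.magnitude += self.base


step = GrowingStep()
tried = []
for _ in range(5):                       # five failed moves in a row
    tried.append(step.current())
    step.report(improved=False)
print(tried)                             # [-20.0, 20.0, -40.0, 40.0, -60.0]
```

Because the magnitude only grows after both directions have been tried, every increase costs two moves, which is one way such a scheme could make learning slow.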
Training the system now involved moving the eye to a random location while keeping the finger in the image. The system was then trained to move the finger so as to decrease the score, which was calculated using equation (10). If the arm was lost from the image, the arm returned to the rest position and started again. Once the system was adequately trained, it switched over to live mode. In live mode an object was presented to the system and the eye was used to foveate on the object. This was done using the first SOFM, which was trained for the eye. The second SOFM, for the arm, was then used to try to get the finger above the object.
Figure 12: The average distance to center of the finger after a move. Ideally this value should be below 20.
Figure 13: The average score of all the nodes. This should be below 20 as well.
After about 8.7 hours of training (15279 iterations), the arm had improved at achieving its goal. The system was evaluated using the same procedures outlined for the eye. Figures 12 and 13 show that both the average distance to center and the average score decreased. Furthermore, the network's input and output weight space showed that the network took on a shape that encompassed the arm envelope (figure 17). The distance to center seemed to settle around 40 pixel units, which was farther than the goal. This meant that the arm would require multiple moves to reach the goal, and that the arm was not fully trained yet.
Similar to the eye, the network did most of its learning early, in the first 6000 iterations. This is because the neighborhood function is initially large, so many nodes benefit from each training step. As the network becomes more specialized, the neighborhood function shrinks, which causes fewer nodes to learn. Therefore, the optimal time for learning is in the first part of training. Ideally, the network should learn the rough transformation in the beginning and then narrow the learning and improve upon itself.
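The effect of a shrinking neighborhood can be made concrete with a small sketch. It uses a common textbook form (a Gaussian neighborhood whose radius decays exponentially), which is an assumption and not necessarily the form of equations 6 and 7; the start radius, end radius, decay length, and cutoff are all hypothetical values.

```python
import math


def radius(t, sigma_start=25.0, sigma_end=2.0, t_end=5000):
    """Neighborhood radius after t iterations (assumed exponential decay)."""
    frac = min(t, t_end) / t_end
    return sigma_start * (sigma_end / sigma_start) ** frac


def neighborhood(distance, t):
    """Gaussian weight of a node at the given grid distance from the winner."""
    return math.exp(-distance ** 2 / (2 * radius(t) ** 2))


def nodes_reached(t, size=50, winner=(25, 25), cutoff=0.1):
    """Count nodes whose neighborhood weight is at least the cutoff."""
    wr, wc = winner
    return sum(1 for r in range(size) for c in range(size)
               if neighborhood(math.hypot(r - wr, c - wc), t) >= cutoff)


for t in (0, 1000, 5000):
    print(t, round(radius(t), 2), nodes_reached(t))
```

Under these assumed values, a single training step early on touches a large share of the 50x50 grid, while late in training it touches only a handful of nodes around the winner.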
Figure 14: Additional training after 2 servos were replaced. The distance to center started increasing because of the new change, however it then started to improve itself.
Figure 15: The average score (green) and the average steps to the center (red).
In the process of training, two of the shoulder servos burned out and were replaced with a different set of servos. These servos had different torque values and speeds, which caused the shoulder to behave differently than before. After replacing the servos, the primitive function controlling the arm had to be changed to account for them. When the system was brought back online, the average distance to center increased during the first 4000 iterations, as can be seen in figures 14 and 15. After that, however, the network started improving again, adapting to the changes it had encountered.
Further training did not seem to improve the arm (figure 16). The score dropped to about 30 in the next 2000 iterations; however, it then rose again and settled around 35. This seemed to mirror the results found when training the eye, where the learning would taper off. However, the change was much more dramatic in the arm.
Figure 16: Additional training did not improve the system. The average distance to center remained at about 35.
Figure 17: The output weights are plotted as vectors with an input weights origin. It can be seen that the magnitude increases (greater motor signals) as the object is away from the center. The network structure followed the dynamics of the arm.
Training the arm took significantly more time than training the eye. This is because the arm had to move much more slowly (at about half the speed of the eye) in order to keep it from shaking. Therefore, the arm needed about twice as much time as the eye for the same number of training moves.
Furthermore, the arm's finger did not move in a linear fashion against the image. However, the map was able to account for this by mapping this behavior. Most of this is caused by the way the finger moves in the image field as a result of the configuration of the arm. The arm was mounted on a rotating base, from which it was able to move its finger toward the base and away from the base. Another distortion of the finger's motion against the image was due to the fact that a single camera was used, which caused a lack of depth perception. This made the finger seem to travel in the wrong direction at times.
Another problem was that the arm was very flimsy. This produced inconsistent results in the moves. That is, under different situations the same position given to a limb would cause a different move each time. This was the case when a rotate base move was given or when the servos were under a lot of load while the arm was reaching. The network tried to compensate for this, but was not fully successful. One of the reasons is probably that relative movements were used, as for the eye. However, this caused many more problems than it did for the eye, because the system was now more complex and more anomalies in movements were present.
One way to fix this would be to increase the size of the network and change from relative movement to absolute movement. This should give the network plenty of ways to find the exceptions when the anomalies happen and account for them. However, the system would need some way of knowing about the context of the arm. For example, the system would need to know that with the arm in one position a given shoulder command moves the arm one way, while in another arm position the same command moves it a different way. The input to the network would probably need to be expanded to include the position of the arm as well as the image position. A 3 DOF arm would then give a 5-dimensional input vector, which could slow down the system.
The arm also showed robustness to changes in the environment. When the two shoulder servos were burnt out and replaced with different servos, the arm was able to compensate for it. It did this within 4000 iterations and without the need for any recalibrations or change in the system/software. Furthermore, the network did not need to relearn everything, just the change in the arm. However, if the neighborhood were increased slightly it probably would have been able to compensate for the change faster. After about 5000 iterations from the start, only about 2 nodes radius around the winning node get updated. When a big change happens, increasing the neighborhood size could result in faster learning, because the change can reach a greater amount of nodes at a faster rate.
The change made to the algorithm, from using random movements in the step sizes to fixed equations, could also be revisited. As noted above, this change could make the system less adaptive in certain situations (such as flipping the camera). Although these situations might not happen, they should still be accounted for, because the whole goal of the system was to be adaptive under any situation. The proposed change of integrating the step sizes could be studied further to see whether it is a feasible solution.
Another improvement to the arm could have been made in the random movements created by the eye. During training, the eye generated a random movement to simulate foveating on an object. However, since the neighborhood function starts by affecting a large number of nodes, little generalization took place when the movements were small. These random movements should probably have been large at first, so that they covered the whole input space. Their step size should then have been refined gradually, in a manner similar to equations (6) and (7). For example, the movements at the beginning of training should have been between 0 and 400 in steps of 100, and the step size should then have dropped either linearly or exponentially with the number of iterations. This should result in faster learning because, ideally, the network should first generalize over the whole input space and then slowly zero in on the details.
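The proposed schedule for the random movements can be sketched as follows. The range of 0 to 400 and the initial step of 100 come from the example above; the final step of 1, the total of 5000 iterations, and the names step_size and random_movement are assumptions made only for illustration.

```python
import random

def step_size(iteration, total_iters, start=100.0, end=1.0,
              mode="exponential"):
    """Step between allowed movements, dropping from start to end."""
    t = min(max(iteration / total_iters, 0.0), 1.0)
    if mode == "linear":
        return start + (end - start) * t
    return start * (end / start) ** t

def random_movement(iteration, total_iters, max_move=400, rng=random):
    """Draw a movement from 0..max_move on a grid of the current step."""
    step = max(1, round(step_size(iteration, total_iters)))
    return rng.randrange(0, max_move + 1, step)

rng = random.Random(0)
# Early in training only coarse movements (0, 100, ..., 400) are drawn.
early = [random_movement(0, 5000, rng=rng) for _ in range(5)]
# Late in training any integer movement between 0 and 400 can be drawn.
late = [random_movement(5000, 5000, rng=rng) for _ in range(5)]
```

Early samples spread the training over the whole input space while the neighborhood is still large, and later samples fill in the detail once only a few nodes around the winner are updated.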
Furthermore, another SOFM could have been used to map between the eye position and the arm position. During training, while the arm learned its mapping between finger coordinates and arm position, a second mapping could have been learned: the mapping between the eye position and the arm position. This mapping (if trained correctly) would be able to move the finger without looking at it. Therefore, if the eye is looking at an object but the finger is not in the field of view, the arm would still be able to move its finger toward the object, or at least bring the finger into the field of view. This network does not need to be large, because only a rough transformation is needed. The result would be one network that brings the finger into the vicinity of the object and another that provides the precision needed to point at the object.
This project investigated two systems to help show that creating intelligent machines using SOFM is feasible. Both systems showed the ability to learn and to adapt to changes. However, at the time of this writing, the arm had not reached its full potential. The arm did show promising results by improving its attempts to reach the object, reaching it within a few steps. However, the arm did not achieve a one-step reach. That is, after the object was presented, the arm should have moved and pointed to the object immediately. This mainly had to do with the dynamics of the arm. The arm was very inconsistent in its movement, which made the system very intolerant to changes. However, the fact that the arm did show improvement suggests that the system was trying to learn. Further investigation is needed into why the arm stopped learning. Perhaps creating a larger network to account for all the anomalies in the arm would provide better results.