9 Neural Networks

Bhushan Trivedi

epgp books

 

Introduction

 

Like GA, neural networks have grown into a separate discipline today, but they are still considered an important component of AI problem solving. In this and the next two modules we will throw some light on what neural networks are and how they help us solve some important AI problems.

 

The way we have looked at problems so far is very structured. We start by building a state space, learn about the domain and generate heuristics, and carefully move around in the state space from the initial state to the final state using all possible help we can get. This structured approach is well suited for domains which are well studied, especially gaming and expert systems. We are well aware of the rules of playing chess, of diagnosing a disease, of finding a fault in a car engine, and the like. We can easily design a state space, write rules, and find heuristics for such domains, and implement them. Many such programs have shown the capability of programmers to code such systems. When humans have learned how to solve a problem properly, it is easier for programmers to use that domain knowledge to solve it.

 

Unfortunately, not all systems work the same way. Many AI problems look trivial on the face of it but are actually much harder, because it is not possible to represent either the problem or the human solution in a structured form. Consider recognizing a face or a signature, or shortlisting resumes for a given post. If you ask a human how he recognized somebody after many years, he may not be able to answer. In fact, humans are unaware of the method they use to store information about the faces they see every day. They are also unaware of how the brain searches for a new face and finds a match (recognizes a face) so fast. One can easily understand that the match is not straightforward; a face does not look the same after some period. Sometimes humans can even figure out that somebody is the son or daughter of somebody else when seeing the person for the first time. Some time back there was a TV program (called Chehre Pe Chehra) which used to show a face combined from two known celebrities, and the audience was asked to find out who those two celebrities were. Most humans could do that easily, yet if you ask any successful participant how he did it, he would be unable to answer. Signature recognition is a similar problem.

 

Experts who recruit by looking at resumes to find out who is best suited for a job need to search through a jungle of information to get the right candidate. This problem is comparatively more structured than the previous ones, but the solution still has no clear-cut algorithm. Resume shortlisting may be much more structured and humans may find some algorithm for it, but algorithms to recognize a face or a fingerprint are hard to find.

 

So, the state space and rules and all that cannot work here. We do not even have any heuristics for a solution, let alone an algorithm or a state space. When the best of our computers and the best of our programs cannot do it, while a seemingly dumb person's mind can do it easily, we may ask ourselves: how is our brain doing it, then? Can we mimic the way the brain works to solve these problems? A few researchers actually did this in the past and came out with programs which could mimic the brain in solving many such problems (signature and face recognition, thumb impression recognition, biometric access control) in a satisfactory way. Many researchers are still working on similar problems.

 

So, the best place to start is learning how our brain works and how it is different from a conventional CPU.

 

The brain and the CPU work differently

 

The answer lies in learning how the CPU and the brain differ in their functioning. People have done extensive research on the functioning of the brain. Here are some known differences between the brain and the CPU.

  1. The basic building block of the brain is a very slow, tiny element called a neuron.
  2. The processing speed of a neuron is in the range of milliseconds, compared with current processors whose speed is in the nanosecond range.
  3. A neuron's 'memory' is quite small. It can remember only a few items, unlike the memory associated with current processors.
  4. Neurons are huge in number. About 10^10 neurons are part of a normal brain.
  5. Neurons are connected to each other dynamically, and the number of connections is also huge. On average, every neuron has about 1000 dynamic and adaptive connections to other neurons, so the total number of connections is well beyond 10^13. Thus the brain's neural network is highly connected.
  6. Not only is the neural network highly connected, it is highly dynamic as well. Connections come and go, neurons keep dying, and the weights associated with the connections change all the time.
  7. The brain's neural network also works in a distributed and parallel fashion. Each neuron takes decisions on its own without consulting any central authority.
  8. Storage in neurons is done in a distributed and fault-tolerant way. An atomic piece of information is not kept in a single neuron but stored across neurons, so that even if some neurons die, the information is not lost.
  9. In fact, human neurons (and thus the brain) are capable of storing fuzzy and incomplete information for solving problems. For example, many of us, visiting some place after many years, can still find the path with incomplete information. We store faces, signatures, and the like in such a fuzzy form that they can match nearby information, which helps us recognize somebody as the son of somebody else, and so on.
  10. Processors, however fast they are, are poor at collaboration and synchronization. Many computer systems with high-speed processors are available, and a computer with a huge number of processors can easily be built, but it is difficult to seamlessly synchronize those processors for work. It is also hard to write algorithms which take full advantage of highly parallel processing capabilities.
  11. The memory associated with processors is localized. If a particular part of memory is corrupted, the data is gone. Extensive measures (such as redundant storage) are taken to make sure data is not lost in conventional processor memory.

 

These differences show why the brain can solve problems which are harder for processors; in fact, the brain is customized to solve them. One study revealed that the neurons connected to the human eye require only about 100 steps to recognize an image and take an action, something a conventional processor cannot do even in 10 million steps. That is the power of working in a highly parallel mode with the ability to deal with fuzzy information.

 

The artificial neural networks (ANNs)

 

Inspired by the ability of the brain to do amazing things which conventional processors are not capable of, researchers started studying how the brain can be modeled in a computing system. Many researchers worked on the problem and came out with many proposals. Most models used a single computing unit (which they called an artificial neuron) and combined such units in various ways to mimic different functions of the brain.

 

One model was based on the design of the cortex of the brain. A single layer of neurons was organized like the brain cortex and used to solve problems like recognizing characters and syllables. That model is popularly known as the self-organizing map. Another model used a highly connected but irregular pattern of connections, with weights assigned by humans. That one is popularly known as the Hopfield network and is used for solving problems which require associative searching. Associative search is quite common in humans. We do not usually ask "what is the customer number of person XYZ?" or "give me the person with phone number xxxxx". Our queries are more like "give me the name of the last movie of Rithik Roshan and Katarina Kaif where Rithik pretends to have stolen the Kohinoor". Such searches require us to explore associations between the entities mentioned in the queries and are quite hard for conventional computing algorithms. Yet another model uses a structured layering approach to arrange neurons. The neurons here have weights associated with them; these weights, which reflect the learning of those neurons, are learned rather than provided by humans. This model is known as the multi-layer perceptron (a perceptron being a kind of artificial neuron) model, and many variants of it are known, the most popular being BPNN, the Back Propagation Neural Network. These models are good at solving classification problems like signature or face recognition.

 

The Neuron and the ANN

 

The brain neuron looks like the one shown in figure 9.1. It has three different components: the central part of the cell (the soma, containing the nucleus), dendrites for taking signals from other neurons, and an axon to pass signals to other neurons. The dendrites represent incoming connections while the axon represents outgoing connections of the neuron. Each connection has some weight, and the central part of the neuron is responsible for processing the combined incoming signal. Depending on the result of the processing, it decides whether to send a signal down the axon. This process is quite complex, involving chemical and electrical signals. All incoming signals are combined and processed to determine the output. The brain is trained correctly when a neuron learns to fire when correct inputs are provided and not to fire when incorrect inputs are provided. Using multiple neurons, the brain can learn about complex things with multiple inputs and multiple outputs.

 

For example, when the character 'A' is presented to a child and the signal generated by her eyes passes to a neuron which ignites the next neuron responsible for storing 'A', the child's brain has learned to correctly identify that character. On the contrary, if the child's brain incorrectly fires the neuron responsible for recognizing 'A' when 'B' is presented, the neuron has positively misclassified the input. Also, if 'A' is presented and the neuron responsible for recognizing 'A' fails to fire, the neuron has negatively misclassified the input. Positive misclassification occurs when the neuron fires (judges) that the character is X, but it actually is not X. Negative misclassification occurs when the neuron does not fire even though the character presented is X, believing it to be other than X.

 

Thus there are three possibilities: the neuron has learned to classify the input correctly, has positively misclassified it, or has negatively misclassified it. We will soon see how these input classifications help us make artificial neurons learn.

The artificial neuron is modeled after the brain neuron. From now on we will not distinguish between the words artificial neuron and brain neuron; which one is meant can be understood from the context.

 

The artificial neuron looks like figure 9.2. The inputs are named X1 to Xn; they work like dendrites. The output works like the axon. The processing is done by a function (we call it S or sigma) which sums all incoming values, mimicking the nucleus. Each incoming value is the multiplication of an input with its weight. There is a threshold value called Ф: if the summation is bigger than Ф, the output is 1, otherwise 0. Mathematically the summation is written as follows

 

∑ Xiwi (i = 1 to n), where Xi indicates the ith input and wi indicates the ith weight. The value i ranges from 1 to n, where n is the number of inputs.

 

And the condition is written as follows:

 

if ∑ Xiwi (i = 1 to n) > Ф then the output is 1, otherwise 0

 

Sometimes the testing part is indicated as σ and the threshold input as b, and the processing happens as depicted in figure 9.2. The output yi is 1 or 0 depending on σ.

 

Thus, if for input values x1 to xn the correct answer is yj and the actual output is also yj, the network has learned to classify that input correctly. Otherwise the input is either negatively or positively misclassified.
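The threshold unit just described can be sketched in a few lines of code. This is a minimal illustration, not from the text; the function and variable names are our own, and the AND-like example with Ф = 1.5 is an assumption for demonstration.

```python
def neuron_output(xs, ws, phi):
    """Return 1 if the weighted sum of inputs exceeds the threshold phi, else 0."""
    s = sum(x * w for x, w in zip(xs, ws))
    return 1 if s > phi else 0

# Example: with equal weights and phi = 1.5, the unit fires only when
# both binary inputs are 1 (behaving like a logical AND).
print(neuron_output([1, 1], [1.0, 1.0], 1.5))  # both inputs on -> 1
print(neuron_output([1, 0], [1.0, 1.0], 1.5))  # one input on  -> 0
```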

 

The network is said to have learned correctly if all these inputs are classified correctly. How can that be achieved?

 

We must get the right set of weights. (The inputs are not going to change; they remain the same throughout. For example, if we took 20 samples of a customer's signatures, those samples are not going to change.) The weights are in our hands, so they can change. Proper learning is achieved if we can find a set of weights such that, using those weights, the network responds correctly for all inputs; not only the training inputs but the testing inputs as well. Training inputs are used to help the network learn, while testing inputs are used to see whether the network has learned properly.

 

To get the right set of weights, we need to take each input, calculate the output and based on that, change the weights such that for that input in future the output comes out correctly. Let us see how the weights are changed.

 

The output is based on the summation

 

x1w1k + x2w2k + x3w3k…. + xnwnk

 

We want this summation to be > Φ for a correct input. Look at the subscript k, which does not change and has no effect on the summation. One may ask: why complicate things by adding the extra k? The reason is that such a unit may appear at any place, usually in a typical layer, and k indicates the number of that layer. Thus wik is the ith weight of the kth layer, and bk represents the threshold value for the kth layer. Here we have only one layer of weights; when there are multiple layers, each with its own weights, this additional subscript tells the reader which layer a weight belongs to.

 

Now, for a negatively misclassified input, the summation, which should be more than Φ, is actually less than that. We must increase the summation. We cannot change the input values, only the weights. We must increase the weights so that their products with the inputs add up beyond Φ. There are two types of inputs, 0 and 1. Weights multiplied with 0 do not add to the summation; only weights multiplied with 1 do. Thus, for a negatively misclassified input, we must increase the weights attached to the nonzero inputs.

 

What about a positively misclassified input? The summation is more than Φ where it shouldn't be. We must reduce the weights wi associated with all xi whose value is one.

 

Thus for each input, we must see whether it is correctly identified. If it is positively misclassified, we reduce the weights wi for all xi = 1; if it is negatively misclassified, we increase the weights wi for all xi = 1.
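The rule just described — raise the weights of the active (xi = 1) inputs for a negatively misclassified input, lower them for a positively misclassified one — can be sketched as follows. The step size `delta` is an assumption for illustration; the text does not prescribe a particular value.

```python
def update_weights(xs, ws, target, output, delta=0.1):
    """Adjust weights for one binary input vector, per the rule in the text."""
    if output == target:
        return ws  # correctly classified: leave the weights as they are
    if target == 1 and output == 0:
        # negatively misclassified: raise weights of active inputs
        return [w + delta if x == 1 else w for x, w in zip(xs, ws)]
    # positively misclassified: lower weights of active inputs
    return [w - delta if x == 1 else w for x, w in zip(xs, ws)]
```

Note that a weight whose input is 0 is never touched, since it contributes nothing to the summation either way.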

 

Once we change weights for each input-output pair, one epoch is said to be completed. Normally a few thousand epochs are required for solving a typical AI problem. Sometimes the network cannot learn, so we need to restart the program. Sometimes it takes more time and sometimes less, because the initial weight set is chosen randomly. If those random values are near the solution it takes less time, otherwise more. The weights form a weight vector in n dimensions, and the solution is also a weight vector in the same n-dimensional space. The learning process moves from the random starting point to the solution point in that space. The farther the start and the slower we move, the more time it takes; the nearer it is and the faster we move, the less time it takes.

Sometimes −Ф is considered w0 and the equation is rewritten by including the multiplication of w0 with an extra input x0 = 1. We then need to check only whether the summation is greater than zero:

x0wk0 + x1wk1 + x2wk2 + x3wk3 + …. + xnwkn > 0

And the condition is written as follows

 

if ∑ xiwki (i = 0 to n) > 0 then the output is 1, otherwise 0, where x0 is 1 and wk0 is −Ф (written −bk when the threshold is denoted bk).1
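This "bias trick" of folding the threshold into the weight vector can be sketched as follows (an illustrative sketch with our own names, not code from the text):

```python
def output_with_bias(xs, ws, phi):
    """Compute the threshold unit's output using an extra input x0 = 1
    whose weight is -phi, so the test reduces to 'summation > 0'."""
    xs_aug = [1] + list(xs)        # x0 = 1
    ws_aug = [-phi] + list(ws)     # w0 = -phi
    s = sum(x * w for x, w in zip(xs_aug, ws_aug))
    return 1 if s > 0 else 0
```

The advantage is that the threshold is now learned exactly like any other weight, instead of being a separate constant.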

 

1 The literature is not consistent with this representation. Xij might indicate the ith input of the jth layer or the jth input of the kth layer. We use both here.

Sometimes σ outputs a real value between 0 and 1 rather than the discrete values of the previous case, and some function other than the plain summation-and-threshold is applied, so the process becomes a little different from what we have seen so far. The function which produces the output from the summation of input values multiplied by weights is known as the activation function. Figure 9.3 depicts this case.

 

One activation function popularly used is known as sigmoid and defined as follows.

 

Y = 1 / (1 + e^(−sum)), where the sum is the summation of the inputs multiplied by the weights. In other words the value of Y is

Y = 1 / (1 + e^(−∑ xiwi))

 

The activation functions, square and sigmoid, are depicted in figures 9.4 and 9.5. The square function produces the discrete outputs 0 and 1, while the sigmoid provides a real value between them.
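The two activation functions can be sketched side by side (an illustrative sketch; function names are our own):

```python
import math

def square(s, phi=0.0):
    """Step ('square') activation: hard 0/1 decision against a threshold."""
    return 1 if s > phi else 0

def sigmoid(s):
    """Sigmoid activation: a real value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-s))

# A barely positive summation: square snaps to 1, sigmoid stays near 0.5,
# signalling that this input is only just classified.
print(square(0.01))
print(round(sigmoid(0.01), 3))
```

The contrast in the example previews the point made below: the square function hides how marginal a classification is, while the sigmoid exposes it.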

There are many other activation functions used in practice but these two are by far the most used. There are a few differences apart from their shape in using these functions. Here are some important differences.

1. The square function is simple, using only addition and a comparison with a threshold value (which is 0 when the threshold is included in the summation, as we did here), while the sigmoid is more compute-intensive, requiring raising e to the (negated) sum. This makes the square function easy to use where computing power is limited.

2. The square function is discrete; it changes abruptly between negative and positive values. The sigmoid, unlike it, is continuous.

3. Some algorithms require an activation function which is differentiable (for example the back propagation algorithm, which we are going to look at later); only sigmoid-type functions can be used there.

4. The square function does not help learning from correctly classified values; the sigmoid does. We will elaborate on this soon.

5. The shape of the activation function decides the speed of learning. One can choose the slope of the activation function to set the learning rate; this is possible with the sigmoid function. We will elaborate on this in the next section.

 

The process of learning

 

We have stated multiple times that neural networks learn this and that. Let us see how a network learns anything. In fact, learning is the process of finding weights which unambiguously classify each input correctly. For example, take the problem of recognizing the faces of a company's employees. We might have a few images of each employee, probably taken under different lighting and orientations, stored in a database. Now we write a program which implements the neural network and provide each image as input to that program one after another. We adjust the weights after each input image and its output (maybe in the form of the employee name or number associated with that image) so that the network gets better at recognizing the same image thereafter. We complete one epoch (presenting all images of all employees) and repeat it again and again until all images are correctly identified. Once this happens, the network weights are set in such a way that every image, when input, correctly produces the name (or number) of the employee. The network is said to have learned. Various researchers have produced varieties of neural networks to date, but this fundamental principle of learning has not changed. Let us summarize:

 

1. The network starts with a random set of weights for all layers.

2. All inputs (let us call them training inputs) are presented one after another.

3. The output is closely observed and the weights are adjusted so that if the same input is presented again, the output is stronger. That means an incorrect output will now move toward the correct one, while a correct output strengthens the weights further, to the extent that some later weight changes can still produce the correct answer.

4. The steps 2 and 3 are repeated until all training instances are correctly identified.

5. The final weight vector is stored for further use.
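The five steps above can be sketched as a small training loop for a single threshold unit, using the bias trick of folding the threshold into the weight vector. The step size `delta`, the random-weight range, the epoch cap, and the AND-function training data are all illustrative assumptions, not from the text.

```python
import random

def train(samples, n_inputs, delta=0.1, max_epochs=1000):
    """Train one threshold unit (with bias weight) until every sample is correct."""
    # Step 1: start with random weights (index 0 is the bias, i.e. -phi).
    ws = [random.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
    for _ in range(max_epochs):                     # Step 4: repeat epochs
        errors = 0
        for xs, target in samples:                  # Step 2: present each input
            xs_aug = [1] + xs                       # x0 = 1 for the bias
            out = 1 if sum(x * w for x, w in zip(xs_aug, ws)) > 0 else 0
            if out != target:                       # Step 3: adjust on error
                errors += 1
                sign = delta if target == 1 else -delta
                ws = [w + sign * x for x, w in zip(xs_aug, ws)]
        if errors == 0:
            break                                   # all training inputs correct
    return ws                                       # Step 5: store the weights

# Illustrative training data: the logical AND function.
and_samples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
ws = train(and_samples, 2)
```

Since AND is linearly separable, this loop is guaranteed to reach an error-free epoch; for inputs that are not separable by a single unit, the epoch cap is what stops it.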

 

It is important to note that there are two different types of inputs, one used for training and another used for testing. We take the same weight vector and see whether all testing inputs give correct outputs. If so, the network has not only learned but also generalized its learning. For example, if the network is trained on handwritten characters, one complete set is reserved and not provided during the training phase; that set is provided in the testing phase. If it is recognized correctly, the network is said to have not only learned but generalized correctly.

 

There is one more method used in practice. When a large amount of data is given, it is divided into N folds, named F1 to Fn. First, folds F1 to Fn−1 are used for training the network and Fn is used for testing. Next, F1 to Fn−2 and Fn are used for training and fold Fn−1 is used for testing. The network is thus trained and tested once for each of the N folds.2
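The fold rotation described above can be sketched as follows. This is an illustrative helper of our own; it only produces the (training set, testing set) pairs, leaving the actual training and testing to whatever routines the network uses.

```python
def n_fold_rounds(data, n):
    """Split data into n folds and return the n (train, test) pairs,
    rotating which fold is held out for testing."""
    size = len(data) // n
    folds = [data[i * size:(i + 1) * size] for i in range(n)]
    rounds = []
    for held_out in range(n):
        train_set = [item for i, f in enumerate(folds) if i != held_out
                     for item in f]
        test_set = folds[held_out]
        rounds.append((train_set, test_set))
    return rounds

# Example: 6 items in 3 folds gives 3 rounds of 4 training and 2 testing items.
rounds = n_fold_rounds(list(range(6)), 3)
```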

 

Learning for correct values and speed of learning

 

While learning, the network receives two types of inputs, correct and incorrect. Correct inputs should be identified as correct and incorrect ones as incorrect. For example, suppose we have written a program for judging the signatures of our customers and designed it with, say, the square function as the activation function. The customers' signature information (maybe images or input from a stylus-like device), based on a reasonable number of samples of different signs by the same customer (10 is a good measure3), is fed into the program one after another and stored in the database. Once all sample data about the customers is collected, the neural network program runs and takes the inputs one after another. The program first generates some weights randomly. The signature information is fed in and the output is checked. If the output is not 1 (as this is a correct signature, the output must be 1), we have to retrain the network in a way that it becomes 1:

1. If we have an incorrect weight setting for that input, we adjust the weights to make sure that when that signature is input to the system next time, the system responds positively. This corrects the response for an input which was classified incorrectly (negatively misclassified).

2. Similarly, if a signature other than this customer's is fed in and classified as the customer's, we again need to adjust the weights to make sure it is not classified as this signature again (positively misclassified).

2 This is popularly known as n-fold validation.

3 One might think the more samples the better. Unfortunately, more than a sufficient number of inputs creates overlearning, which in turn inhibits the network's ability to generalize. One must be cautious.

 

In fact the method to do so is pretty simple. We need to reduce the summation for positively misclassified inputs and increase it for negatively misclassified ones. To reduce the summation, we reduce the weights where the input values are non-zero (a zero input plays no part in the summation, as its term becomes zero when multiplied with the weight). We have already seen that.

 

Unfortunately, not only the misclassified inputs but the correctly classified ones also require learning. For example, if we are working on a customer's signatures and for a typical signature the summation comes out to be 0.01, our threshold checker says "positively classified" and we are happy and done. Is that a good thing? No: the input is barely classified, and a little weight reduction caused by the next input may make it negatively misclassified next time. It is better to increase the weights even in this case, so that the summation grows enough that weight reductions by other inputs later still leave this input positively classified. This is also important for testing: when inputs other than those we have seen are provided, the network should classify them clearly.

 

For strengthening the learning of positively classified inputs, we cannot use the square function in its raw form; the sigmoid works better. In figure 9.5 you can see that until the summation reaches roughly ±6 or ±7, the output is not yet close to its extreme values. That means that even when the summation is positive for a correct input, or negative for an incorrect one, the network will keep learning. In fact the sigmoid function returns exactly 1 only when the summation is infinite, so the network would keep learning until the programmer stops it. Usually the programmer places a cap on the learning; for example, he might accept 0.90 as 1, so that when the activation function returns 0.9, the weights are not modified further for that input. Learning stops when all inputs are learned.

 

Another important issue is the speed of learning. The slope of the activation function decides how fast the network learns: the steeper the slope, the faster it learns, because a little weight change then induces a bigger difference in the output, so the network moves faster in the direction we want. In some cases the slope of the activation function is carefully changed to increase or decrease the speed of learning.

 

Generalization

 

Generalization was mentioned in the previous section. It is an important component of human learning. When a child is taught to recognize the character 'A', for example, he may be shown a few different sizes of 'A', maybe in different colors. When the child is able to identify that character correctly, we consider that he has learned to recognize the character we wanted him to. Similarly, signature recognition for a bank employee does not end when he can recognize the signatures already present with the bank. He must also be able to recognize fresh signatures of the same set of customers, which may be a little different from the samples seen earlier.

 

We expect neural networks to exhibit a similar capability. Fortunately, most networks which learn by setting weights can exhibit generalization if some care is taken. Some designers put a cap on the learning process, while others cap the number of units in the middle layer (when three layers of neurons are used), popularly known as the hidden layer. Other researchers add noise to the input (surprising but true) to achieve generalization: when noise is added, the system learns to be tolerant of that level of change in the input and to handle inputs with that level of variation.

 

The black box of reasoning

 

In fact, ANNs represent the brain model, which works like a black box. For example, how do we learn to classify things? My daughter, when she was three, took a ride with me. A few cars passed by. The first was a jeep; she asked, "What is it, papa?" and I responded "car". The next was a smaller van (I think it was an Omni, an eight-seater van); she asked the same question and I responded "car". The third was a Maruti 800 car, with the same exchange. When the fourth car passed by, she screamed, "Look papa! A car!" If you look closely, all three models were quite different from each other, but not only could her little mind grasp the common features from all three examples, she could use them to classify the next item correctly as a car. Obviously she had little idea how her mind did it; most of us have little idea how our brain solves such classification problems either. The human mind acts like a black box. We are given examples; our black-box algorithm learns the common features of those examples and stores them in some way convenient for searching the next time a similar example is presented. Our associations and information are stored in this black-box fashion. We call it a black box because we have no idea how the information is actually stored. The ANN process works exactly like that black box: it has some inputs and known outputs for those inputs, and the process makes sure the black box learns to provide similar outputs for similar inputs. Humans learn many elementary vision- and audio-related things that way. Take the example of hearing a few bars of music and coming up with the actual song; it is an excellent example of how our brain does an associative search. Though we are good at doing this, we do not really know how the songs we listen to are stored in our internal database, nor how we search it and retrieve the answer.

 

A computer program using a classification algorithm must work the same way. It should be given enough samples to "learn" the features and should then be tested on unseen inputs, which it must classify correctly.

 

Unseen does not mean out-of-context inputs. For example, a signature recognition program should recognize signatures of the clients it has already learned, maybe drawn a little differently than the samples. Similarly, when training a digit recognizer for 0 to 9 is completed using some images of those digits, we might produce a new image of any digit in that range (i.e., any one from 0 to 9) but with a different orientation, background, color, or size for testing.

 

Unsupervised Learning

 

Before completing this module, let us discuss one more important issue. So far, we took problems where we know what the output for a given input is. We are not always in a position to do so; we often learn to classify things without external feedback. For example, we read a few essays and classify them as good and bad. We look at items and classify them as one type or another (for example household items and office items, heavy items and light items, and so on). We meet people and decide they are nice or not so nice, helpful or unhelpful, joyful or sad, and so on. We design the classification criteria on our own, and we also decide the number and type of classes into which the items are classified. The same job is achieved by a type of learning in neural networks called unsupervised learning. In unsupervised learning, items which have more common features are classified as part of a single class, and items with different characteristics are kept elsewhere. In many cases, unsupervised learning is performed before the kind of learning we have discussed so far, supervised learning.
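The idea of grouping by shared features, with no labels supplied, can be illustrated with a tiny sketch. This is not any specific neural model from the text; it simply groups numbers around the nearest of some given centers, standing in for "items with more common features end up in the same class".

```python
def group_by_nearest(items, centroids):
    """Assign each item to the centroid it is closest to (1-D distance)."""
    clusters = {c: [] for c in centroids}
    for item in items:
        nearest = min(centroids, key=lambda c: abs(item - c))
        clusters[nearest].append(item)
    return clusters

# Two natural groups emerge with no labels given: the small numbers
# gather around 2 and the large ones around 10.
clusters = group_by_nearest([1, 2, 9, 10, 11], centroids=[2, 10])
```

Self-organizing maps, mentioned earlier in this module, are one neural realization of this unsupervised idea.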
