11 Learning process in BPNN and Hopfield networks

Bhushan Trivedi


 

Introduction

 

This module extends the discussion we started in the previous module, where we looked at how multilayer perceptrons, or multilayer networks, are designed and how the activations are calculated for a given input. Now we will see how learning is done and how the weights are updated in a BPNN. We will also see how Hopfield networks are used for content addressability.

 

Learning in Back Propagation network

 

A back propagation network, popularly known as BPNN, is a neural network whose weights can be learned using the back propagation algorithm. Figure 10.5, which we saw in the previous module, is an example of a back propagation network.

 

The forward activations (hi and oj) are calculated using the sigmoid activation function. Once the output is obtained for a given input, the weights must be adjusted so that a better output is achieved next time. We have already seen that the adjustment of a weight is decided based on the error that we receive. Each output is a real value between 0 and 1, while the required value is either 0 or 1. For example, if for a given face the correct output is 0, 0, 0, 0, 1, 1, we might get 0.10, 0.01, 0.92, 0.25, 0.92, 0.87 as the output of the output units (considering two digits after the decimal point and truncating accordingly). The difference at each unit can then be calculated as 0.10, 0.01, 0.92, 0.25, 0.08, 0.13 (the difference between the value calculated and the required output). Now we decide what our tolerance level is. Suppose our tolerance level is 0.10; then three units (the first, second and fifth) have correctly learned, as their difference is less than or equal to 0.10, while the rest are still to be learned. Their weights are to be reduced if the output is higher than required and increased if it is lower. Output units 1, 2 and 5 are already learned, so no weight update takes place for them. Look at the third unit: it should output 0 but produces 0.92, so the weights on the connections coming into output unit 3 (o3) must be reduced; the same holds for unit 4. Similarly, output unit 6, which produces 0.87 instead of 1, must increase its weights a little further.
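
To make the tolerance check concrete, here is a minimal sketch in Python. The target values, outputs and the tolerance of 0.10 are the ones from the example above; the variable names are only illustrative.

targets = [0, 0, 0, 0, 1, 1]                      # required output (y)
outputs = [0.10, 0.01, 0.92, 0.25, 0.92, 0.87]    # output actually produced (o)
tolerance = 0.10

for j, (y, o) in enumerate(zip(targets, outputs), start=1):
    diff = abs(y - o)
    if diff <= tolerance:
        print(f"output unit {j}: learned (difference {diff:.2f})")
    elif o > y:
        print(f"output unit {j}: reduce its incoming weights (output too high)")
    else:
        print(f"output unit {j}: increase its incoming weights (output too low)")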

 

The only issue now is to find the exact amount by which each weight must be increased or decreased. For that, let us assume that oj is the output produced by the network and yj is the desired output. The weights on all incoming connections of an erring output unit need to be changed. For the jth output unit, the difference is yj - oj. The error is calculated by multiplying this difference with oj and (1 - oj) (see footnote 1 below). The error is denoted by δ2 (error at the second weight layer); the subscript of δ2j indicates the unit of the output layer.

 

1. The explanation for this multiplication is beyond the scope of this discussion; you can refer to any book on BPNN for further details.

 

Thus the error at output unit j is δ2j = (yj - oj) * oj * (1 - oj). Now let us look at the error at the hidden layer.
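
As a quick illustration, this error can be computed directly from the formula; the y and o values below are the ones from the running example, and this is only a sketch with made-up variable names.

# delta2_j = (yj - oj) * oj * (1 - oj) for each output unit j
targets = [0, 0, 0, 0, 1, 1]                      # yj
outputs = [0.10, 0.01, 0.92, 0.25, 0.92, 0.87]    # oj

delta2 = [(y - o) * o * (1 - o) for y, o in zip(targets, outputs)]
print(delta2)   # negative where the output must come down, positive where it must go up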

 

Let us try to calculate the error at h1. The hidden unit h1 is connected to all oj (for all j from 1 to m). Each oj has an error δ2j associated with it. All the lines that bring the errors back to h1 are shown in boldface in figure 11.1.

 

The summation of the errors multiplied by their associated weights for hidden unit h1 is found as follows

 

= w211 * error at o1 + w212 * error at o2 + w213 * error at o3 + w214 * error at o4 + … + w21m * error at om

 

= ∑j w21j * δ2j

 

The error at hidden unit h1 is therefore

 

= h1 * (1 - h1) * ∑j w21j * δ2j

 

Thus for the ith hidden unit, the error is denoted as δ1i and is calculated as

δ1i = hi * (1 - hi) * ∑j w2ij * δ2j
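
A minimal Python sketch of this calculation, assuming a weight matrix w2 indexed as w2[i][j] (from hidden unit i to output unit j); all the numbers are made up for illustration.

# delta1_i = hi * (1 - hi) * sum_j( w2[i][j] * delta2[j] )
h = [0.6, 0.3]                      # hidden activations (assumed)
w2 = [[0.1, -0.2, 0.05],            # w2[i][j]: weight from hidden unit i to output unit j
      [0.4,  0.3, -0.1]]
delta2 = [0.02, -0.05, 0.01]        # errors already computed at the output layer

delta1 = [h[i] * (1 - h[i]) * sum(w2[i][j] * delta2[j] for j in range(len(delta2)))
          for i in range(len(h))]
print(delta1)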

 

Once the errors are calculated, one must update the weights. We do not really need to worry separately about whether to reduce or increase a weight, as the sign of the error itself determines that. If (yj - oj) is positive, the output is too small and the weights need to be increased; if it is negative, they need to be decreased. The other factor, oj * (1 - oj), is always positive (since oj lies between 0 and 1), and thus the product (yj - oj) * oj * (1 - oj) is positive when the weights are to be increased and negative when they are to be decreased. That means these error values can be used as they are when the weight update is done.

 

Thus the weight update is done using the error associated with each weight. For a weight in the second layer (w2), the activation comes from the hidden unit and the error from the output unit; for a weight in the first layer (w1), the activation comes from the input unit and the error from the hidden unit.

 

For example, the update for w211 is based on h1 (the unit where the activation begins) and the error δ21 at o1 (the unit where the activation ends).

 

So weight update at w211 is based on h1 * δ21

 

Similarly, the weight update for w2ij is based on hi * δ2j (note that the subscripts of h and δ2 are different).

 

The weight updates are indicated as ∆. Usually the weight update also includes the value called ɳ, so weight updates are written as

 

∆w2ij = ɳ * hi * δ2j

 

The symbol ɳ is known as the learning rate. Its value is kept in the range of 0.3 to 0.4 (see footnote 2 below). Thus

New w2ij = Old w2ij + ∆w2ij

 

An update is usually also multiplied by a value called the momentum factor α, which is kept at 0.9 after a few iterations. In the initial few iterations, its value is kept very low, usually 0.1.

 

New w2ij = Old w2ij  + α * ∆ w2ij

 

Similarly, updates to the first weight layer involve the input activations xi (where each weight begins) and the errors at the hidden layer (where each weight ends), and the weight updates are written as

 

∆w1ij = ɳ * xi * δ1j

Thus

New w1ij = Old w1ij  + ∆ w1ij

 

New w1ij = Old w1ij  + α * ∆ w1ij (considering momentum factor)
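
Putting the update rule together, here is a minimal Python sketch that follows the formulation above: ∆w2ij = ɳ * hi * δ2j, and the new weight is the old weight plus α times ∆w2ij. The ɳ and α values are the ones mentioned in the text; the activations, errors and initial weights are made-up illustrations.

# Weight update for the second weight layer (w2), following the text:
# delta_w = eta * h[i] * delta2[j]; new w = old w + alpha * delta_w.
eta = 0.3      # learning rate (text: roughly 0.3 to 0.4)
alpha = 0.9    # momentum factor (0.1 in the first few iterations, 0.9 later)

h = [0.6, 0.3]                      # hidden activations (assumed)
delta2 = [0.02, -0.05, 0.01]        # output-layer errors (assumed)
w2 = [[0.1, -0.2, 0.05],            # w2[i][j]: weight from hidden unit i to output unit j
      [0.4,  0.3, -0.1]]

for i in range(len(h)):
    for j in range(len(delta2)):
        delta_w = eta * h[i] * delta2[j]
        w2[i][j] = w2[i][j] + alpha * delta_w

print(w2)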

 

Once these weight updates are applied for every input, one epoch is said to be over. Multiple epochs (thousands of them in most cases) are usually required to obtain a set of weights that classifies all the inputs correctly. If we do not get such a set, we may need to start all over again, as there is no guarantee that we will always find that weight vector.

 

2. Values such as the learning rate and the momentum factor are found empirically; researchers try a few values and use whichever gives better performance with less overhead.

Figure 11.1 The errors calculated back at the hidden layer based on the errors found at the output layer.

 

The steps of the algorithm

 

Let us briefly list the steps of the algorithm for training. A small code sketch of this training loop is given just after the list.

 

1. Decide the number of neurons for each layer: fix the input and output layers first, and then take the number of hidden units as the geometric mean of those two values.

2. Initialize each weight matrix with small random values, for example between -0.1 and +0.1, or between -0.05 and +0.05. The idea of keeping the values small is that larger weights tend to start dominating and bias the output; keeping them small helps the network learn in a more unbiased way.

3. Epoch = 1;

4. Pick up first input

5. Provide the input to each xi, and calculate each hi as well as oj based on the input.

6. Get correct output for that input, call it yj.

7. Find out the error δ2j based on equation 10.13 for each output oj.

8. Find out the error δ1i based on equation 10.2 for each hidden activation hi.

9. If the error for every output unit is within the tolerance level, mark this input as learned and go to 11.

10. Find out ∆ w2ij and ∆ w1ij and update weights accordingly.

11. If there is another input, take it and go to 5.

12. This state is reached when all inputs are processed

13. Epoch value is incremented by 1

14. If the epoch value is more than a reasonable limit (different for different cases), the network is not learning; stop and restart the process from the beginning.

15. If there is any input which is not learned, pick up the first input again and go to 5.

16. Otherwise store the weight matrices; the network is ready for testing.
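
The following is a compact, illustrative sketch of this training loop in Python. The single hidden layer, the sigmoid units and the overall flow follow the steps above, but the XOR data set, the layer sizes, the epoch limit, the constant bias input of 1 and the fixed momentum value are assumptions made only for this example.

import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    # Step 5: forward activations for the hidden and output layers.
    h = [sigmoid(sum(x[i] * w1[i][j] for i in range(len(x))))
         for j in range(len(w1[0]))]
    o = [sigmoid(sum(h[i] * w2[i][j] for i in range(len(h))))
         for j in range(len(w2[0]))]
    return h, o

def train(data, n_in, n_hid, n_out, eta=0.3, alpha=0.9, tol=0.10, max_epochs=20000):
    # Step 2: small random initial weights.
    w1 = [[random.uniform(-0.1, 0.1) for _ in range(n_hid)] for _ in range(n_in)]
    w2 = [[random.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_hid)]
    for epoch in range(1, max_epochs + 1):              # steps 3, 13, 14
        all_learned = True
        for x, y in data:                               # steps 4, 11, 12
            h, o = forward(x, w1, w2)                   # step 5
            # Steps 7 and 8: errors at the output and hidden layers.
            delta2 = [(y[j] - o[j]) * o[j] * (1 - o[j]) for j in range(n_out)]
            delta1 = [h[i] * (1 - h[i]) * sum(w2[i][j] * delta2[j] for j in range(n_out))
                      for i in range(n_hid)]
            if all(abs(y[j] - o[j]) <= tol for j in range(n_out)):
                continue                                # step 9: within tolerance, no update
            all_learned = False
            # Step 10: weight updates (momentum kept fixed here for simplicity).
            for i in range(n_hid):
                for j in range(n_out):
                    w2[i][j] += alpha * eta * h[i] * delta2[j]
            for i in range(n_in):
                for j in range(n_hid):
                    w1[i][j] += alpha * eta * x[i] * delta1[j]
        if all_learned:                                 # steps 15, 16
            return w1, w2
    return None                                         # step 14: network did not learn

# Usage on a toy problem (XOR) with a constant bias input of 1.
xor = [([0, 0, 1], [0]), ([0, 1, 1], [1]), ([1, 0, 1], [1]), ([1, 1, 1], [0])]
weights = train(xor, n_in=3, n_hid=4, n_out=1)
print("learned" if weights else "did not converge; restart with fresh random weights")

As step 14 says, there is no guarantee of convergence; if the sketch reports that it did not converge, the process is simply restarted with new random weights.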

 

Now let us write down the algorithm for testing

 

1. Populate the weight matrix with weights learned during the training process

2. Take the first testing input.

3. Provide that input and see what the output is. We now use a different tolerance level: normally a value greater than 0.5 is considered 1 and a value less than 0.5 is considered 0 (a small sketch of this thresholding follows the list).

4. Display the output. There is no updating of weights for testing inputs.
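
A tiny Python sketch of the thresholding used during testing; the output values are the ones from the earlier example.

# During testing, an output above 0.5 is read as 1 and below 0.5 as 0.
outputs = [0.10, 0.01, 0.92, 0.25, 0.92, 0.87]
bits = [1 if o > 0.5 else 0 for o in outputs]
print(bits)   # prints [0, 0, 1, 0, 1, 1]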

 

Geometrical view of the learning process

 

If one would like to view this process geometrically, it is about finding a line in a plane which separates two classes. If there are multiple classes, we need multiple lines to divide the plane into segments that correctly separate the classes we want. For example, in the face recognition case, we might have a plane with some weight values as shown in figure 11.2. Each line corresponds to a typical combination of neurons, and the combination of weights determines the m and c values in the conventional y = mx + c equation of a line. Thus some weights determine the slope of the line and others determine the intercept on the Y axis.
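
To make the link between weights and the line concrete, consider a single unit with two inputs: its decision boundary wa*x + wb*y + b = 0 can be rewritten as y = m*x + c with m = -wa/wb and c = -b/wb. The weight values in this small Python sketch are arbitrary and only illustrate the mapping from weights to slope and intercept.

# Two input weights and a bias define a line in the plane:
# wa*x + wb*y + b = 0  is the same line as  y = m*x + c,
# with m = -wa/wb and c = -b/wb. The numbers are illustrative.
wa, wb, b = 1.5, -2.0, 0.5
m = -wa / wb      # slope of the decision line
c = -b / wb       # intercept on the Y axis
print(f"decision line: y = {m:.2f} * x + {c:.2f}")   # y = 0.75 * x + 0.25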

 

Figure 11.2 Different regions separated in classes by different lines

 

Figure 11.2 represents a case where the correct weights for all the lines have been found, and thus those lines divide the plane into the correct regions.

 

Starting with random weights means we are drawing an arbitrary line in the plane. Changing the weights using the back propagation algorithm means drawing a new line that separates the classes better than the previous one.

 

In fact, when we use multiple neurons in a single layer, the inputs lie in an n-dimensional space, and one or more (n-1)-dimensional planes are used to segregate the different classes. In AI parlance, such a plane (or the line in the earlier case and in figure 11.2) is called a decision surface.

 

In short, the back propagation algorithm finds one or more decision surfaces of dimension n-1 that correctly segregate each class in the n-dimensional input space.

 

Content addressability and Hopfield networks

 

Though most neural networks are used for classification, many proposals have been made to use them for other purposes. An important use of neural networks is in the field of content addressability. Let us take an example to understand it. We normally think of a byte as storing 8 bits and being able to hold any combination from 00000000 to 11111111. The example of content addressability described here uses the same eight bits, but a little differently. Here a neuron is used to store a bit, and each neuron is connected to some other neurons with weights. Each neuron is assigned some value in the beginning. Once the values are assigned, each neuron calculates the sum ∑ xi * wi over the n other neurons connected to it, where xi is the value of such a neuron and wi is the weight of the connection between them; the neuron becomes 1 if this sum is positive and 0 otherwise. For example, the last neuron in figure 11.3 is connected to the 5th and 6th neurons: it inhibits the 5th neuron with strength 5 (weight -5) and excites the 6th neuron with strength 2 (weight +2). The 5th neuron is 1, so it sets the last neuron to zero. The 6th unit, on the contrary, receives a positive input from the last neuron, but it also has strong inhibition (-2 from the second neuron and -5 from the 4th neuron); its total comes out negative, so it too settles to 0.

 

In fact, based on this connection information alone, only a few values can be stored by the network shown in figure 11.3. For example, if we begin with 11111111, the network will settle down into one of a few stable states. The state it settles into depends on the order in which the units are updated. If we take a left-to-right order, for example, it works like this.

 

The first unit becomes active (the first 1), which activates the fourth neuron. It also inhibits the third neuron, thus the pattern becomes

 

1_01____

 

Now the second unit becomes active (the second 1); it activates unit 7 and inhibits unit 6. After that the value is

 

1101_01_

 

Next comes the fifth unit; if it is turned on it becomes 1, and then the eighth unit will be zero, as the 5th unit inhibits it. The result is

 

11011010

 

Thus even if the input is incorrect (not a stable form), the output is a stable form. You can try a few other combinations and see that there are only a few possible states for this network to settle into. One more option in the above case is to assume that the first two units are activated and then the last one. Then the eighth unit comes on before the fifth; it inhibits the fifth neuron, and the result is

 

11010011

 

You can see that two bits are different in this other stable state.

 

One typical stable state, surprisingly, is 00000000, all zeros. You can easily see that any such network has this as a stable state.

 

You may be wondering what we are trying to achieve by this seemingly wasteful exercise. Eight neurons, which in the true sense are capable of storing 2^8 values, are restricted to storing far fewer (usually a number one can count on one's fingertips), and that too with additional weights and so on.

 

We are trying to demonstrate a type of network, called a Hopfield network, with the ability to content address. Content addressability, if you remember, is the ability of a network to be given some part of the data and retrieve the complete data; for example, listening to a few bars of music and recalling the entire song. Here we have provided a few activations and received activations from all the other units, quite similar to looking at part of a face and recognizing the whole face. You can think of each neuron as indicating a feature, an arc as a relationship between features, and the weight as the strength of that relation. Such networks can be used for recognizing a face from a partial image. Some of the given information can even be wrong! For example, a query might describe a big fish which often comes to the surface and throws water up like a fountain. The features are: fish, fountain of water, big in size, and coming to the surface regularly. All of this might get you the answer "whale". Notice that the input contains an incorrect feature (fish, since a whale is not a fish but a mammal), yet we still get the right answer. In the case of figure 11.3 too, if we give 1110 as an input, for example, the first two bits will set the next two, we will get 1101, and the rest follows based on the input. This is the content addressability that Hopfield networks are designed to provide. In our trivial example we have looked at only 8 such neurons; in a real-world case there may be thousands of them.

Figure 11.3 Content Addressability
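
To see this behaviour in code, here is a minimal Python sketch of a Hopfield-style network. It does not reproduce the exact weights of figure 11.3 (which are not listed in the text); instead, the weights are built with a simple Hebbian rule, an assumption for this illustration, so that the stable state 11011010 from the trace above is stored. The update rule is the one described above: a unit becomes 1 when the weighted sum of the values of its neighbours is positive, and 0 otherwise.

stored = [1, 1, 0, 1, 1, 0, 1, 0]          # the pattern the network should remember
n = len(stored)
s = [2 * b - 1 for b in stored]            # same pattern written with +1/-1 values

# Symmetric weights with zero self-connections (Hebbian rule, assumed here).
w = [[s[i] * s[j] if i != j else 0 for j in range(n)] for i in range(n)]

def settle(state, sweeps=10):
    """Update the units one by one until no unit changes any more."""
    state = list(state)
    for _ in range(sweeps):
        changed = False
        for i in range(n):
            net = sum(w[i][j] * state[j] for j in range(n))   # weighted input to unit i
            new = 1 if net > 0 else 0
            if new != state[i]:
                state[i], changed = new, True
        if not changed:
            break
    return state

# Give the network only part of the stored pattern (two of its 1s are missing);
# it settles back into the complete stored pattern: content addressability.
print(settle([1, 1, 0, 0, 1, 0, 0, 0]))    # prints [1, 1, 0, 1, 1, 0, 1, 0]

Starting from a pattern with two bits missing, the network fills them back in, just as described for the partial face and the "whale" query above.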

 

Summary

 

The back propagation algorithm is usually applied to neural networks which are multilayer, feed forward and fully connected. The usual neural network contains three layers: input, hidden and output. The input layer distributes the inputs to all hidden units, the hidden units process them and pass them on to the output units, and the output units process them to generate the outputs. Errors are calculated and propagated back, and the weights between the input and hidden layers and between the hidden and output layers are changed accordingly. This process continues till the weights classify every input correctly. The learning process essentially divides the inputs into output clusters, so that every input is correctly classified to a geometric region identified with a particular output. Another type of neural network, the Hopfield network, provides content addressability, which means that setting a small fraction of a pattern generates the complete pattern.
