10 Using backpropagation for multilayer networks

Bhushan Trivedi


 

Multi-layer feed forward networks and learning

 

The backpropagation algorithm is quite useful in solving many real-world problems, including finding the right set of motivational strategies for employees1, shortlisting resumes from a large pool, fingerprint recognition, voice recognition, face detection and so on. The backpropagation algorithm is usually applied over multilayer, feed-forward and complete (fully connected) neural networks, and most problems are designed and solved using such networks. In this module we will look at how these layers are organized and how the backpropagation algorithm is applied. Backpropagation is basically a classification learning algorithm. When the user knows how things are classified but cannot state a formal method for the classification, backpropagation is an attractive option. For example, we can tell whose face is whose, but we do not know the formal method by which we do it. We can recognize signatures, but again we do not know how our mind does that. These are excellent examples of where one can use the backpropagation algorithm.

 

Many variants of the basic algorithm are used in practice with different types of networks, but in this module we will confine our discussion to multilayer networks and the conventional backpropagation algorithm. This discussion will be sufficient to initiate the learner into the process of using backpropagation. Most real-world problems can be solved with this combination; for a special case one can modify the network and the algorithm, and the reader can proceed further using that knowledge.

 

The network used here is called multi-layer as it contains more than one layer. It is feed-forward as the input activations flow in the forward direction. Though activations are propagated forward, the errors are sent (propagated) back to adjust the weights; that is why the algorithm is called backpropagation: it propagates errors back. The network is also called complete as every input unit is connected to every hidden unit, and every hidden unit in turn is connected to every output unit. The sample network in figure 9.1 has only a 3-3-2 architecture (3 input, 3 hidden and 2 output units), but the number of layers and the number of neurons in each layer depend on what we are planning to learn.

Figure 9.1 The multi-layer feed-forward network

 

1 One student of the author of this module has done his Ph.D. on this topic.

 

A multi-layer feed-forward network is shown in figure 9.1. The first layer is known as the input layer, and its job is to accept and distribute the inputs so that every neuron in the hidden layer receives a copy of each input. Thus every unit of the hidden layer receives information from all input neurons. The same is true for the communication between the hidden and output layers. The hidden layer comes next (though most networks have one hidden layer, some special cases might have multiple hidden layers). The job of the hidden layer is to extract features from the inputs; we will elaborate on that shortly. The layer is called ‘hidden’ as it is not seen from either side: the inputs interact with the input layer and the output layer generates the output, but the hidden layer interacts with neither the input nor the output, hence the name. The output layer’s job is to produce the output that identifies the class. Let us take an example to understand how the number of neurons at each layer is decided.

 

Assume a character recognition program is running. The input is a character in the form of a 9*9 matrix, as shown in figures 9.2, 9.3 and 9.4. This is a very crude representation but good enough for our discussion. One could instead take an image with each pixel as one unit, using perhaps 500 * 500 pixels, to make it far better. Even in that case our discussion does not change much.

The input layer accepts the 9*9 matrix, with value 0 for a blank square and 1 for a filled square. Thus the input to the network for figure 9.2 would be as follows.

 

000010000
000101000
000101000
001000100
001111100
010000010
010000010

 

Thus a total of 81 (9*9) binary input units are needed at the input layer, one for each binary value in the input. Some of them are 1 while all the others are 0. How many output units are needed? Suppose we want to recognize uppercase letters only; that makes 26 classes. If we want digits as well, it becomes 36. If we want uppercase, lowercase and digits, the total number of different characters we would like to recognize mounts to 62 (26+26+10). The first case needs outputs in the range 0..31 (32 combinations), while the last two require the range 0..63 (64 combinations). For 32 combinations we need 5 binary output neurons, while for 64 we need 6 binary output neurons2.
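To make this counting concrete, here is a minimal Python sketch; it is only an illustration under the assumptions above (the grid is the partial sample shown earlier, and the names grid and output_units are ours, not from the text). It flattens the character grid into one binary input per cell and computes the smallest number of binary output neurons whose combinations cover the given number of classes.

import math

# Partial sample character grid from the text
# (a full grid would be 9 rows of 9 values).
grid = [
    "000010000",
    "000101000",
    "000101000",
    "001000100",
    "001111100",
    "010000010",
    "010000010",
]

# One binary input unit per cell: flatten the grid row by row.
inputs = [int(cell) for row in grid for cell in row]
print(len(inputs), "input units for this sample (81 for a full 9*9 grid)")

def output_units(num_classes: int) -> int:
    """Smallest k such that 2^k combinations cover the classes."""
    return math.ceil(math.log2(num_classes))

for classes in (26, 36, 62):
    print(classes, "classes ->", output_units(classes), "output neurons")

Running this prints 5 output neurons for 26 classes and 6 for both 36 and 62 classes, matching the counts in the paragraph above.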

 

What would be the number of hidden units? Researchers found that the geometric mean of the number of input units and the number of output units is a good measure; that is, the number of hidden units = √(nI * nO), where nI is the number of input units and nO is the number of output units. In our case the number of hidden units = √(81 * 6) = √486 ≈ 22 (after rounding off). One may ask: why do we have to have hidden units at all? What if they are not present? In fact, the first generation of neural networks did not have hidden layers; they had only input and output layers, and were popularly known as single layer perceptrons. The single layer perceptron could solve quite a large number of problems, and an excellent, fool-proof method was known to make it learn anything it is capable of learning. Unfortunately, single layer perceptrons were found to be incapable of solving a class of problems called non-linearly separable problems, which includes simple problems like XOR. Including a hidden layer removes that hindrance and makes the network capable of learning any problem it can be trained for, including non-linearly separable problems. Researchers also found that a network with one hidden layer can learn whatever a network with multiple hidden layers can, so one does not really need to have multiple hidden layers.
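A minimal sketch of the geometric-mean rule of thumb described at the start of this paragraph (the function name hidden_units is our own choice):

import math

def hidden_units(n_inputs: int, n_outputs: int) -> int:
    """Rule of thumb from the text: geometric mean of input and output unit counts."""
    return round(math.sqrt(n_inputs * n_outputs))

print(hidden_units(81, 6))  # sqrt(486) is about 22.05, so 22 hidden units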

 

The algorithms used to train multilayer perceptrons are not fool-proof though, in the sense that we cannot guarantee that the network will eventually learn everything it is capable of learning. Practically, though, in most cases it does. Even when it fails to learn, the only trouble the programmer has to take is to run the learning program once again3.

 

2 The total number of combinations represented by five binary neurons is 2⁵, which is 32; similarly, six binary neurons give 2⁶ = 64 combinations.

3 The search space of a multi-layer perceptron is highly convoluted, full of local minima of the error, so the network might get stuck for that reason. When the program starts all over again, it starts with another set of random weight values, effectively starting from a new place in the solution space, and is more likely to reach a solution.

 

Hidden layers help the network learn features of the inputs. For example, in the case of recognizing the character A in our character recognition problem, the hidden layer learns the features of the character input. You can easily see that different samples of the same character have different input cells turned on, so if the decision were based directly on which input units are active, it would be incorrect4.

 

Here, the hidden layer comes to the rescue. In the case of A, one hidden unit might learn to remain active when the input contains a slanting line rising from left to right. Another hidden unit may learn to remain active when a straight horizontal line is found. One more hidden unit may learn to remain active when there is a slanting line coming down from left to right. When all three of these hidden units are active, the output unit combination which represents the character A learns to become active. A may be represented by 000001 as an output; that means whenever these three hidden units are on (that is, there are three lines, two slanting in different directions and one horizontal), the first five output neurons learn to output 0 while the last one learns to output 1.

Interestingly, the hidden unit which has learned to remain active when a horizontal line is present also remains active when other characters containing a horizontal line, for example T or H, are presented. Figures 10.5, 10.6 and 10.7 show a few inputs to the network and what the hidden units do with them. All three hidden units, irrespective of which exact pixels are illuminated, learn to remain active when their slanting or horizontal line appears; when all three of these units are on, A is recognized. Unit 2 of the hidden layer is also on when characters like I and T are presented; if some other hidden unit has learned to remain active when a vertical line is present, that unit in conjunction with unit 2 can be used to recognize these two characters.

One can easily see that having a hidden layer helps in generalization as well. The particular combination of output units becomes active when the three hidden units become active (the case of A); thus, as long as two slanting lines in opposite directions and a horizontal line can be deduced from the figure, the network takes it to be A. That means even a figure the network has never seen before can be recognized correctly. Other features like size and colour do not really matter here, as hidden units tend to ignore features which do not help in recognition. Similarly, some features, like the two horizontal strokes of I, may or may not be present in the input; if the network is given sufficient samples (some with and some without those strokes), the network can learn that as well.

 

Let us also spend some time on the difference between one and multiple hidden layers. When the features of the inputs are quite complex and are better represented as combinations of other features, multiple hidden layers might be better. For example, if we want to recognize a feature which is itself a combination of other features (a particular line, a particular circle and so on), the additional hidden layer helps the network learn that composite feature.

 

Prerequisites to the Backpropagation algorithm

 

Let us learn about some prerequisites before we proceed further. We assume a three-layer neural network as shown in figure 10.5. We will look at how the number of layers is chosen, how the number of neurons at each layer is chosen, and how activations are sent forward and errors are propagated back. In the next module we will see how the weights are updated to make sure the network slowly converges to the solution.

 

Choosing number of nodes at each layer

 

Let us consider figure 10.8 for the discussion. Though there are three layers of activations (input, hidden and output), the weights are divided into only two layers: one between the input layer and the hidden layer, and another between the hidden layer and the output layer. w1ij are the weights between the input and hidden layers; w2ij are the weights between the hidden and output layers. The inputs are x1 to xn, the hidden units are h1 to hm, and the output units are o1 to ol. Thus the input layer has n units, the output layer has l units and the hidden layer has m units. Accordingly, the arrows carrying weights in the first layer stem from every input unit and terminate at every hidden unit; these weights form a matrix of size n * m. Similarly, the weights in the second layer go from every hidden unit to every output unit and form a matrix of size m * l. Let us make one more thing clear: x0 and h0, though they look like ordinary units, are not. They are fixed to the value 1 to avoid checking against a specific threshold; when the weights w10i and w20j are trained along with the others, a separate process for learning the correct threshold is not required. Thus we have three layers of activations (input, hidden and output) and two layers of weights, w1ij and w2ij.
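The following is a minimal sketch of this structure in Python/NumPy, under the assumptions in the paragraph above: the bias units x0 and h0 are fixed at 1, so each weight matrix carries one extra row for them, and the variable names (W1, W2, x) are ours, chosen only to mirror the notation.

import numpy as np

n, m, l = 81, 22, 6   # input, hidden and output unit counts from the character example

rng = np.random.default_rng(0)

# Two layers of weights. Row 0 of each matrix holds the weights of the
# bias unit (x0 or h0, both fixed at 1), so no separate threshold is needed.
W1 = rng.uniform(-0.5, 0.5, size=(n + 1, m))  # input  -> hidden, (n+1) * m
W2 = rng.uniform(-0.5, 0.5, size=(m + 1, l))  # hidden -> output, (m+1) * l

x = np.zeros(n)                           # one example input pattern (all blank squares here)
x_with_bias = np.concatenate(([1.0], x))  # prepend x0 = 1

print(W1.shape, W2.shape, x_with_bias.shape)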

 

The input is provided to the input layer, and the size of the input decides the number of nodes there. For example, if the image of a face is provided as input, assuming a maximum size of about 2 kilobits, a total of 2000 input units are needed, with each bit of the file fed to one unit of the input layer. Thus the value of n will be 2000. The output units are counted based on the classification. For example, if we have 50 people to classify among, we need 6 output units (as 2⁶ is 64, so with 6 binary output units we can represent 64 different combinations, which is enough for 50). Thus the value of l is 6. The number of hidden units, the value of m, is the geometric mean of both values = √(2000 * 6) = √12000 ≈ 110 (approx.); so m = 110. The weight matrices are 2000 * 110 (input to hidden) and 110 * 6 (hidden to output) respectively. Now let us see how hi (where i is between 1 and m) and oj (where j is between 1 and l) are calculated. As we have stated, the backpropagation algorithm requires a continuous (differentiable) activation function, so we will go for the sigmoid function. For example, the value of h1 (the activation at the hidden unit h1) is calculated by applying the sigmoid function to the summation of the inputs arriving at h1. Thus the summation is

 

Summation at h1 = x0*w101 + x1*w111 + x2*w121 + …. + xn*w1n1

(The activations and the weights contributing to the calculation of the summation at h1 are shown in boldface in figure 10.5.)

Similarly, the summation at each hidden unit hi and each output unit oj is calculated as follows5:

Summation at hi = x0*w10i + x1*w11i + …. + xn*w1ni     (10.1)

Summation at oj = h0*w20j + h1*w21j + …. + hm*w2mj     (10.2)

What do these two equations indicate? They indicate something we have already seen. Consider each hidden and output unit on its own, look at the inputs coming in and at the cumulative summation of activations multiplied by the weights, and you can see that the equations indicate the same thing we looked at in the previous module. The only difference is that we have a larger number of neurons. For a single neuron, for example h1, the thicker lines in figure 10.5 show its connections and the bold values indicate their weights; the weights of some other lines are also shown. What do these two equations have to do with learning in backpropagation? How does the network actually learn? We will see that in the next module.

5 The two subscripts used (i in 10.1 and j in 10.2) are dummy subscripts and can be replaced by any other letter without changing the meaning.
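As a concrete illustration of equations 10.1 and 10.2, here is a minimal Python sketch of the forward pass; it reuses the layer sizes and randomly initialized weights from the earlier sketch (all names and values are our own illustrative assumptions, not a prescribed implementation), and applies the sigmoid to each summation as described above.

import numpy as np

def sigmoid(s):
    """Continuous, differentiable squashing function used for the activations."""
    return 1.0 / (1.0 + np.exp(-s))

n, m, l = 81, 22, 6                      # layer sizes from the character example
rng = np.random.default_rng(0)
W1 = rng.uniform(-0.5, 0.5, (n + 1, m))  # weights w1ki, row 0 belongs to the bias x0 = 1
W2 = rng.uniform(-0.5, 0.5, (m + 1, l))  # weights w2ij, row 0 belongs to the bias h0 = 1

x = rng.integers(0, 2, n).astype(float)  # a stand-in 81-bit input pattern

# Equation 10.1: summation at each hidden unit hi = sum over k of xk * w1ki (k = 0..n)
hidden_sum = np.concatenate(([1.0], x)) @ W1
h = sigmoid(hidden_sum)                  # hidden activations h1..hm

# Equation 10.2: summation at each output unit oj = sum over i of hi * w2ij (i = 0..m)
output_sum = np.concatenate(([1.0], h)) @ W2
o = sigmoid(output_sum)                  # output activations o1..ol

print(h.shape, o.shape)                  # (22,), (6,)

How these activations are compared with the desired output and how the errors are propagated back to adjust W1 and W2 is the subject of the next module.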

