6 Other Search Methods

Bhushan Trivedi


 

Introduction

 

There are a few other search methods which we are going to explore in this module. These methods are gaining popularity in the current age. We will soon see how they differ from conventional heuristic search methods and why they are gaining acceptance in industry. As these methods are extensions of earlier methods, we will discuss them only briefly.

 

Variable neighborhood descent

 

One important attribute of a search algorithm is how it finds the neighbors of a node in order to progress further. The function which finds the neighboring nodes is called a neighborhood function. In trivial cases, the neighbors are found by applying all possible rules, but in complex cases many rules may be applicable at once. Also, when we search the space using perturbation search, we may generate many neighbors depending on the criteria that we choose. The problems that we have explored so far have very few alternatives, and we had no trouble exploring all of them. That, unfortunately, is not true for most real-world problems. Considering the branching factor, it is always a good option to explore only the more promising options and probe deeper, rather than explore all nodes and stop at a shallow level.

 

In games like chess, a function which generates plausible moves is often used. Such a function is required because the number of moves from any given state may be very large, but most of those moves do not make much sense. When a human player plays chess, he usually considers only a few moves and ignores most others. This function (usually written as MoveGen()) generates the plausible neighbors; that is, it generates only those moves which are better than the rest. A hill climbing process where a lower heuristic value is considered better than the previous one is known as valley descending, and choosing the next move in such a process is known as neighborhood descent. One interesting option is to use different functions for generating neighbors at different stages of the search process. Sometimes the same function with a different set of parameters might be chosen. Either way, we obtain different neighbors based on different functions or on different parameters passed to the same function. Such functions are used for variable neighborhood descent, which can generate fewer moves in the beginning and more of them at later stages, or when the game enters a critical stage. For example, in the initial run it might take the top 5 moves, but at later stages it might select the top 10 moves. When multiple functions are used, a function which generates a small number of neighbors is known as a sparse neighborhood function, while a function which generates a large number of neighbors is known as a dense neighborhood function.
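As a minimal sketch of this idea (the helpers legal_moves() and evaluate() are hypothetical stand-ins for domain knowledge, not functions from the text), a plausible-move generator can simply keep the k best-scoring moves, with k chosen per stage:

```python
import heapq

def plausible_moves(state, k, legal_moves, evaluate):
    # `legal_moves(state)` enumerates every legal move and
    # `evaluate(state, move)` scores the position a move leads to
    # (higher is better) -- both are assumed domain helpers.
    # Small k gives a sparse neighborhood, large k a dense one.
    return heapq.nlargest(k, legal_moves(state),
                          key=lambda move: evaluate(state, move))
```

Calling it with k = 5 in the opening and k = 10 in a critical stage gives exactly the sparse and dense behavior described above.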

 

In figure 6.1, the situation after the initial MoveGen() call is shown. The function shows only two out of 20 possible moves (each of the 8 pawns can move either one or two squares forward, and each knight has two moves). The advantage of using only two moves in such a situation is to avoid unnecessary calculation of moves, most of which are worthless or have a similar effect to the chosen ones. In fact, it prunes not only the calculation of the other 18 moves but the entire subtrees with those moves as roots. For example, if the function can proceed for 3 plies (levels) in the given time, and at each ply the average number of nodes to be expanded is, say, 20, then 18*20*20 = 7200 nodes are saved from expansion. If the program goes one more ply deep, the saving becomes 144,000 nodes. Two more plies and the saving grows to a whopping 57,600,000 nodes. That is why a sparse neighborhood function is used in such situations.

 

In cases where the king is under attack, or a heavy loss is likely (losing a queen or rook without any reward, for example), one may think of using a dense neighborhood function. In such situations, the MoveGen() function might generate more moves, especially when the king is under attack. Such a process is variable neighborhood descent. One can easily see that this process also mimics human behavior: when a player realizes that the game has entered a critical stage, he becomes more attentive and plays more cautiously. He might explore more options than usual, taking more time and also looking at more moves or deeper lines of moves.

 

 

Consider the case of the missionaries and cannibals problem, and consider a solution space of random sequences of moves. Suppose we pick a random sequence which is not a solution. Now we may design a neighborhood function which generates a new state by replacing the incorrect move with another move. This may generate a few alternatives. Another neighborhood function may generate an entirely new subsequence from the erring move onwards. We may append one such subsequence at the erring move and thus generate a new sequence for each such subsequence. This might generate a substantial number of alternatives. Clearly, the first function generates far fewer alternatives than the second, and is thus known as a sparse neighborhood function. The second is, obviously, called a dense neighborhood function.
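The contrast between the two functions is easy to see in code. Here is a minimal sketch, assuming the sequence is a list of moves, `i` is the index of the erring move, and `moves` is the set of legal moves (all names hypothetical):

```python
import itertools

def sparse_neighbors(sequence, i, moves):
    # Replace only the erring move at index i with each alternative:
    # at most len(moves) - 1 new sequences.
    return [sequence[:i] + [m] + sequence[i + 1:]
            for m in moves if m != sequence[i]]

def dense_neighbors(sequence, i, moves, tail_len):
    # Replace everything from index i onwards with every possible
    # new subsequence of length tail_len: len(moves) ** tail_len
    # new sequences -- far more alternatives.
    return [sequence[:i] + list(tail)
            for tail in itertools.product(moves, repeat=tail_len)]
```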

 

It is easy to understand that if we use a sparse neighborhood function, we are not looking at some of the possible neighbors and are thus ignoring them. That might lead us to a solution which is not optimal from the global point of view (that means we end up at a local maximum). Why? There may be better nodes around, but we are not connecting to them. For example, if we look only at neighbors produced by changing the erring move, we may not find a state with a better heuristic value and thus land at a local maximum. In chess too, if we consider only the top 5 moves, it is possible that the 6th move, which does not look all that good right now, might be a far better one if we probe more levels (called plies in game playing parlance). It might even prove to be the best move! As we are not exploring it, we are going to miss that best move.

 

If we use a dense neighborhood function, we need to process many more neighbors and thus we need more time. If we use a sparse function, we may end up at a local maximum. The better solution is to use different neighborhood functions at different times. The algorithm which does so is known as Variable Neighborhood Descent. The idea of variable neighborhood descent is to choose the neighborhood function based on need. Usually, the initial part involves a sparse neighborhood function, while the later part, or the point at which the game enters a critical stage, involves a dense one.
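A compact sketch of variable neighborhood descent might look as follows (the neighborhood functions, ordered sparse to dense, and the heuristic `value` are assumed to be supplied by the designer; lower values are better because we are descending):

```python
def variable_neighborhood_descent(start, neighborhood_fns, value):
    # `neighborhood_fns` is a list of neighborhood functions ordered
    # sparse to dense; `value` is the heuristic, lower is better.
    current, k = start, 0
    while k < len(neighborhood_fns):
        best = min(neighborhood_fns[k](current), key=value,
                   default=current)
        if value(best) < value(current):
            current, k = best, 0   # improved: back to the sparse one
        else:
            k += 1                 # stuck: try a denser neighborhood
    return current
```

The key design choice is the reset to the sparsest function after every improvement: dense neighborhoods are consulted only when the cheap ones are exhausted.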

 

Thus variable neighborhood descent is a variant of hill climbing where the neighborhood function differs at different stages. It treats the lowest value, rather than the highest, as the best, so it is actually seeking the state with the minimum and not the maximum value; instead of climbing, it is descending. And it looks at neighbors in a variable way, a variable neighborhood. Hence the name.

 

Beam search

 

So far, we have explored a single path to the solution in all the methods discussed. That means, at any given point of time, we are exploring only the best node. It can be a good idea to explore more than one path at the same time; that means exploring the best and second best nodes together, or the top 5 nodes together. Considering that current state-of-the-art computing devices have multiple cores, this can be quite an attractive option. For example, if we explore two paths at a time, we can execute them in parallel on a dual core machine, or four paths at a time on a quad core machine. Whenever one path yields a solution, the search stops.

 

When we explore multiple paths at the same time, we are exploring a beam of multiple paths. That is why this search is known as a beam search.

 

 

Beam search is especially useful when backtracking is not possible. In a two node beam search, whenever one path leads to a dead end, the other path is still available and we are not stuck; we may pick the two most promising children of the surviving path and continue as a two node beam search once again. If we have a three node beam search, even when two paths lead to dead ends, the third path can still save the day. Thus, the bigger the beam, the more tolerant the search process is. Figure 6.2 describes the two node beam search for the 8, 4, 3 gallon milk jug problem. At each level, the two most promising nodes based on their heuristic values are expanded further. At the second level we only have two children, so both of them are expanded. For each child at the second level, the best of their descendants are explored continuously until the end, which finds two solutions. In more complicated cases, one of the paths may lead to a dead end. The heuristic values are not actually shown in the diagram, but the best nodes are depicted as shaded nodes.
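A minimal sketch of beam search follows, assuming hypothetical domain helpers successors(), a heuristic h() where lower is better, and is_goal(), none of which are from the original text:

```python
import heapq

def beam_search(start, successors, h, is_goal, width=2):
    # Keep only the `width` most promising nodes at each level.
    beam = [start]
    while beam:
        for state in beam:
            if is_goal(state):
                return state
        candidates = [s for state in beam for s in successors(state)]
        beam = heapq.nsmallest(width, candidates, key=h)
    return None  # every path in the beam hit a dead end
```

Setting width to 1 reduces this to plain greedy descent; a larger width buys the dead-end tolerance described above at the cost of more expansions per level.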

 

There are some real-world situations which demand such processing. For example, consider a speech to a multinational audience being translated into other languages. The speaker is speaking his native language (for example Hindi), which is translated by an automatic translation program into another language (for example French). This program continuously listens to what the speaker says and immediately translates it for the benefit of the audience members who do not understand the speaker's language.

 

The translator program waits till the speaker finishes his sentence. As soon as the speaker completes the sentence, it starts playing the translation. While listening to the presenter's speech, it continues to map the new words into its understanding and generates a feasible translation. It is quite possible that when the speaker commences a new statement with a word, there may be a few ways the translation can go. With each subsequent word spoken, those initial interpretations might be reinforced, weakened, or become altogether meaningless.

 

What if the translator program picks the first interpretation and goes ahead? The program has no option of going back if the first interpretation turns out to be incorrect. It is better to start with all possible translation options and pick the best one when the translated message is to be relayed. For example, suppose that when the translator program starts translating, it can think of two possible translations. It continues to generate output text based on both possibilities. After a few words, the speaker utters something which invalidates one of the options; that option no longer remains valid and should not be considered further. Had we started with that interpretation alone, we would be stuck and unable to proceed. If we use a beam search, the other alternative is still open and we can proceed further. The more the options, the better the chances of avoiding a dead end.

 

Thus beam search is an attractive option for many such cases.

 

Tabu search

 

Tabu search is a solution to the local maxima problem of hill climbing. The search proceeds exactly the same way as hill climbing until a local maximum is reached. At that point, the search strategy is changed and the surrounding area is explored further without using the heuristic function.

 

Once stuck at a local maximum, Tabu search deploys some other criterion for assessment. Let us take the same missionaries and cannibals problem, and a case where we are producing complete sequences, testing them for solutions, and, if a sequence is not a solution, modifying it in a way that brings us nearer to the solution. Suppose we are stuck at the sequence 3,3,0,0,L … 1,1,2,2,R … 2,2,1,1,L … 0,0,3,3,R. All moves from this point onwards have a lower heuristic value. We are at a local maximum, and if we only use the heuristic function we cannot proceed further. In Tabu search, we may decide to use some criterion other than the heuristic function to go ahead.

 

One alternative criterion is to replace a move in the sequence which has not been changed for a long time. That means the next move changes the value in the sequence which has remained unchanged the longest. For example, the value 2,2,1,1,L has not been changed for a while, so we change it now. Another strategy is to remember some recent moves and avoid using them again. For example, we may remember only the last three changes and adopt the rule that values changed during the last three moves are not changed again. Once the local maximum is reached, only such a chosen move in the sequence is changed and no other move is allowed.
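A minimal sketch of this recency-based strategy follows (the neighbor generator and heuristic are hypothetical helpers, not from the text; a `tenure` of 3 mirrors the "last three changes" rule above):

```python
from collections import deque

def tabu_search(start, neighbors, value, tenure=3, max_steps=1000):
    # `neighbors(seq)` is assumed to yield (changed_index, new_seq)
    # pairs and `value` is the heuristic (higher is better).
    # Positions changed during the last `tenure` steps are tabu.
    current = best = start
    tabu = deque(maxlen=tenure)
    for _ in range(max_steps):
        allowed = [(i, s) for i, s in neighbors(current)
                   if i not in tabu]
        if not allowed:
            break
        i, current = max(allowed, key=lambda pair: value(pair[1]))
        tabu.append(i)                 # this position is now tabu
        if value(current) > value(best):
            best = current
    return best
```

Note that the best move among the allowed ones is taken even if it is worse than the current state; that is what lets the search walk off a local maximum.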

 

One more interesting idea, often used by researchers and problem solvers, is to look for states which are rare. If you have looked closely at the water and milk jug problems, you have probably noticed that states in which some vessel holds an amount of fluid not equal to its carrying capacity, for example 1 gallon in a 3 gallon jug or 2 gallons in a 3 gallon jug, are better stepping stones to the final state. Such states, looking at all possible states, are rare.

 

Simulated annealing

 

This is another extension of hill climbing. The process is an excellent example of innovation in one domain taking inspiration from another domain. Simulated annealing, a search process used to solve AI problems, is based on the physical annealing process carried out in metallurgy.

 

The local maxima problem in hill climbing is addressed using many methods. One of them is to allow some randomness in the movement. The strategy is to allow a 'bad' move with some probability. If bad moves are allowed with the same probability as better moves, the process selects moves almost at random; that process is sometimes known as a random walk. If the probability of accepting bad moves is zero, the process becomes hill climbing. Simulated annealing works between these two extremes. The idea is to allow random moves with a large probability in the beginning and almost disallow them at the final stages; in the middle stages, bad moves are allowed with a probability that decreases towards the end. By virtue of accepting occasional worse moves, the process can escape from a local optimum where all immediate moves are worse than the current state. As the probability decreases, bad moves are selected less and less often. As long as the initial part is not over, moves which lead to a worse heuristic value than the parent are randomly allowed, to escape from local maxima; as the search nears the solution, bad moves are allowed less and less.

 

This process is quite analogous to moving a small ball down a valley when there are many small pits along the path. If we allow the ball to roll on its own, it might settle in one of the pits and never reach the bottom of the valley. What is the solution? Provide enough momentum to the ball that it jumps over the small pits, and reduce the momentum when it is near the bottom so that it cannot escape from the bottom and does not settle somewhere higher up. It might still do so, but if we have remembered the lowest point reached so far, we can go back there. Simulated annealing works exactly like that.

 

As per our discussion earlier, simulated annealing is based on two important principles.

  1. A bad move is allowed with some probability. The probability changes over time, reducing as the search nears the goal state.
  2. The best-so-far node is also kept, so that if the final state is not the best at the end, we can roll back to that state.

 

Simulated annealing, as mentioned before, is drawn from a source quite unrelated to computer science, let alone AI. In metallurgy, when new alloys are made, they are melted at a high temperature and allowed to cool down in a controlled manner to maximize the crystal size of the solid metal formed at the end of the process. The final state is the minimum energy state. The idea is to obtain the solid state with the lowest possible energy value, as that is the most robust form.

 

The interesting part of the process is controlling the cooling. If the temperature is allowed to reduce too fast, the resultant form is a high energy state which is brittle. If the temperature is allowed to reduce very slowly, a lot of time is wasted. In practice, various schedules are tested physically to arrive at the optimum schedule, one which reduces the temperature as fast as possible without compromising the quality of the resulting metal. Usually there is a threshold value down to which the rate of reduction does not make any difference; once the threshold is reached, the temperature must be reduced with caution. The rate at which the temperature is reduced is known as the annealing schedule. This schedule is found empirically, as there is no formula to govern the rate of reduction. Obviously, the schedule depends on the type of metal.

 

The simulated annealing search process uses the same terminology as the physical annealing process. The temperature is the value which decides the amount of randomness in allowing bad moves (moves from lower energy states to higher energy states). The process here uses an objective function instead of a heuristic function; the only technical difference is that a lower value is better here. Thus this process is valley descending rather than hill climbing.

 

The process may be described as follows:

  1. Pick the start state and make it current.
  2. If it is a goal state, quit. Otherwise set BestSoFar = current state.
  3. Pick the temperature value T from the annealing schedule.
  4. Apply the next rule (called an operator¹ in simulated annealing parlance). If no operator (rule) is left, return with the BestSoFar state.
  5. Apply the objective function to the new state to find ObjFunVal(NewState).
  6. If this value is better than the objective function value of the current state, ObjFunVal(CurrentState), make the new state the current state and also set BestSoFar = NewState.
  7. If ObjFunVal(NewState) is not better than ObjFunVal(CurrentState):
    1. Generate a random number R between 0 and 1.
    2. ∆E = ObjFunVal(NewState) – ObjFunVal(CurrentState)
    3. Calculate the value p = e^(–∆E/T).
    4. If R is less than p, make the new state the current state. (This allows a random move to a poorer state with probability p.)
  8. Go back to 3.
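The steps above can be sketched in code as follows (the successor generator, the objective function and the schedule are assumed domain helpers, not part of the original text; a lower objective value is better, so ∆E < 0 means an improving move):

```python
import math
import random

def simulated_annealing(start, successors, obj, schedule):
    # `successors(state)` lists the states reachable by applying the
    # operators (rules), `obj` is the objective function (lower is
    # better) and `schedule` yields temperature values T.
    current = best_so_far = start
    for T in schedule:                       # step 3
        moves = successors(current)
        if not moves:                        # no operator left: step 4
            break
        new = random.choice(moves)           # apply a rule
        delta_e = obj(new) - obj(current)    # step 7.2
        if delta_e < 0:                      # better move: step 6
            current = new
            if obj(current) < obj(best_so_far):
                best_so_far = current
        elif random.random() < math.exp(-delta_e / T):  # steps 7.1-7.4
            current = new                    # worse move accepted with p
    return best_so_far
```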

 

A few terms in the discussion above need explanation. First is the objective function. The objective function is the same as the heuristic function and is decided by the designer based on domain knowledge. ObjFunVal is a call to this objective function which yields its value for the state passed to it. T is still called the temperature, but it is (obviously) not an actual temperature value. It is decided by the designer based on his own intuition and judgment and, like the T values in physical annealing, is decided empirically; that means trying different annealing schedules and picking the one which returns the best answers. ∆E in physical annealing is the change in energy, which is converted here into the change in the value of the objective function. This value ∆E has a say in determining the probability of moving to a worse state. When we want to allow a worse move with some probability p, one good way of achieving it is by generating a random number between 0 and 1. Assuming all values between 0 and 1 are equally likely, the probability of the generated value lying between 0 and p is exactly p. We use that property in our test: the policy of accepting when the generated random number is less than p and rejecting otherwise allows, on average, p×100 moves out of every 100 to be accepted, and thus does exactly what we expect it to.

 

The simulated annealing process described is a superset of hill climbing. If we set the value of p to 0, no move to a worse state is allowed and the process becomes pure hill climbing.

 

We haven't said much about the annealing schedule so far. The annealing schedule determines how many worse moves are to be allowed at a given time. The temperature value used in determining the probability p is stored in a table with three components: the first is the initial value of the temperature, the second is when to reduce it, and the third is by how much to reduce it. The best way to decide the annealing schedule is to try a few options, look at the quality of each solution and also the speed at which it is found, and use the best annealing schedule for the actual work.
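As a small illustration, geometric cooling is one common way to realize such a table; the parameter values below are placeholders, not taken from the text:

```python
def make_schedule(t_initial=100.0, t_min=0.1, alpha=0.9):
    # Initial value (t_initial), when to stop reducing (t_min) and
    # how much to reduce each time (multiply by alpha) -- the three
    # components of the schedule table described above.
    t = t_initial
    while t > t_min:
        yield t
        t *= alpha
```

Such a generator can be passed directly as the `schedule` argument of the simulated annealing sketch given earlier.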

 

¹ Those who use the term objective function for the heuristic function are more prone to use the term operator to describe what their counterparts call rules.
