14 Admissibility of A*, Agendas and AND-OR graphs
Bhushan Trivedi
Introduction
We have seen that A* is an algorithm for graph traversal that finds an optimum path from a source to a destination using heuristics. A* decides the best path using two values for any intermediate node: the cost from the root node and the estimated cost to a goal node. As A* moves from the initial node towards the destination, it also backtracks, updates its estimates, and changes the path if the current path no longer seems optimum. Since h’ is merely an estimate of the actual cost h, we were not sure whether the path chosen on the basis of h’ is really optimal. We would like to learn whether it is possible to say something about the optimality of the solution that A* produces. In this module we will study the admissibility of A*, which guarantees the optimality of that solution. We will also see the effect of both components of the calculation, g and h’. Finally, we will look at two important types of problems, one based on agendas and the other on AND-OR graphs.
Admissibility
The word admissibility describes the ability of an algorithm to find an optimal solution. If an algorithm is admissible, the solution it finds is always optimal1. Thus if we can somehow show that A* is admissible, we can conclude that the solution (the path, in the case of our previous module) is optimal. That means we can guarantee that no path exists with a smaller cost than the one A* produces.
Now the question is: how can we find out whether A* is admissible, and if its admissibility depends on some conditions, what those conditions are. Fortunately, mathematicians have done some work in that area and concluded that if the heuristic A* uses is itself admissible, then A* is admissible. The heuristic function is admissible if the condition described in Equation 14.1 is satisfied for all nodes x.

h’(x) <= h(x), ∀ x ∈ {set of all nodes} ———————— (EQ 14.1)
In other words, when h’ consistently underestimates the value of h(x), we can guarantee that A* will eventually find an optimal solution. Sounds interesting! Let us try to understand.
This discussion extends what we discussed at the end of the previous module. When we underestimate, we may assume that the goal node is 3 units away when it is actually 4, 5, or 6 units away. On the contrary, when we overestimate, we might assume that the goal node is 15 units away when it is actually found at the 10th. The above condition states that if there is no case of overestimation, we can guarantee the optimal solution. In case of underestimation, the goal node is further than our estimate, so when we update a node’s value and back-propagate, the updated values can only grow towards the true cost; there is no chance that we have missed an optimal route. Suppose we start with two options estimated at 5 and 7, and we pick 5, whose actual cost is 9. As we go along, at some point its value will exceed 7, and we will backtrack and start exploring the path with estimate 7. What if the path through 5 is actually 6? No problem: that is the optimum path, and it is always going to be better than the path estimated at 7, whose actual cost, by underestimation, is at least 7. What if the path estimated at 7 is actually 11? As soon as its value exceeds 9, we will come back, resume exploring the earlier path, and again get the optimum answer.
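The backtracking behaviour described above can be sketched in a minimal A* implementation. The graph and heuristic values below are hypothetical, chosen to mirror the 5-versus-7 discussion: the branch that initially looks cheaper turns out costly, and the search switches back and still returns the optimum.

```python
import heapq

def a_star(graph, h, start, goal):
    """Minimal A* over an OR graph.

    graph: dict mapping node -> list of (neighbour, arc_cost)
    h: dict mapping node -> heuristic estimate h'(node)
    Returns (path, cost) of the cheapest path found.
    """
    # Frontier ordered by f = g + h'
    open_list = [(h[start], 0, start, [start])]
    best_g = {start: 0}
    while open_list:
        f, g, node, path = heapq.heappop(open_list)
        if node == goal:
            return path, g
        for nbr, cost in graph[node]:
            new_g = g + cost
            if new_g < best_g.get(nbr, float("inf")):
                best_g[nbr] = new_g
                heapq.heappush(open_list, (new_g + h[nbr], new_g, nbr, path + [nbr]))
    return None, float("inf")

# Hypothetical graph: the branch through B looks better at first
# (smaller h'), but its true cost is higher, so A* backtracks to C.
graph = {
    "A": [("B", 1), ("C", 1)],
    "B": [("D", 8)],   # the underestimated branch turns out costly
    "C": [("D", 6)],
    "D": [],
}
h = {"A": 5, "B": 5, "C": 6, "D": 0}  # h' never overestimates the true cost
path, cost = a_star(graph, h, "A", "D")
print(path, cost)   # ['A', 'C', 'D'] 7
```

Because every h’ value here underestimates the true remaining cost, the heuristic is admissible and the returned path is guaranteed optimal.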
What if the heuristic function overestimates? Suppose we assume that the goal node is 4 nodes away from one child and 5 nodes away from another. We pick the one 4 nodes away and get an answer in two steps. It is quite possible that the other child, which seems 5 nodes away from the goal, is actually just one hop away, and we miss that! Back-propagation won’t help!
Look at figures 14.1 and 14.2 carefully. Let us take the case depicted in figure 14.1, h’ overestimating h. Node A has two choices, B and C. When the heuristic function is applied (we assume g as 1 for each arc), the path through C (of weight 7 + 1) seems longer than the path through B (of weight 5 + 1). Thus A chooses to explore B. The only child of B is D, which claims to be 4 hops away. With g value 2, the total comes to 6, which is still better than C, so it is expanded next. The only child E has heuristic value 3, which added to the g value of 3 again gives 6, better than C, so it is expanded next. This yields G, and a suboptimal path to the goal is derived.
- 1 There are other requirements, but they are not as important as the one we discuss here.
The case of h’ underestimating h is described in figure 14.2. A is expanded with C and B as its two children, as in the previous case. Here, however, the values are underestimated and the actual goal node is further away than the estimate. B’s total (4 + 1 = 5) comes out to be less than C’s (5 + 1 = 6), so as before node B is chosen as the successor and the search continues. As this is a case of underestimation, we soon find the totals increasing beyond the estimates. At some point, in the example depicted in 14.2, when we explore E we learn that its total is 4 + 3 = 7. When the values are back-propagated, the path from A through C now looks better. As we progress further, since this too is a case of underestimation, the estimates do not decrease; as long as the total is less than 7 we continue to explore that branch. When we reach K and find it to be 7, we may go back to H and explore it. Then O is explored and yields the value 8, so we return to K and get G. Thus we still get the optimum result.
Thus we can conclude that A* is admissible if the condition described in EQ 14.1 holds.
So far so good; now we know that we need a heuristic function that always underestimates the actual value. How do we get such a guaranteed heuristic function? Good question. For some heuristic functions it is possible to state clearly whether the function is admissible. For example, let us take one more problem, called the 8 puzzle, and try to find one heuristic which works in that case.
The problem is described in figure 14.3. There are 8 square tiles in a 3×3 matrix-like structure. A tile can be moved to an adjacent empty place. Of the 9 possible places, one is empty and the rest are occupied by tiles numbered 1 to 8. The problem is to take one given arrangement and convert it into another by moving tiles one after another.
1 | 2 | 3 |
4 | 6 | 8 |
5 | 7 |
Initial State
1 | 2 | 3 |
4 | 5 | 6 |
7 | 8 |
Final State
Figure 14.3: the 8 puzzle problem
One such problem is described in figure 14.3. We have an initial arrangement and we need to convert it to the final state shown in the same figure. We need to generate a series of moves that converts the initial state into the final state. We could describe this problem by elaborating the state space and so on, but let us keep that as an exercise. If you have any trouble finding the heuristic or a solution to this problem, Nils Nilsson’s book “Principles of Artificial Intelligence” is a good resource.
Anyway, let us take one example of a heuristic: ‘the number of tiles which are not in the position they should be’. In the final node all tiles are where they should be, so for the final node the value of the heuristic is 0. For any other node it is greater than zero. We can take this as the heuristic function and its value as the heuristic measure of the node. The function bears some positive value, and we would like to reduce it to zero by choosing moves whose heuristic value is less than the parent’s.
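The heuristic above is simple to compute. A minimal sketch follows; the states are written as 9-element tuples read row by row, with the blank represented as 0 (a representation chosen here for illustration).

```python
def misplaced_tiles(state, goal):
    """Count tiles out of place, ignoring the blank (0).

    This is the heuristic from the text: it is 0 only at the goal and
    never exceeds the number of moves still required, so it is admissible.
    """
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

# States from figure 14.3, blank written as 0
initial = (1, 2, 3, 4, 6, 8, 5, 7, 0)
goal    = (1, 2, 3, 4, 5, 6, 7, 8, 0)
print(misplaced_tiles(initial, goal))   # 4 tiles out of place
```

Since each legal move repositions exactly one tile, the true number of remaining moves can never be smaller than this count, which is the admissibility argument made below.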
Now let us check whether the number of moves needed can ever be less than this heuristic estimate. It cannot: we can only move one tile at a time, so even if everything else is perfectly placed and we need only move one out-of-place tile to its right place, that takes at least one move. For example, take the 4 out-of-place tiles in figure 14.3; if you look at the problem carefully, you can see that the number of moves actually needed is much more. In general we need at least as many moves as there are out-of-place tiles, and usually more. Thus this function is guaranteed to be admissible.
Unfortunately, we cannot say the same about all such functions. We cannot pick up an arbitrary function and declare it admissible. In fact, that is not the only problem we will face. There is one more.
Though the above function is admissible, it is basically a local function (remember our discussion of local and global heuristic functions in module 7), and it sometimes gets stuck in local minima. (It is a minimum here, as the heuristic function is decreasing, not increasing.) Take the case depicted in figure 14.4. The value beneath each 8 puzzle structure indicates the number of out-of-place tiles, i.e. the value of the heuristic function under consideration. You can easily see that the search is not progressing smoothly. In the beginning we have a plateau-like situation where all children score the same, barring one which is worse (greater). The same situation continues until we find one node with a better (smaller) heuristic value. Unfortunately the joy is short-lived: every child (barring the parent, since from any node you can always go back to the parent) has a heuristic value higher than the parent, and we are stuck in a local minimum. The Nilsson book mentioned earlier describes the problem in more depth and also provides a global heuristic function for it, which we do not discuss here. In fact, plain vanilla breadth first search is also admissible; it is of little use though.
The bottom line is that even if we get a function which is admissible, it might not have other desirable properties. What should we do in this case? There is a corollary of the admissibility principle which can help. It is stated like this.

If h’ overestimates h only rarely, and by at most a small margin δ, then the solution we receive can differ from the optimal by at most δ in those rare cases.
This can be of some help: heuristic functions which are not admissible but are close to it can also be used, chosen on the basis of their other characteristics. We will not discuss the issue further. Our only advice to designers is that they may well look for an admissible function, but they also need to worry about other, more important characteristics. Researchers in many areas have done a lot of work finding the right heuristics for the problems they work on.
What if we change the value of g or h’? How does the behaviour of the algorithm change? Let us explore.
The effect of g
The introduction of g does something we did not mention earlier: it adds one more dimension to the process of searching for a solution. In earlier cases our choice of the next node to expand was based solely on the heuristic value. The introduction of g indicates that we are interested in an optimal path not from the current node but right from the root node. Thus we consider two things: the actual cost of travelling to the current node (g) and the estimated cost of travelling from the current node to the goal (h’). If we are not interested in the path and only want a solution as fast as possible, the g value is kept at 0; then we always expand the node that appears nearest to the goal. For our graph search problem we are interested in an optimal path, and thus A* with g as the distance of the arc is an ideal solution. Suppose, however, that we want a single goal node with a specific characteristic and do not need the sequence of nodes leading to it; then using a g other than zero might not be the best choice. For example, in networking there are cases where any one of several servers should respond to a client’s request. The client’s packet starts from the client’s machine and stops when it reaches any server that satisfies the query and responds2. This process is known as anycasting. Here there is no point in finding the complete path, and the value of g can be kept at 0. Another example is searching for a destination while driving. An automated car driving to the destination cannot jump from one path to another; it has to pick the best path from wherever it is, and keeping g nonzero serves no purpose there. On the contrary, when we have not yet started driving and ask the system for the best path to the destination, the system can use the real g value, backtrack if need be, find the optimal path, and present it to us so that we can start travelling on it.
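The difference between keeping g and dropping it can be sketched as follows. The road map and h’ values are made up for illustration: with g ignored (greedy search) the node that merely looks closest to the goal is always expanded, which finds a goal quickly but along a longer road; with g included, the arc costs are respected and the cheaper route wins.

```python
import heapq

def best_first(graph, h, start, goal, use_g=True):
    """Search where f = g + h' when use_g is True, else f = h' (greedy).

    Setting use_g=False mirrors keeping g at 0: faster to a goal,
    but with no optimality guarantee on the returned path.
    """
    frontier = [(h[start], 0, start, [start])]
    seen = set()
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if node in seen:
            continue
        seen.add(node)
        for nbr, cost in graph[node]:
            new_g = g + cost
            f = new_g + h[nbr] if use_g else h[nbr]
            heapq.heappush(frontier, (f, new_g, nbr, path + [nbr]))
    return None, float("inf")

# Hypothetical road map: A looks closer to the goal than B, but the
# road from A to G is long; the g-aware search prefers the route via B.
graph = {
    "S": [("A", 1), ("B", 4)],
    "A": [("G", 12)],
    "B": [("G", 2)],
    "G": [],
}
h = {"S": 5, "A": 1, "B": 2, "G": 0}
print(best_first(graph, h, "S", "G", use_g=False))  # greedy: via A, cost 13
print(best_first(graph, h, "S", "G", use_g=True))   # optimal: via B, cost 6
```

This mirrors the driving example: the greedy variant is what a car already on the road must do, while the g-aware variant is what a planner can do before the journey starts.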
The second type of problem solving is called planning and is quite useful in many AI problems. We will look at planning in module 15 and 16.
We kept g as the constant 1 in the beginning. This is also a common notion in network routing, where every hop counts equally; if A* is used in a similar problem, keeping g at 1 is a good solution. Sometimes it is better to use g as a real value representing the cost of the arc. For example, if an automated taxi has to find an optimal path to a destination, it cannot consider every arc to be of cost 1. It should take into account many things: the distance the arc covers, the density of traffic at the time of travel, one-way lanes, the width and condition of the roads, the weather, information about waterlogging in the monsoon, and any information about roads under construction or repair. All of these count in calculating the cost of each arc, so the g value calculation itself becomes a challenge. Also, the next best node may be far from the current best node, and it is hard for an automated taxi to move there in real time when a heuristic update arrives. A taxi can use A* for planning the tour, where backtracking is feasible, but not while running; while driving it has to find the best path from where it currently is, ignoring the g value.
- 2 For example, the client wants to learn the time and tries to contact a time server. It is not important how the packet reaches the time server; it is only important that it can reach there and come back with the current time value.
The effect of h’
The value h’ is an estimate of the distance from the node to the goal. If the estimate is perfect, the best node we choose is actually the best; we never need to look back, and we reach the goal node in a straightforward manner. What if h’ is 0? Then we choose the node solely on g, the cost of reaching the current node. If the g value is kept constant at 1, this is nothing but breadth first search. What if both g and h’ are zero? We have nothing to guide us; we choose any node at random and test it for a solution, and the search becomes a random walk. This is nothing but generate and test. In all other cases, the exactness with which h’ estimates h determines the quality of the search.
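The claim that h’ = 0 with unit arc costs reduces A* to breadth first search can be checked directly. The sketch below (on a made-up unit-cost graph) runs both searches side by side; with nothing to guide it, the f = g frontier expands nodes level by level, exactly like BFS.

```python
import heapq
from collections import deque

def a_star_zero_h(graph, start, goal):
    """A* with h' = 0 and unit arc costs: f = g = depth."""
    frontier = [(0, start, [start])]
    seen = set()
    while frontier:
        depth, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nbr in graph[node]:
            heapq.heappush(frontier, (depth + 1, nbr, path + [nbr]))
    return None

def bfs(graph, start, goal):
    """Plain breadth first search for comparison."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nbr in graph[path[-1]]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(path + [nbr])
    return None

# Hypothetical unit-cost graph: both searches return a shortest-hop path.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": ["G"], "E": ["G"], "G": []}
print(a_star_zero_h(graph, "A", "G"))
print(bfs(graph, "A", "G"))
```

Both calls return a path with the minimum number of hops (three arcs here), illustrating that the degenerate A* and BFS explore the graph in the same order.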
Another point: do we have anything to say about the data structures to be used for these nodes? Currently, Java and C++ are commonly used to implement A*, and object oriented representations take precedence over other methods.
Agenda Driven Search
A* is quite useful in real world applications, but only for a class of applications. In all the search algorithms we have studied so far, multiple arcs terminating in a single node do not make any difference. For example, in the 8-5-3 gallon milk jug problem, the state (6,2,0) derives from two states, (5,2,1) and (6,0,2). Does that make any difference compared with a state derived from only one, or from three, different states? No, absolutely not.
In a typical class of applications, however, this does make a difference. Let us take the example of a fault finding system: an automated electrician. The problem is that a bulb is not turning on. This is reported to the automated electrician, which looks for the cause. The electrician (the program) may start with “bulb not turning on” as the root node and all possible reasons as its children, assuming all nodes are designed that way. For each child, we again write the possible causes as children. At some point we have a child “current is absent” under the parent “bulb not turning on”. Interestingly, another node, “current is not present at intermediary junction”, also has the child “current is absent”. Does it make a difference? It does. The likelihood of the current being absent is much higher now: the possibility of an absent current when it does not arrive at the socket is much greater when it is also observed that the current is not present at the intermediary junction. Figure 14.4 depicts the idea.
Let us take another example. We are translating a statement and are interested in the meaning of a particular word, ‘bank’. We are confused, as the word may indicate a financial institution or a river bank. If something in the statement, a parent node, indicates that it is a river bank, and another part of the statement also indicates a river bank, it is more likely to be a river bank than a financial institution. Here too, multiple parents of a child make a difference.
Another interesting example is from the medical domain (which is also a fault finding domain). A patient comes with a typical set of symptoms, which forms the root node for the search for the disease. Again, each child indicates one of the potential causes of the parent. Suppose one such node, “weakness”, has a child called “malaria”, and another node, “fever on alternate days”, also has the child “malaria”. Does it make a difference? It does. Figure 14.6 shows the case. The heuristic value of malaria is higher when two parents indicate it; it is more important for the doctor to check for malaria than for viral infection or typhoid.
Now you can easily understand that for some applications it is important to learn how many parents refer to a child. More parents indicate a higher possibility of the child being true (or a larger heuristic value). You can clearly see that the A* algorithm is not designed to consider this important point. Thus, for the class of applications described above, A* cannot be used and one must find an alternative. One such alternative is agenda driven search.
Agenda driven search picks up the topmost item on a list known as the agenda. Each item on the agenda may have multiple items as children. For example, a node ‘diagnose the patient’ turns into ‘look for the temperature’, ‘is there weakness’, ‘where is the ache’ and so on. Each node has a heuristic value or priority, and the items are stored priority-wise, just as a doctor checks for the most probable disease before the others and for the more important symptoms first. He might check the temperature first, look for weakness later, and perhaps finally ask for a blood test if he is testing for malaria. Each item may be solved directly or may be divided into multiple other items, some of which are solved while the rest remain unsolved. When the solution is reached through one child or multiple children, the process stops.
Many systems work in a similar fashion. One can pick up the task of organizing a conference as the first item on the agenda; it divides into venue management, food arrangement, participation management, and so on. Some of these expand into further nodes: a task called research paper management yields inviting papers, reviewing them, and so on, while participation management might involve deciding participation charges, facilities to be given to participants, and so on. There are nodes which have more than one parent; for example, deciding the hall capacity depends on the amount of registration, so it is a child of registration management, and it is also part of venue management. Above all, we have to set priorities for jobs and do the highest priority job first; booking the hall for the dates of the conference, for example, has the highest priority and should be at the top of the agenda. Agenda driven systems also need to specify how a task is divided into multiple subtasks, how each subtask is divided further, and so on. Conventional A* does not provide these facilities; other solutions must be sought.
The priority of a task also changes over time, like the A* estimates. The important difference is that multiple references add different weights to the priority, and generally a weighted summation over all parents is used to decide the priority of a task.
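The weighted-summation idea can be sketched as follows. The class, the symptom names, and the weights are all illustrative assumptions, not part of any real diagnostic system; the point is only that, unlike in A*, a second arc pointing at the same item adds to its priority instead of being ignored.

```python
class Agenda:
    """Sketch of an agenda where several parents raise a task's priority."""

    def __init__(self):
        self.priority = {}

    def add_evidence(self, item, weight):
        # Weighted summation: every parent referring to the item
        # contributes its arc weight to the item's priority.
        self.priority[item] = self.priority.get(item, 0.0) + weight

    def next_item(self):
        # The highest-priority item is processed first.
        return max(self.priority, key=self.priority.get)

agenda = Agenda()
# Evidence from the medical example; weights are made up for illustration.
agenda.add_evidence("malaria", 0.4)   # parent: "weakness"
agenda.add_evidence("typhoid", 0.5)   # parent: "fever"
agenda.add_evidence("malaria", 0.5)   # parent: "fever on alternate days"
print(agenda.next_item())   # 'malaria' — two parents outweigh one
```

A fuller system would also divide each popped item into subtasks and push them back on the agenda, but the accumulation shown here is the feature A* lacks.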
Such problems require systems with the ability to consider multiple parents for a given child. In fact, not all arcs carry the same weight. For example, “fever” has a child “malaria”, and “blood report positive” also has the child “malaria”, but the second arc is far stronger and its weight is much greater.
Let us take up another case: “weak battery” and “lights do not turn on” are both parents of “battery needs recharge”. But a weak battery and lights not turning on are symptoms usually seen together, so having two parents does not add much value in this case. Thus the amount of value added by the second (and maybe third, and so on) parent must also be specified in the design. Obviously the problem requires many additional issues to be handled.
The AND-OR graphs
One important condition we stated as a prerequisite for applying the A* algorithm is that the solution is defined by a single node. We also said that we would elaborate on that point. A* is an algorithm to be applied when the graphs under discussion are OR graphs; that means we can pick any path to solve the problem. The alternative paths emanating from a single node are OR paths: the solution can be sought using ANY ONE of them.
Sometimes this condition does not hold: solving a parent node requires more than one child to be explored. Let us take an example from the domain of intrusion detection, where a parent node is an effect and its children are its possible causes. A denial of service attack can result from either a network protocol problem or a database corruption problem, and the database corruption problem can occur only if both the primary and the secondary servers are corrupted. Denial of service is already observed; a graph containing the reasons for denial of service, the reasons for those reasons, and so on, is a huge graph, and A* would be applied to find the least cost or most probable reason for the problem. We start from A and have two possible causes, B and C. When we explore C, we are in a typical situation: only if D and E are both true is C true. So to check whether C is the cause of A, we must explore both D and E, not either of them. This is in contrast with the parent level, where to find the cause of A, either B or C is fine.
The nodes D and E form an AND arc, unlike B and C, which form OR arcs. This is a different type of graph, as it also contains AND arcs. The graphs in our earlier discussion of A* contained only OR arcs and are denoted OR graphs; the graph depicted in 14.5 is different. These types of graphs are known as AND-OR graphs. The A* algorithm cannot work here: the OR arcs are fine for A*, where we can choose the most promising path and explore it, but the AND arcs need special attention. In the above case, for example, the cost of the path from A through B can be calculated from the g + h’ of B alone, while the path through C must include the g + h’ of both children, and the cumulative value must be used for comparison. As A* is not designed to handle AND arcs, a different algorithm must be designed; the algorithm called Problem Reduction is preferred in such cases. The intrusion detection problem described above has been solved by many researchers in their own ways which, though very interesting, we shall not explore further here.
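The cumulative-cost rule for AND arcs can be sketched recursively, in the spirit of problem reduction. The graph encoding and the h’ values below are assumptions made for illustration, mirroring the intrusion detection example: an OR arc costs the arc plus its child, while an AND arc must sum the costs of all its children.

```python
def and_or_cost(node, graph, h, arc_cost=1):
    """Recursive cost estimate for a node in an AND-OR graph.

    graph maps a node to a list of arcs: ("OR", child) for an OR arc,
    or ("AND", [children]) for an AND arc whose children must ALL be
    solved. Leaf nodes fall back on their heuristic estimate h'.
    """
    arcs = graph.get(node)
    if not arcs:
        return h[node]
    best = float("inf")
    for kind, target in arcs:
        if kind == "OR":
            cost = arc_cost + and_or_cost(target, graph, h, arc_cost)
        else:  # AND arc: sum over every child
            cost = sum(arc_cost + and_or_cost(t, graph, h, arc_cost) for t in target)
        best = min(best, cost)
    return best

# A (denial of service) is explained by B (protocol problem) OR
# C (database corruption); C requires both D and E (primary AND
# secondary server corrupted). h' values are illustrative.
graph = {"A": [("OR", "B"), ("OR", "C")], "C": [("AND", ["D", "E"])]}
h = {"B": 8, "D": 1, "E": 2}
print(and_or_cost("A", graph, h))   # 6: the AND branch through C beats B
```

Here the path through B costs 1 + 8 = 9, while C costs 1 + (1 + 1) + (1 + 2) = 6, so the AND branch is preferred; plain A* would have had no way to combine D and E into one comparison.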
A* can still solve OR graph based problems very effectively, and the algorithms devised for AND-OR graphs and agendas borrow ideas from it. For example, in the case above one can treat the AND arc to D and E as a single arc to a single combined child, using the cumulative g and h’ values of both branches, and then apply A*. The difficulty arises when two AND arcs share a child, which makes the cost harder to assess.
AND-OR graphs and problem reduction are important tools for the problems that need them.
Summary
When h’ overestimates h, the designer cannot guarantee an optimal solution; when h’ consistently underestimates h, the guarantee holds. We call that property admissibility. The admissibility of the heuristic function determines whether the algorithm that uses it is admissible. Even when the function is admissible, it is important that it has other desirable characteristics, so that the search process does not get stuck in local minima.
Keeping the value of g at 0 searches in the fastest possible manner but ignores the optimal path. Keeping g at 1 means every arc has the same weight. Keeping g as a real nonzero value may be needed in cases like finding the shortest distance between two nodes. When h’ perfectly matches h, we do not need to search. When we keep h’ at zero with a constant g, we follow breadth first search.
When multiple parents make a difference to a child’s heuristic value, we need agenda driven search. When multiple children must be solved together to solve the parent, we need a search tree represented using AND-OR graphs and a search process using Problem Reduction.