Optimal search is easy
Last time we looked at how to solve the eight puzzle using the hill climbing algorithm, which gave us a result much more quickly than a blind depth-first search did, but we wondered if the solution we found was the best we could do, and we asked if there was a way to use heuristics to find not just a solution, but the best solution. Today, we’ll see that there is, and it’s actually really straightforward.
We’ll start from the greedy search algorithm, which you may remember is just the hill climbing from last time, except that instead of sorting just the new nodes according to a heuristic, we’re sorting the entire queue.
Since we’re looking for an overall shortest solution rather than a shortest solution starting from a single node, it might be worth changing our heuristic so that it doesn’t just look at the current state, but at the history as well. We’ll call this new function f, and take it to be the sum of the number of moves we’ve played to get to this state from the starting state (the accumulated cost of the path), plus the old heuristic, which in our case was the Manhattan distance:
> f = lambda path, state: len(path) + h(state)
As an aside, notice that we’re just doing len(path) here because each move has the same cost, and our heuristic function is just an estimate of the number of moves we still have to take. It doesn’t have to be like that. The algorithm is also applicable to, say, a problem where we’re trying to travel from Leuven to Amsterdam by passing through a bunch of villages, where the length of the path won’t be the number of villages we’ve passed through, but the sum of the distances between those villages. The h function could then be the straight-line distance from our current village to Amsterdam, for instance. That would work just as well.

That’s an example we won’t be working out, because I don’t feel like looking up enough places between here and Amsterdam to make that interesting. The algorithm we’re building is trivially extensible to a general graph-searching algorithm, is my point.
The sorting step in our algorithm is now this:
> queue = sorted(new_nodes + queue[1:], lambda a, b: f(*a) - fa(*b))
This is already an awesome improvement. For our sample problem, it finds an optimal solution in 13 iterations (considerably better than BFS’s 803 iterations (which also found an optimal solution, because it’s BFS with a uniform transition cost), but also quite a bit better than hill climbing’s 27; and it’s a solution in 12 moves, so the one hill climbing found last time is indeed suboptimal).
For the record, and because this post wouldn’t have any relevant images otherwise, the full tree examined is shown to the right (goal node marked).
In fact, for this particular problem, we’re done. This is all that’s required to make the search algorithm optimal, provided that we’re dealing with a problem where every transition has the same cost, which we are. We’d like to come up with a more general algorithm, though, which we could also apply to graphs with varying edge costs, and before we have that, there are a few more problems we need to deal with. One of those is the visited list.
You’ll remember that we’re keeping track of every state we’ve ever seen in a list. In the case of the eight puzzle, this isn’t likely to compromise the optimality of the algorithm because we’re dealing with unit cost transitions again, but for searching graphs with varying edge costs, it’s going to be a problem, because the second time you come across a given state, the path you found to that state may be shorter than the path you found the first time.0
There are several ways to solve this. In general, the sensible way to do it is to go through the queue, look at the end state of each entry, and see if that node occurs in the path of any other entry in the queue. If it does, see if the cost of that other entry is lower, and if it is, just drop the current entry from the queue entirely, because we aren’t going to find a shorter path. This won’t prune all redundant paths, but it will take care of a lot of them at a much lower cost than our current set-up.
That’s what I’d do if we were storing our paths are states instead of rule applications here too. Instead, we’re going for a simpler but far more memory-intensive solution: we’re still going to store all states, but we’re also going to add how many steps it took to get to that state. A good way to do that is probably as a hashtable/dictionary, but because Python lists aren’t hashable, we’ll convert them to strings first:
> visited = {str(start): 0}
And the check further on becomes:
> s_hash = str(n_path) > if not s_hash in visited or visited[s_hash] > len(n_path): > new_nodes.append((n_path, n_state)) > visited[s_hash] = len(n_path)
We still don’t have an algorithm that’s necessarily optimal in all cases. Consider a case where we’ve found a solution, but the cost to get there is, say, 15, and we have another entry in the queue which doesn’t end in a solution yet, but has an f value of 12. As written, our algorithm would terminate and return the solution we’ve found, while there could very well be a shorter path through the other node!
The solution is not to check whether our node ends in a goal state while we’re expanding the nodes, but to check if the first node in the sorted queue does:
> for rule in rules: > if rule.can_apply(state): > n_state, n_path = rule.apply(state), path + [rule.num] > s_hash = str(n_state) > >if n_state == goal:>print "Solution found in %d iterations:" % steps, n_path>exit(0)> > if not s_hash in visited or visited[s_hash] > len(n_path): > new_nodes.append((n_path, n_state)) > visited[s_hash] = len(n_path) > > queue = sorted(new_nodes + queue[1:], lambda a, b: f(*a) - f(*b)) > > if queue[0][1] == goal: > print "Solution found in %d iterations:" % steps, queue[0][0] > exit(0)
And that’s that. We’ve now recreated the famous A* algorithm.1
There’s a lot of room for optimisation, of course (sorting the whole queue is pointless, for example, because you can extract the best node in a single pass, which is O(n) compared to the O(n*log(n)) of sorting), but the basic idea is there.
Two important considerations to A*:
- As with most graph search algorithms, there can’t be any negative edges in your graph. In our example, this obviously isn’t a problem, and in the real world it usually won’t be either.
- For various reasons, the h function has to be an underestimate for this algorithm to be optimal. If you want to play it really safe, you could just have h return 0 for everything, in which case you basically get Dijkstra’s algorithm. It’ll still find a solution for any kind of h, but it’s not guaranteed to be optimal, and it can, of course, take forever.
To appreciate the value of a good h function, by the way, you can indeed substitute the following for h:
> h = lambda _: 0
Running the algorithm again finds an optimal solution, but it takes 1828 iterations.
We can also try the other heuristic we mentioned last time, which just counts the tiles that are in the wrong position:
> def h_countwrong(state): > d = 0 > for n in range(1, 9): > if locate(state, n) != locate(goal, n): > d += 1 > return d
This finishes in a much more respectable 60. It can be proven that a better h function will always cause A* to perform at least as good as a worse one; that is, there’s no degenerate case where a worse h function can outperform a better one—provided h is never an overestimate.
I’m not going to get into the whole thing about monotonicity restrictions and pathmax and whatnot, because, to be honest, it isn’t very interesting. If you’ve followed everything so far, you’re already doing better than most people who use these algorithms for practical applications, apparently. Most people don’t bother with fancier search algorithms than A*, presumably for the same reason most people don’t bother with fancier sorting algorithms than quicksort.
If I’m going to continue writing these introductory AI things, I’m going to move away from search and do something involving machine learning, maybe. The KUL’s introductory AI class (where I get most of my information) only talks about version spaces, which isn’t (in my opinion) the most interesting approach, so it’ll probably be a while before I write about it.
0 It can be proven that strictly speaking, that’s not true if your h function obeys the monotonicity constraint, which basically requires that in any given path, if the h value in a given node is n, the value in the next node along the path can’t be more than n. If you want to know more, look it up.
1 Not really, because A* incorporates the more sensible approach to loop detection mentioned earlier. What we have is a more memory-intensive, maybe slightly more efficient kind-of-A*.
Nathan said,
March 5th, 2010 at 3:11 pm
In h_manhattan, shouldn’t it be subtracting the y values? That brings the iterations down from 52 to 13.
Regardless, nice post. I enjoyed reading it.
Cairnarvon said,
March 5th, 2010 at 7:00 pm
You’re absolutely right, of course. Another thing that is easy: typos. I’ve changed the text and the search tree to reflect the fix.
A nice demonstration of the fact that a wonky heuristic will still get you a solution.
Keegan said,
March 5th, 2010 at 8:52 pm
You can also use a heap which will find the next node to process in O(log(n)) (See the heapq module)
Cairnarvon said,
March 6th, 2010 at 3:47 am
Indeed. Something like a Fibonacci heap is great for this.
dons said,
March 9th, 2010 at 9:48 pm
I have a blog post request: graph databases.
Anonymous said,
March 10th, 2010 at 10:28 pm
Cool post, Xarn-