The Wagner-Fischer Algorithm
By CS Wagner
Last updated April 2011.
Background
In 1950, Richard Hamming defined the concept distance between two binary strings [1]. A binary string is a series of ones and zeros, such as 1001101. Distance is a measure of how different two binary strings are. Similar strings have very little distance. Radically different strings have very large distance. Therefore, distance is used as a measure of either similarity or difference.
Hamming distance (as it became known) is very simple. Given two binary strings, how many positions between the two strings are different? Compare 1001101 to 1011001. The Hamming distance is 2. If you go through each string one character at a time, you will see that two of the positions have different characters. Also, you can imagine that if the distance is zero, the two strings are identical. What if all of the positions are different? The distance will be the length of the strings - a count of how many characters there are.
There is a limit to Hamming distance. Both strings must be the same length. The only characters allowed are 0 and 1. Throughout time, these limitations have been relaxed. Some people will pad short strings with extra characters to make them the same length. For example, comparing 101 to 11011 is accomplished by padding 101 to 101** such that the * character will not actually match anything. Further, there is no reason that this must be limited to binary values. Why not compare ABCD to ABBA and see how many positions have the same character (2 in this case)?
Vladimir Levenshtein pushed this concept of distance much further in his 1965 paper (translated to English in 1966) [2]. As just mentioned, Levenshtein adopted the changes to Hamming distance that allow for variable-length strings and non-binary characters. Then, he defined Hamming distance as a count of substitutions. Comparing CAT to HAT, an H is substituted for a C. Finally, Levenshtein introduced two new concepts: insertions and deletions. Comparing BAT to BAIT, an I is inserted. Comparing FROG to FOG, an R is deleted. In the end, Levenshtein distance is a count of how many substitutions, insertions, and deletions are required to change one string to another string.
For many years, Levenshtein distance was a great concept, but it was not implemented often. To do so for short strings was acceptable, but long strings simply took far too long to calculate. Given two strings of lengths m and n, Levenshtein distance would require at least m×n2 comparisons to calculate. Luckily, a concept of dynamic programming was beginning to catch on.
The Wagner-Fischer Algorithm
Robert Wagner and Michael Fischer were programmers in a new field of programming called dynamic programming. In general, dynamic programming takes a large problem, such as calculating Levenshtein distance, and breaks it down into smaller subproblems. By adjusting the structure of the solution to fit the smaller subproblems, many complex problems are solved much faster with dynamic programming. Wagner and Fischer published an algorithm to solve for Levenshtein distance with dynamic programming in 1974 [3]. Their algorithm would only require m×n comparisons.
To break down the Levenshtein distance problem, Wagner and Fischer used a matrix solution. Given two strings, place one string down the left side of the matrix. Place the other across the top of the matrix. If comparing HEAT to CART, the matrix would look like:
| C | A | R | T | |
|---|---|---|---|---|
| H | ||||
| E | ||||
| A | ||||
| T |
The new problem is simply what number should go in each empty cell of the matrix. Look at it from a single comparison. If comparing the first letter of each string (H and C), what is the distance? Since they are different, the distance is a single substitution, or 1. So, a value of 1 should be placed in the cell aligned with the first letter of each string as so:
| C | A | R | T | |
|---|---|---|---|---|
| H | 1 | |||
| E | ||||
| A | ||||
| T |
Now, look at the last letter of each string. Both are a T. What should the distance be? Since they are the same, the distance will not change. Whatever it was up to that point is what it should continue to be. If you notice that comparing the characters will take you in a general direction from the top-left to the bottom-right, you can see that the value in the bottom-right cell will be whatever distance was in the cells to the top, left, or top-left.
That is the revelation that makes this algorithm work. Each cell is dependent on the values in the cells to the top, left, and top-left of the cell itself. Instead of solving the whole Levenshtein distance problem, we only need to solve the problem of calculating the distance of one cell. What we have is three cells to look at (the top, left, and top-left cells) and the character aligned with the cell to the left and top. For convenience, an extra row and column are added so that every cell has three cells to compare to. After experimentation, Wagner and Fischer calculated the values that belong in the new pre-filled cells. With them added, the matrix looks like:
| C | A | R | T | ||
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | |
| H | 1 | ||||
| E | 2 | ||||
| A | 3 | ||||
| T | 4 |
Now, to fill in a cell, we need to know the two characters that align with the cell. We also need to know the values of the cells around it. The only cell that has the values of all three cells to the top, left, and top-left is the cell aligned with H and C. After doing this a couple times, you will find that you have to fill in the cells from the top-left to the bottom-right. You can do it row by row from the top row to the bottom row. It also works going column by column from the left column to the right column. I do it row by row.
Our first cell compares H to C. They are different. The distance for this comparison is 1. Looking at the neighboring cells, the best distance so far is 0 (as seen in the top-left cell). So, if we add this distance of 1 to the best previous distance, this cell gets a value of 1 (which is 0+1).
| C | A | R | T | ||
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | |
| H | 1 | 1 | |||
| E | 2 | ||||
| A | 3 | ||||
| T | 4 |
Now, lets compare H to A. They are different, so the distance between them is 1. Looking at the neighboring cells, the best distance so far is 1 (in both the left and top-left cells). Adding this distance of 1 to the previous best distance of 1, this cell gets a value of 2. If you continue, you will find that the H/R cell gets a value of 3 and the H/T cell gets a value of 4.
| C | A | R | T | ||
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | |
| H | 1 | 1 | 2 | 3 | 4 |
| E | 2 | ||||
| A | 3 | ||||
| T | 4 |
Looking at the E row, we once again have four cells in which E doesn't match anything in CART. The distance of each cell will be 1. That 1 is added to the best distance in the neighboring cells.
| C | A | R | T | ||
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | |
| H | 1 | 1 | 2 | 3 | 4 |
| E | 2 | 2 | 2 | 3 | 4 |
| A | 3 | ||||
| T | 4 |
Now we are on to the A row. The first cell finds no match between A and C. We add 1 to the best distance in the neighboring cells. The second cell does have a match. A and A are the same. The distance is 0. So, instead of adding 1 to the best distance so far, we only add 0. In other words, the best distance so far remains the best distance.
| C | A | R | T | ||
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | |
| H | 1 | 1 | 2 | 3 | 4 |
| E | 2 | 2 | 2 | 3 | 4 |
| A | 3 | 3 | 2 | ||
| T | 4 |
The rest of the A row has no matches. As for the T row, there isn't a match until you hit the last cell. Then, for that cell, you add 0 to the best distance so far.
| C | A | R | T | ||
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | |
| H | 1 | 1 | 2 | 3 | 4 |
| E | 2 | 2 | 2 | 3 | 4 |
| A | 3 | 3 | 2 | 3 | 4 |
| T | 4 | 4 | 3 | 3 | 3 |
So, what do we have? That last cell is the cumulative distance of the entire matrix. It is the distance between HEAT and CART. Because we compared every character in HEAT to every character in CART, we not only checked for Hamming-style substitutions. We also checked for Levenshtein-style insertions and deletions. So, the value of the bottom-right cell is the Levenshtein distance between the two strings. Comparing HEAT to CART, the distance is 3.
Also notice how much work was performed. We compared each character of the left string to each character of the top string only once. If the left string has a length of m and the top string has a length of n, we did m×n comparisons. In this case, we did 4×4 (or 16) comparisons. That is a lot better than the old method that required m×n2 comparisons (64 in this case).
For a formal definition, Wagner and Fischer made a small change. When there is a match, simply assume the cell to the top-left has the best distance so far. That will reduce the need to calculate a minimum value between three cells when there is a match. The formal definition is based on the definition of a matrix called M. The rows are labeled i and the columns are labeled j. The left string is called A and the top string is called B. So, Ai is the ith character in A. Bj is the jth character in B. Mij is the cell aligned with Ai and Bj. The value for each cell of M is calculated with the formula:
| Mij = | { | if Ai = Bj, | M(i-1)(j-1) | ||||
| if Ai ≠ Bj, | min | ( | M(i-1)(j-1) | ) | + 1 | ||
| M(i-1)j | |||||||
| Mi(j-1) | |||||||
Alignment
So, is calculation of Levenshtein distance all that the Wagner-Fischer algorithm can do? Of course not. Let's start by looking at the addition of +1 on a mismatch and +0 on a match.
What if we have two strings where there is a vagueness about what the best match could be. Consider ABCD compared to ABXCD. No matter how we do it, the distance will be 1. But, check the matrix and follow the distance from the 1 in the lower right, hitting the lowest valued neighbor as you travel to the upper-left cell.
| A | B | X | C | D | ||
|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | 5 | |
| A | 1 | 0 | 1 | 2 | 3 | 4 |
| B | 2 | 1 | 0 | 1 | 2 | 3 |
| C | 3 | 2 | 1 | 1 | 1 | 2 |
| D | 4 | 3 | 2 | 2 | 2 | 1 |
When comparing C to C, there are two possible paths. We can go left to a value of 1 or up-left to a value of 1. Before we try to limit this ambiguity, lets make it even more evident. Instead of adding 1 on a mismatch, we can add 5. That will make the numbers really change, but the result will still be the same.
| A | B | X | C | D | ||
|---|---|---|---|---|---|---|
| 0 | 5 | 10 | 15 | 20 | 25 | |
| A | 5 | 0 | 5 | 10 | 15 | 20 |
| B | 10 | 5 | 0 | 5 | 10 | 15 |
| C | 15 | 10 | 5 | 5 | 5 | 10 |
| D | 20 | 15 | 10 | 10 | 10 | 5 |
Because we are adding 5 for every mismatch, the Levenshtein distance is the value of the bottom right cell divided by 5. The advantage here is that it is very easy to see the minimal path from the bottom right to the top left cell. Still, we have that tie where we can either go to 5 or 5 from C/C. So, what if we decide to change how the upper-left comparison works. Let's change the entire function to this:
| Mij = min | ( | M(i-1)(j-1) + Sim(Ai, Bj) | ) |
| M(i-1)j + g | |||
| Mi(j-1) + g |
I'm going to call g a gap penalty. That will make more sense later. In our previous example, the gap penalty was increased from 1 to 5. The big change here is that we don't consider a match or mismatch when comparing characters. Instead, we have a function called Sim. Assume that Sim(a,b) returns 0 when the two characters a and b are the same, but returns the gap penalty (now 5) when they are not the same. The result will be same as the previous example. In fact, Levenshtein distance uses a gap penalty of 1 and that exact Sim function. So, we've just made Levenshtein distance a bit more expandable. Well, "we" didn't. Saul Needleman and Christian Wunsch published this expansion in 1970 (before the Wagner-Fischer algorithm was published). The big difference is that they made the gap penalty negative and took the max value instead of min value - basically making all the values in the table negative numbers instead of positive numbers.
Global Alignment
So, what were Needleman and Wunsch trying to do? If we alter the Sim function to make it more creative, we can see. First, we will create a table which provides a comparison between all the characters we are looking at.
| Sim Table | |||||
|---|---|---|---|---|---|
| A | B | C | D | X | |
| A | 0 | 1 | 2 | 3 | 4 |
| B | 1 | 0 | 1 | 2 | 4 |
| C | 2 | 1 | 0 | 1 | 4 |
| D | 3 | 2 | 1 | 0 | 4 |
| X | 4 | 4 | 4 | 4 | 4 |
Now, use our modified algorithm to fill in the comparison between ABCD and ABXCD. Keep the gap penalty at 5.
| A | B | X | C | D | ||
|---|---|---|---|---|---|---|
| 0 | 5 | 10 | 15 | 20 | 25 | |
| A | 5 | 0 | 5 | 9 | 14 | 19 |
| B | 10 | 5 | 0 | 4 | 9 | 14 |
| C | 15 | 10 | 5 | 4 | 4 | 9 |
| D | 20 | 15 | 10 | 9 | 9 | 4 |
We still have that ambiguity going back from C/C, but watch this. Change the Sim table and we get an entirely new result. Note: Comparing C to X results in 5 now.
| Sim Table | |||||
|---|---|---|---|---|---|
| A | B | C | D | X | |
| A | 0 | 1 | 2 | 3 | 4 |
| B | 1 | 0 | 1 | 2 | 4 |
| C | 2 | 1 | 0 | 1 | 5 |
| D | 3 | 2 | 1 | 0 | 4 |
| X | 4 | 4 | 5 | 4 | 4 |
| A | B | X | C | D | ||
|---|---|---|---|---|---|---|
| 0 | 5 | 10 | 15 | 20 | 25 | |
| A | 5 | 0 | 5 | 9 | 14 | 19 |
| B | 10 | 5 | 0 | 4 | 9 | 14 |
| C | 15 | 10 | 5 | 5 | 4 | 9 |
| D | 20 | 15 | 10 | 10 | 9 | 4 |
The ambiguity is gone. What we've done is made a preference to align B with X over aligning C with X. So, what does it mean to "align"? That is what Needleman and Wunsch were doing. Specifically, they were doing global alignment.
Assume we have ABCD and we want it to be same length as ABXCD. What do we do? We add a gap character to ABCD. We'll use * for our gap character. So, ABCD* is now the same length as ABXCD. But, that isn't very good. It would be better if we placed the gap character in a position such that the distance between our new string and the string we're aligning to is minimized. In other words, we want to get as many characters in ABCD aligned with matching characters in ABXCD as possible.
The last table we made using the Needleman-Wunsch algorithm shows us how to do that. Start in the lower right cell. If the characters to the left and top of that cell match, write the character down. For us, the character is D in both the left and top, so we write down D. Then, go to the neighboring cell (left, up, up-left) that has the lowest value. For us, that is the cell aligned with C and C. Those match, so we prepend C to what we're writing and have CD. Repeat. The lowest value neighbor is aligned with B and X. Those do not match. So, we prepend the gap character and get *CD. Continuing, we will prepend B to get B*CD. Then, we will prepend A to get AB*CD. So, we can make the claim that AB*CD is an optimal global alignment of ABCD to ABXCD (which is a true claim).
Now, depending on need, the value of g can be adjusted to decide where to place gap characters. The result of the Sim function can be altered to decide which character alignments are more important than others. Then, an optimal alignment of any two strings is possible. Why would you want to do that? Assume you work for the CIA. You just got a box full of papers. These papers are the instructions for some top secret spy device. Some sheets may be missing. If so, which sheets are they? You take a book you know to be complete and send it through a scanner that reads each page and encodes all the stuff on the page to a single letter A through Z. You get a string like CJSYUEHGJSKEIGHEYCJJE (but a hell of a lot longer). You send the book with missing pages through the scanner and get CJSYEHGJKSEIGHEYJJE. You perform a global alignment and get CJSY*EHGJKSEIGHEY*JJE. Aha! The second book is missing the page aligned with U and the page aligned with the second C.
Yes, that is a bit of fantasy. In reality, this is used to align sequences of DNA strands with one another for comparison. It is hard for the human eye to align ACAGCTAAGTCCAGTGA with AATCGTAACTACAGTTA. But, that brings up a entirely new task. What if you have a long string and you want to figure out if a shorter string is mostly contained in the long string. For example, you have THISISAREALLYLONGSTRING and you are looking for something rather similar to STRAY. This is called local alignment (as opposed to the global alignment that we just did).
Local Alignment
When finding the best alignment of a small string within a larger string, we are doing local alignment. Once again, this is easy to do with a small adaptation of the Wagner-Fischer algorithm. First, we'll look at the gap penalty we just added.
Right now, we have g for a gap penalty. We can make that more specific by using gi as a gap penalty for insertions and gd as a gap penalty for deletions. Then, our formula for filling in each cell of the matrix would look like:
| Mij = min | ( | M(i-1)(j-1) + Sim(Ai, Bj) | ) |
| M(i-1)j + gd | |||
| Mi(j-1) + gi |
This is nearly the formula proposed by Temple Smith and Michael Waterman in 1981 [5]. What we are doing is placing extra weight on either deletions from or original short string or insertions to our original short string. For example, if you are comparing ABC to ACXXXABXC, you have two options. You could delete the B to align with AC. You could also insert a gap to align with ABXC. Depending on what you really want, you increase/decrease the insertion and deletion gap penalties accordingly.
There is one more concept that Smith and Waterman added. Imagine that you are looking for a substring. When you compare the first character of each string, you are starting at a base point of no distance (Levenshtein distance is 0). From there, you may find distance. But, once we find the start of the substring we are insterested in, we don't really care how distant the previous part of the string was. Consider a comparison of BELL to UMBRELLA. When you reach the B in UMBRELLA, you don't care about the distance of UM. So, Smith and Waterman hide the distance in two ways. First, they prefill the top row and left column with zeros. Everything begins just as distant as everything else. Then, they use a negative gap penalty to push cell values down, but they don't allow the values to go below zero. So, when things start getting distant, they don't get too distant. Smith-Waterman's modified Wagner-Fischer algorithm looks like:
| Mij = max | ( | 0 | ) |
| M(i-1)(j-1) + Sim(Ai, Bj) | |||
| M(i-1)j + gd | |||
| Mi(j-1) + gi |
Notice that it is a "max", not "min" function. Because the gap penalties are negative, they will drive the values down, but they can't go below zero. The Sim function will attempt to raise the values when a match is found. For this example, use a value of -1 for both gap penalties. For the Sim function, use 1 when there are matching characters and 0 when the characters don't match. Comparing BELL to UMBRELLA, you get:
| U | M | B | R | E | L | L | A | ||
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| B | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| E | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| L | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 1 | 0 |
| L | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 2 |
Now, we look for the cell with the highest value. In this case, it is the 3. From there, we travel towards the up-left cell by going to the neighbor with the highest value. We can go to either 2 from the 3. Then, we go to the 1 (either one of them). Then, to the next 1 and the next 1. We then hit a 0, so we are done. Just like global alignment, when the characters on a cell match, we write the character down. When they don't match, we write down a gap character. In this case, we end up with B*ELL. That is the optimal local alignment of BELL with UMBRELLA.
Common Sequences
Now, assume we are interested in finding common sequences between two strings. For example, we have two strings of numbers like 09061969 and 08601979 and we want to know what the common sequence of numbers shared between the two strings is. Of course, we want a long sequence. We also need the sequence to be in order between the two strings. We can use alignment to solve this problem. Global alignment works fine, though local alignment will work if you have a short string compared to a rather long string. Using global alignment, we have the following. Note that I'm using a gap penalty of 1 and the Sim function returns 0 on match and 1 on mismatch.
| 0 | 8 | 6 | 0 | 1 | 9 | 7 | 9 | ||
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
| 0 | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| 9 | 2 | 1 | 1 | 2 | 3 | 4 | 4 | 5 | 6 |
| 0 | 3 | 2 | 2 | 2 | 2 | 3 | 4 | 5 | 6 |
| 6 | 4 | 3 | 3 | 2 | 3 | 3 | 4 | 5 | 6 |
| 1 | 5 | 4 | 4 | 4 | 4 | 3 | 4 | 5 | 6 |
| 9 | 6 | 5 | 5 | 5 | 5 | 4 | 3 | 4 | 5 |
| 6 | 7 | 6 | 6 | 5 | 6 | 6 | 5 | 4 | 5 |
| 9 | 8 | 7 | 7 | 6 | 6 | 7 | 6 | 5 | 4 |
Now, we take the path from the bottom right to the top left following the lowest neightboring values. You'll notice that we can go straight diagonally from the bottom right to the top left. There are alternate paths, but the end result will be almost the same (I'll get to the difference in a bit). When the characters on the left and top of a cell match, write the character down. Otherwise, write down the gap character. You will end up with 0***19*9. So, that is the common sequence between the two strings of numbers.
Now, notice that from the comparison of 1 to 1, there are alternate paths that eventuall reach the top left cell. If you follow all of those paths, you will find that it is possible to add another digit to our common sequence. We can get 0*0*19*9 or 0*6*19*9. By adjusting the values of the gap penalty and the Sim function, we can ensure that we get the most digits possible in our sequence. In this case, we could also adjust the Sim function to prefer matching 0 to matching 6 to ensure the most characters with the smallest values.
Implementation
While the Wagner-Fischer algorithm is a very useful function for checking string distance, aligning strings, and locating common sequences, it is also a very complex function. For two strings of length m and n, it requires O(mn) time. That is on par with O(n2), which is considered a bad algorithm in computer algorithm design. In order to try to improve the complexity, we must first understand how to implement the function.
First, keep in mind that this is dynamic programming. While the overall problem may be Levenshtein distance or global alignment, the real problem we are solving is how to fill in a single cell given the three neighboring cells. So, our function will look like the following - which will need adjustment depending on your favorite programming langauge.
/**
* Compares characters a and b.
* Uses neighboring cell values from up, left, and up_left.
* Returns the value for a cell.
*/
function CellValue(a, b, up, left, up_left)
{
...some code will go here...
}
Now, for the code inside. Why don't we make this as complex as possible. We'll use a Sim function for comparing the values of a and b. Then, we will have two gap penalties (insertion and deletion). What we return will be based on if we are doing distance/global alignment or local alignment. Remember, local alignment uses a max function with a max value of 0. The others use a min function. Putting this all together, we get...
/**
* Returns the value for a cell.
* Compares characters a and b.
* Uses neighboring cell values from up, left, and up_left.
* Set local to true to do a local alignment.
*/
function CellValue(a, b, up, left, up_left, local=false)
{
int gi = 1; // Insertion gap penalty.
int gd = 1; // Deletion gap penalty.
if(local)
return max(0, up_left+Sim(a,b), up+gd, left+gi);
else
return min(up_left+Sim(a,b), up_gd, left+gi);
}
Now, we need a Sim function. We will use a 0 on match and 1 on mismatch to make it easy.
/**
* Compares characters a and b.
* Returns comparison value.
*/
function Sim(a, b)
{
if(a == b) return 0;
else return 1;
}
Note: That can be reduced to return(a!=b);, but I want to make it obvious what it does.
Now, we can get the value for a single cell. We need a matrix. This is a two-dimensional array. The size of the array is length of the two strings (A and B) plus one.
int matrix = array[strlen(A)+1][strlen(B)+1];
Now, we need to fill the top row with 0 through the length of B.
for(int i=0; i<=strlen(B); i++) matrix[0][i]=i;
Then, fill the left column with 0 through the length of A.
for(int i=0; i<=strlen(A); i++) matrix[i][0]=i;
If we were doing local alignment, we'd fill each of those with 0 instead of i. But, this does global alignment or Levenshtein distance as it is. So, we got a prefilled array and we need to fill in the cells. We do them from the top row down to the bottom row, going left to right. That is a for loop in a for loop. This is where we hit the O(n2) complexity.
for(int row=1; row<=strlen(A); row++) for(int col=1; col<=strlen(B); col++) matrix[row][col]=CellValue(A[row-1], B[col-1], matrix[row-1][col], matrix[row][col-1], matrix[row-1][col-1]);
There will be issues with aligning the matrix with the strings since we filled the first row/column of the matrix with extra values, but overall this is a very simple function. Now, if we were calculating Levenshtein distance, we just look at matrix[strlen(A)][strlen(B)].
It should be obvious that we can't really improve the CellValue function or the Sim function. All we can improve is the for loop inside a for loop. To do so requires massive creativity. The most common method is to use what is called lazy evaluation. The code requires a recursive function. With this, we only fill in the cells that are required to solve the distance. If the strings are similar, most of the required cells will be right up the diagonal. So, we won't need to fill in the whole matrix. It is done as so:
function CellValue(matrix, A, B, row, col)
{
int gi = 1; // Insertion gap penalty.
int gd = 1; // Deletion gap penalty.
if(A[row-1] == B[col-1]) // This is a match, we only need to know the up-left cell.
matrix[row][col] = CellValue(matrix, A, B, row-1, col-1) + Sim(A[row-1], B[col-1]);
else
matrix[row][col] = min(
CellValue(matrix, A, B, row-1, col-1) + Sim (A[row-1],B[col-1]),
CellValue(matrix, A, B, row-1, col) + gd,
CellValue(matrix, A, B, row, col-1) + gi
);
}
I left out the local alignment case for simplicity. But, you can see that this fills in the matrix array as needed (assuming you pass matrix by reference so each time it is filled with a value, the main matrix also gets filled and not a copy of the matrix - which makes sense if you understand pass-by-reference vs. pass-by-value parameters). To make this better, you can use memoization to ensure you never calculate the value of a cell more than once. But, once you have this function, you can just prefill your matrix and then you can do: int distance = CellValue(&matrix, &A, &B, strlen(A), strlen(B)); The minimum number of calculations required will be all that are performed. I have tested this extensively. You should expect to complete 92.6% of the total matrix array on average. For many people, a savings of about 8 cell calculations between two 10-character strings isn't worth the hassle of using lazy evaluation. However, strings with lower Levenshtein distance will achieve much better than 92.6% as each time the characters match, at least two cells are not calculated.
Other improvements are based on special needs. For example, what if you simply want to know if two strings have a distance of more than 5. Once you complete a row and the minimum value is greater than 5, you know that the distance will be greater than 5. So, you can stop processing.
All in all, if you think you can improve on the Wagner-Fischer algorithm, try it. If your idea works, write it down. Then, publish it. Then, have a cool new algorithm named after you. Of course, all these are named after two people. If you need a second name for your algorithm, just give me a buzz.
Conclusion
The Wagner-Fischer algorithm and derivatives of it are very useful for calculating distance, global alignment, local alignment, and common sequences. The function is rather easy to write because it is based on dynamic programming. The main problem is that it has a bad complexity. For long strings, it will take a long time to complete. But, for short strings, this is a very useful function to have in your arsenal of comparison functions.
References
- Richard W. Hamming. "Error Detecting and Error Correcting Codes." Bell System Technical Journal, 29 (2):147-160, 1950.
- Vladimir I. Levenshtein. "Binary Codes Capable of Correcting Deletions, Insertions, and Reversals." Soviet Physics Doklady, 10:707-710, 1966.
- Robert A. Wagner and Michael J. Fischer. "The String-to-String Correction Problem." Journal of the ACM, 21(1):168-173, January 1974.
- Saul B. Needleman and Christian D. Wunsch. "A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins." Journal of Molecular Biology, 48(3):443-453, March 1970.
- Temple F. Smith and Michael S. Waterman. "Identification of Common Molecular Sequences." Journal of Molecular Biology, 147:195-197, 1981.











