Knuth–Morris–Pratt algorithm

Very effective string searching algorithm (also known as KMP) with O (m+n) complexity, where m is length of pattern and n is length of text. Unlike the brute-force algorithm we can skip some of the comparisons in the case when a mismatch occurs. The algorithm was discovered by these gentelmens:

kmp

in 1974.

Algorithm

The algorithm contains two phases: pre-processing and searching phase. Lets start from the most difficult part and this is the is pre-processing phase. The aim of this phase is to build partial match table (PMT). In fact, PMT is an array, the length of this array equals to the length of the pattern. Each element in the table represents length of the longest proper suffix that matches with the longest proper prefix of the same pattern. To understand previous statement, I would like to provide you definitions of proper prefix/suffix with examples:

  1. A proper prefix is a prefix of a string which is not equal to the string itself. For instance, here are all examples of proper prefixes of the string “apple”:
    Proper Prefix
    a
    ap
    app
    appl
  2. A proper suffix is a suffix of a string which is not equal to the string itself. Here are all examples of proper suffixes of the string “apple”:
    Proper Suffix
    e
    le
    ple
    pple

As you can see, the length m of the proper prefix/suffix equals to 0 < m < n, where n is the length of the string.

Let’s go ahead and build partial match table using knowledge that we’ve just received, for the string “ABABABA”:

pmt_1

On the table above you can find three rows, for the first two rows explanation is not required. The last row which I named “PMT Value” represents partial match table itself. Each value in this row is a length of the longest proper suffix that matches with the longest proper prefix of the string “ABABABA”.

For example, to find PMT value for the substring “ABABA”:

pmt_2

We have to list all possible proper prefixes and suffixes:

Proper prefix Proper suffix
A A
AB BA
ABA ABA
ABAB BABA

In the table above, we have two matches “A” and “ABA”. The longest one is “ABA” and its length is 3. So the PMT value is 3:

pmt_3

Let’s find another PMT value for substring “ABA”:

pmt_4

All possible proper prefixes and suffixes are:

Proper prefix Proper suffix
A A
AB BA

We have only one match which is “A” and the length(PMT value) is 1:

pmt_5

I think you understood the idea and you can try to find rest of the PMT values by yourself and compare your result with my:

pmt_1

If your result matches with my then the pre-processing phase has completed and now we are ready to use partial match table in the searching phase. The searching phase is very similar to brute-force algorithm, the main difference is that when a mismatch occurs we can skip ahead basing PMT value.

For instance, in the example below we search the pattern P “ABABABA” in the string S “ABXXXXABABAXXXXXXXABABABA”:

kmp_search_1

I’ve skipped first two steps and stopped on the step three. Because, we have a mismatch. In brute-force algorithm after mismatch has occurred we would start matching pattern from the beginning with next char in string. In other words, we’d compare S[1] and P[0] or “B” and “A”.

kmp_search_2

In Knuth–Morris–Pratt algorithm we will skip one char and start matching S[2] and P[0].Why? Because, we the have the partial match table and according to this table we know (we already checked) that in substring “AB” there are no proper prefix which equals to the proper suffix. In other words, the char “A” is not repeated and we can start comparing from S[2]:

kmp_search_3

Farther, we unsuccessfully tried to compare P[0] with all characters in the string in the range S[2] and S[5]. Only starting from S[6] to S[10] we had a match which ended at S[11]:

kmp_search_4

Let’s stop on this step, since this is another case of mismatch. In previous case we didn’t have proper suffix and prefix, so it was best case. Now, we have matched the substring “ABABA” and based on PMT, we know that the longest proper prefix and suffix for “ABABA” is “ABA”. All we have to do, is shift pattern ahead so that the substring “ABA” matches to string. In other words, we compare S[11] with P[3]. Where 3 is the PMT value by index 4, where 4 is the number of matched chars:kmp_search_5

Again, we have a mismatch between S[11] with P[3]. Just check PMT and “A” is the longest proper prefix and suffix in substring “ABA”. So, shift pattern ahead and compare S[11] and P[1] as shown on the picture below:kmp_search_6

The remaining steps are very obvious, we simply compare the pattern, in the case of a mismatch check PMT, but the value is always zero. And finally we found the pattern:

kmp_search_7

Implementation

The implementation can be found on GitHub: https://github.com/alexvolov/algorithms-and-data-structures/blob/master/src/main/java/com/alexvolov/ads/algorithms/strings/KnuthMorrisPratt.java

Complexity

The running time is linearly dependent on the amount of input data O(m+n). So, it is asymptotically imposible to develop a more efficient algorithm.

Conclusion

Since the algorithm never moves backward plus running time is O(m+n) it can be used for processing large files. But, the chances of mismatch increases with big sized alphabets.

Quiz

Build partial match table for “ABRACADABRA”.

Click here to check your answer.
A: [0,0,0,1,0,1,0,1,2,3,4]

Leave a Reply