Tuesday, August 21, 2012

Rabin-Karp Algorithm

Here's a juicy code example implementing the Rabin-Karp algorithm in java. The Rabin-Karp algorithm is a string matching algorithm: given a text and a pattern string, the algorithm will return the location(s) of the given pattern in the text. Assuming the size of the text is n characters and the size of the pattern is m characters the average and best case running time is O(n+m).
The idea is quite simple: treat the pattern p0,m-1 as if it was a number expressed in radix-d notation, i.e., of the form:
P = p0 × dm-1+p1 × dm-2 + ... + pm-1 × d0
Now, a match is found if a substring of the text represents the same exact integer, i.e., the pattern p0,m-1 matches a substring tk,k+m-1 (0 ≤ k ≤ n-m), or equivalently:
P = Tk = tk × dm-1 + tk+1 × dm-2 + ... + tk+m-1 × d0
It's easy to see that the radix-d integer for the substring starting at index location k can be easily computed from the radix-d integer for the substring starting at k-1:
Tk = tk × dm-1 + tk+1 × dm-2 + ... + tk+m-1 × d0 =
     = (tk-1 × dm-1 + tk × dm-2 + ... + tk+m-2 × d0 - tk-1 × dm-1) × d + tk+m-1 =
     = (Tk-1 - tk-1 × dm-1) × d + tk+m-1
That is, subtract the first character of the previous substring multiplied by the radix to the size of the pattern minus one, then multiply by the radix and add the last character of the new substring. It's very straightforward on paper, although it can be tricky to get right.
This approach would certainly work if we could easily store in some integer type the radix-d numbers represented by the strings. Unfortunately, this is not always the case (almost never in fact). Let's look at a simple example: assume your char size is 8 bit, so that you would naturally choose the radix 256=28, since each digit represents a value between 0 and 255=28-1. An m-char string can express numbers between 0 and 256m-1=28×m-1. If we want to store these integers in a 64 bit unsigned long type, we can only represent strings of length 8: 28×8=264. Bottom line: we can't express P and Tk using integers.
This is where the most powerful idea of the Rabin-Karp algorithm comes to play. We don't need to represent these huge numbers, but only their value modulo a big enough prime q, i.e., P mod q and Tk mod q. The catch is that if the match is positive, i.e., the hash codes are identical, we still need to check for equality between the pattern and the substring to make sure it's not a false positive. This obviously increases the computation time if there are a lot of false positives.
One last thing worth noting is that the Rabin-Karp algorithm works very well to match multiple patterns all at once: instead of storing a single key P, we can store a set of keys P1,...,l for each pattern. Great care has to be made in computing the keys Tk now since the patterns can have different lengths, but once one gets the details right, it's quite straightforward.
Follows a simple java implementation of the Rabin-Karp algorithm to find all instances of a single pattern in a text.

No comments:

Post a Comment