The idea is quite simple: treat the pattern p

_{0,m-1}as if it was a number expressed in radix-d notation, i.e., of the form:

P = p

_{0}× d

^{m-1}+p

_{1}× d

^{m-2}+ ... + p

_{m-1}× d

^{0}

Now, a match is found if a substring of the text represents the same exact integer, i.e., the pattern p

_{0,m-1}matches a substring t

_{k,k+m-1}(0 ≤ k ≤ n-m), or equivalently:

P = T

_{k}= t

_{k}× d

^{m-1}+ t

_{k+1}× d

^{m-2}+ ... + t

_{k+m-1}× d

^{0}

It's easy to see that the radix-d integer for the substring starting at index location k can be easily computed from the radix-d integer for the substring starting at k-1:

T

_{k}= t

_{k}× d

^{m-1}+ t

_{k+1}× d

^{m-2}+ ... + t

_{k+m-1}× d

^{0}=

= (t

_{k-1}× d

^{m-1}+ t

_{k}× d

^{m-2}+ ... + t

_{k+m-2}× d

^{0}- t

_{k-1}× d

^{m-1}) × d + t

_{k+m-1}=

= (T

_{k-1}- t

_{k-1}× d

^{m-1}) × d + t

_{k+m-1}

That is, subtract the first character of the previous substring multiplied by the radix to the size of the pattern minus one, then multiply by the radix and add the last character of the new substring. It's very straightforward on paper, although it can be tricky to get right.

This approach would certainly work if we could easily store in some integer type the radix-d numbers represented by the strings. Unfortunately, this is not always the case (almost never in fact). Let's look at a simple example: assume your char size is 8 bit, so that you would naturally choose the radix 256=2

^{8}, since each digit represents a value between 0 and 255=2

^{8}-1. An m-char string can express numbers between 0 and 256

^{m}-1=2

^{8×m}-1. If we want to store these integers in a 64 bit unsigned long type, we can only represent strings of length 8: 2

^{8×8}=2

^{64}. Bottom line: we can't express P and T

_{k}using integers.

This is where the most powerful idea of the Rabin-Karp algorithm comes to play. We don't need to represent these huge numbers, but only their value modulo a big enough prime q, i.e., P mod q and T

_{k}mod q. The catch is that if the match is positive, i.e., the hash codes are identical, we still need to check for equality between the pattern and the substring to make sure it's not a false positive. This obviously increases the computation time if there are a lot of false positives.

One last thing worth noting is that the Rabin-Karp algorithm works very well to match multiple patterns all at once: instead of storing a single key P, we can store a set of keys P

_{1,...,l}for each pattern. Great care has to be made in computing the keys T

_{k}now since the patterns can have different lengths, but once one gets the details right, it's quite straightforward.

Follows a simple java implementation of the Rabin-Karp algorithm to find all instances of a single pattern in a text.

## No comments:

## Post a Comment