Hi, and welcome to this next lecture on data structures and algorithms. We discussed a naive algorithm for string matching, which requires one to iterate over all the characters in the pattern as well as over all the characters in the text, and in fact we had to repeat a lot of computation. Starting today, we look at approaches that leverage computations already performed, so as to avoid repeating them and to bootstrap on them.

So, the first algorithm we will discuss is the Rabin-Karp algorithm. The Rabin-Karp algorithm hinges on identifying the vocabulary, that is the alphabet, and the size of the vocabulary over which the strings in the text and the pattern are constructed. Suppose the vocabulary was restricted to 10 values, so the alphabet was 0 to 9; in general you could have d values. The idea of the Rabin-Karp algorithm is to represent any substring of the input sequence as a number in radix-d notation. So a string of k consecutive characters represents a length-k decimal number or, as I pointed out, more generally a number in radix (base) d form; here d is taken to be 10.

To find all occurrences of the pattern P, of length m, in the text T, we compute a hash of P, which is basically the number mod q with q chosen to be a prime, and compare it with the hash of each run of m consecutive digits of T. Now, why hash? Well, one could look at the radix-d representation of the text starting at position s and ending at position s + m - 1, compare it with that of the pattern starting at position 1 and ending at position m, and check whether the radix-d representations for T and P are exactly the same. However, this can be expensive.
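As a minimal sketch of the radix-d idea just described, the following reads a window of the text as one exact number. The function name window_value and the 0-indexed positions are assumptions made for illustration, and the symbols are assumed to be digit characters valid in base d (Python's int accepts bases 2 to 36).

```python
def window_value(text, s, m, d=10):
    """Exact radix-d value of the m symbols text[s], ..., text[s+m-1] (0-indexed)."""
    value = 0
    for ch in text[s:s + m]:
        # Shift the digits seen so far one place to the left, then append the new digit.
        value = d * value + int(ch, d)
    return value


# Illustrative digits only: two neighbouring windows of length 5.
print(window_value("38472639", 0, 5))
print(window_value("38472639", 1, 5))
```

Comparing two such exact values is the expensive comparison mentioned above: for a long pattern the number has m digits and no longer fits in one machine word.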
So, the general formula for representing a number in radix-d notation is as follows. Let us say t_s is defined from the part of T starting at position s + 1 and ending at position s + m. Then t_s = T[s+m] + 10 (T[s+m-1] + 10 (T[s+m-2] + ... + 10 T[s+1])). One can do likewise for P: small p is P[m] + 10 (P[m-1] + 10 (P[m-2] + ... + 10 P[1])). Computing t_s and p can be expensive. In fact, it might not even be possible to fit t_s or p into one computer word.

So, the hash comes to our rescue here. The hash, which is the number t_s modulo some prime q, is such that it fits in one word of the computer, and likewise for p. So q is chosen in order to make the number mod q small enough. However, this comes at a price: even if the two hash values match, that is even if p mod q equals t_s mod q, it is not necessary that p and t_s are the same. So even when the two hash values match, you still need to go back and check that the digits of T and P are the same. That is an additional overhead one incurs for having used the hash function.

Now, one can provide a bad example: take the text to be a repeated n times and suppose your pattern is a repeated m times. One can easily see that there will be a match of the hash values at every position of T, and this leads to checking the digits of T and P for equality at every position. So the worst-case running time is Theta((n - m + 1) m): n - m + 1 positions of T to scan, times m for the check at each one. In the worst case it could be order of nm and therefore expensive. However, in most practical cases, when the number of hash matches is constant, one will find that the average-case complexity is very reasonable. We will look at the average-case complexity once we discuss the algorithm itself.
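The point that equal hashes do not imply equal strings can be sketched as follows. The helper name hash_mod is hypothetical, the choices d = 10 and q = 17 and the two digit strings are illustrative assumptions, and reducing mod q after every step is what keeps the running value within one machine word.

```python
def hash_mod(digits, d=10, q=17):
    """Radix-d value of a digit string, reduced mod q at every step of the nested formula."""
    value = 0
    for ch in digits:
        value = (d * value + int(ch)) % q  # never grows beyond d * (q - 1) + (d - 1)
    return value


a, b = "84726", "72639"
print(hash_mod(a) == hash_mod(b))  # the two hash values agree
print(a == b)                      # yet the strings differ, so a digit-by-digit check is still needed
```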
So, once you have double-checked the match of a segment of T with P, you report the occurrence of the pattern P in the text T. Here is an illustration. We are looking at an alphabet sigma ranging from 0 to 9, so we are basically looking at decimal representations. The choice of q in this case is the prime number 17. We look at the pattern P, which is 84726; its decimal value mod 17 is 15. Now, suppose we had to match this against this long text. You would start from the leftmost end, look at the first 5 digits, 38472, and compute mod 17; it turns out to be 1. If p mod q is not equal to t_s mod q, one can be certain that p is not equal to t_s. So we can proceed.

We look at the next 5 digits; mod 17 is 15. Now this is a hit: what we find here is that p mod q equals t_{s+1} mod q. Does that mean p equals t_{s+1}? No. So now you need to verify that p indeed equals t_{s+1}, and it is correct: 84726. That was indeed a valid match, so we flag a valid match. Further along, at 72639, you get the same mod 17 value. So we find that p mod q indeed equals t_{s+3} mod q, but p is not equal to t_{s+3}. So there was a false alarm that the hash value collision alerted us to. This is the price we need to pay for using the hash function.

Now, how many mistakes do we expect to see? We can expect the number of such false alarms, or spurious hits, to be of the order of the length of the input text divided by the prime q, that is order of n divided by q. Continuing, we find an invalid match.

So, is there a way we could compute the hash value by leveraging computations from the past? 38472 mod 17 is 1; can that be used to discover that 84726, with the additional digit 6, is 15 mod 17? We have an old higher-order digit, 3, that disappears and a new lower-order digit, 6, that appears.
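The illustration can be replayed with a short sketch. It uses only the digits named in the lecture (the text prefix 38472639 and the pattern 84726); the rest of the text on the slide is not reproduced here, and the function name classify_windows is an assumption. For clarity it recomputes each window's value from scratch rather than reusing past work.

```python
def classify_windows(text, pattern, q=17):
    """Label every length-m window of a decimal digit string as miss, valid match or spurious hit."""
    m = len(pattern)
    p_hash = int(pattern) % q
    results = []
    for s in range(len(text) - m + 1):
        window = text[s:s + m]
        w_hash = int(window) % q
        if w_hash != p_hash:
            label = "miss"          # different hashes: the strings certainly differ
        elif window == pattern:
            label = "valid match"   # same hash, and the digits agree
        else:
            label = "spurious hit"  # same hash, but the digits differ
        results.append((window, w_hash, label))
    return results


for row in classify_windows("38472639", "84726"):
    print(row)
```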
So, one can update the mod value using Horner's rule as follows. The decimal value t_{s+1} can be obtained by first subtracting from t_s the leading digit, here 3, multiplied by the base raised to the number of remaining digits. What does that mean? t_s - 10^(m-1) T[s+1]. However, after the subtraction I need to do a left shift in order to insert 6. This left shift is achieved by multiplying by the same base, 10, and to this we add the new digit T[s+m+1]. Altogether, t_{s+1} = 10 (t_s - 10^(m-1) T[s+1]) + T[s+m+1]. The subtraction corresponds to removing 3, the digit at the most significant position; the multiplication by 10 corresponds to left-shifting what remains; finally, adding T[s+m+1] corresponds to appending a digit at the least significant position.

Now, does all this carry forward to mod operations? Recall from modular arithmetic that we have some properties for addition and multiplication. For multiplication, (a mod q) times (b mod q) is congruent to ab modulo q. One can write this in terms of ordinary equality as ((a mod q)(b mod q)) mod q = (ab) mod q. So this multiplication, taken mod q, is the same as ab mod q.

Making use of these properties, one can update the mod for the new number 84726 by taking the mod of the terms on the right-hand side. What we note here is that we have 10^(m-1), or in general d^(m-1), which might be expensive to compute at every step. However, by virtue of modular arithmetic, one can compute h = 10^(m-1) mod q. This h can be pre-computed, and the smaller value substituted, so that the update t_{s+1} mod q = (10 ((t_s mod q) - h T[s+1]) + T[s+m+1]) mod q is done in more reasonable time. So the pre-computation involves computing h, p mod q and t_0 mod q, where t_0, you might recall, corresponds to the first m characters of T, that is positions 0 to m - 1 when we index from zero.
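The update can be tried out on the numbers from the example, as a small sketch. The function name roll and its argument names are assumptions; h is computed with Python's built-in three-argument pow, which performs modular exponentiation.

```python
def roll(t_s, old_digit, new_digit, h, d=10, q=17):
    """Hash of the next window, given the hash t_s of the current one.

    old_digit is the leading digit that leaves the window, new_digit is the
    digit that enters at the low-order end, and h is d^(m-1) mod q.
    """
    # Python's % returns a value in 0..q-1 even when the bracket is negative.
    return (d * (t_s - old_digit * h) + new_digit) % q


m = 5
h = pow(10, m - 1, 17)     # 10^4 mod 17, pre-computed once
print(h)
print(roll(1, 3, 6, h))    # from 38472 (hash 1) to 84726: drop the 3, append the 6
```

In a language where the remainder of a negative number can be negative, one would add q before taking the final mod; that detail is an observation about such languages, not something the lecture states.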
To begin with, we compute t_0 and p and, as already motivated, we pre-compute d^(m-1) mod q. We have made an assumption here that the text T is over a vocabulary sigma of size d, so we are able to create a one-to-one mapping from every character in sigma to one of the digits 0 to d - 1. Once we have done this pre-computation, we only need to make use of the modified Horner's rule, with the modulo, to compute the mod for subsequent sequences.

So, this is what we do. We look for a match of p to t_s, and we do this for every position s. If indeed there is a match, then we make sure that the two strings are the same: we check that P at positions 0 to m - 1 is equal to T starting at position s and going until position s + m - 1. If j equals m, that is if the index j finds a match at all m positions in both T and P, then we know that the shift s is valid, or rather that there is a match starting at position s in T. You keep doing this, and for s less than n - m, that is while there are still m characters of T left after moving the window by one, we dynamically update to t_{s+1}, using the properties of modular arithmetic that we already discussed. Recall that h was the general d^(m-1) mod q, and t_s itself was obtained after the mod q operation: it is the value of T from s to s + m - 1, mod q. We perform the multiplication and addition, and again check for a mod q match. If there is a match, you go back and check for character-wise equality.

So, the analysis. The pre-computation, which involves a scan of the m positions of P and the first m positions of T, is an order m operation. Matching the substring of T originating at every position against P involves a scan over the entire array T, so that is n - m + 1 positions in T.
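Putting the steps together, a minimal sketch of the whole matcher might look as follows. It is not the lecture's pseudocode: the function name rabin_karp, the defaults d = 256 and q = 101, and the use of ord as the mapping from characters to digits are all assumptions for illustration. Positions are 0-indexed.

```python
def rabin_karp(text, pattern, d=256, q=101):
    """Return every 0-indexed shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)  # d^(m-1) mod q, pre-computed once
    p = t = 0
    for i in range(m):    # hashes of the pattern and of the first window
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # Compare characters only when the hashes agree; this rules out spurious hits.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Drop text[s], shift left by one digit, append text[s + m].
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


print(rabin_karp("abracadabra", "abra"))
```

A larger prime q makes spurious hits rarer, at the cost of larger intermediate values; the result is correct for any q because every hash hit is verified.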
Now, if there were c hits that were spurious, what we referred to earlier as false alarms, the expected matching time including these false alarms will be of the order of n - m + 1, which we have already accounted for, plus c times m, because with every spurious hit you need to do a scan over both P and T. We have already seen the worst case. The worst case was when T was a repeated n times and P was a repeated m times, and this led to order (n - m + 1) times m.

However, the average case is actually better. On average one can expect order of n by q spurious hits: the chance that an arbitrary t_s will be equivalent to p modulo q can be estimated to be 1 by q, and there are only order n positions. With c of the order of n by q, the matching time is of the order of n plus nm by q, so one can show that the running time in the average case is order of n, provided q is at least as large as m and the number of valid matches is small. What we see here below is the worst-case running time, and this is illustrated with the example of T being a sequence of n a's and P being a sequence of m a's; that is when we incur n - m + 1 hash hits, each of which needs a full check of m characters. Thank you.
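The worst case can be observed by counting the work done in verification, as a rough sketch. The function name verification_cost, the defaults d = 256 and q = 101, and the use of ord are assumptions; the function assumes the pattern is non-empty and no longer than the text.

```python
def verification_cost(text, pattern, d=256, q=101):
    """Return (hash hits, character comparisons made while verifying those hits)."""
    n, m = len(text), len(pattern)
    h = pow(d, m - 1, q)
    p = t = 0
    for i in range(m):
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    hits = comparisons = 0
    for s in range(n - m + 1):
        if p == t:
            hits += 1
            j = 0
            while j < m:  # character-wise check, stopping at the first mismatch
                comparisons += 1
                if text[s + j] != pattern[j]:
                    break
                j += 1
        if s < n - m:
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return hits, comparisons


n, m = 20, 5
print(verification_cost("a" * n, "a" * m))  # n - m + 1 hits, each costing m comparisons
```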