Hello, and welcome to this next lecture on Data Structures and Algorithms. Today, we will conclude our discussion of pattern matching on strings with the very well known Knuth-Morris-Pratt algorithm, also abbreviated as the KMP algorithm. Recall our discussion of string matching using a finite state automaton built from the pattern. We devised an interesting lookup table for the transition function delta of the finite state automaton, with the goal of finding all matches of the pattern in the input text in time that is linear in the number of characters in the text. Now, we are going to leverage that same idea, but instead of constructing the entire finite state automaton, we will make use of the delta-style lookup constructed on the pattern to directly match the text, while avoiding as many backward traversals as possible. In fact, we will show that the total number of backward steps during matching is bounded, and the same idea will be used to construct the lookup table for the pattern itself. So, here we go. The naive algorithm had no pre-processing of the pattern, but required a significant amount of time for matching the pattern on the text: O((n - m + 1) m). The Rabin-Karp algorithm incurred linear, O(m), time pre-processing the pattern. Recall that this was using hash functions; but the worst-case matching time did not reduce significantly. In fact, it was still O((n - m + 1) m). The finite automaton gave you linear, O(n), matching time, but the construction of the automaton turned out to be expensive: O(m^3 |Sigma|), where Sigma is the alphabet. We are going to leverage the construction of the finite automaton, get rid of what was unnecessary in it, that is, remove the unnecessary baggage, and use the idea behind the delta function to reduce the pre-processing time to O(m), while still maintaining O(n) matching time.
This is the story. So, the idea of the linear time string matching algorithm of KMP is to avoid computing the transition function delta explicitly, and instead to leverage the idea by which delta was computed to build a so-called prefix function pi. This prefix function maps each index q of the pattern P to the length of the longest prefix of P that is a proper suffix of P[1..q]. That is, pi[q] is the maximum value of the index k, with k < q, such that P[1..k], the prefix of P of length k, is a suffix of P[1..q]. So, k is the length of the longest prefix of P that is a proper suffix of P[1..q]. Recall that this was exactly the idea used to compute delta; we are going to use pi directly. So, a brief recap: if your pattern P is a, b, a, b, a, b, we initialize pi[1], the index k of the longest prefix of P that is a proper suffix of P[1] = a. There is only one character, so the only proper suffix is the empty string, corresponding to the empty prefix; hence pi[1] = 0. What about pi[2]? We list the proper prefixes and proper suffixes of P[1..2] = ab, and we find that there is no prefix of P that is a proper suffix of ab. So, pi[2] is also 0. At the next level, we find that it is indeed possible to find a prefix of P, namely a, which is a proper suffix of a, b, a. So, pi[3] = 1, pointing to a. How about pi[4]? Well, there is indeed a prefix ab that is a proper suffix of a, b, a, b, which gives you an index of 2. Now, we will see how to compute this 2 without having to enumerate the proper prefixes and proper suffixes at every step. In fact, this is exactly the idea that is going to be used for string matching itself. We are going to determine the number of characters that can be skipped in the pattern matching process. These are characters for which we are assured of not having matches, and we will make use of the lower values of pi in this computation of pi for higher values of q.
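The table-building idea just described can be sketched in Python. The function name compute_prefix and the 0-indexed string handling are my own choices for illustration; the function returns pi[1..m] as a list, matching the table worked out above for the pattern ababab.

```python
def compute_prefix(P):
    """pi[q] = length of the longest prefix of P that is a
    proper suffix of P[1..q] (1-indexed q, as in the lecture)."""
    m = len(P)
    pi = [0] * (m + 1)          # pi[0] unused; pi[1] = 0 by definition
    k = 0                       # length of current longest prefix-suffix
    for q in range(2, m + 1):
        while k > 0 and P[k] != P[q - 1]:
            k = pi[k]           # fall back to a shorter matching prefix
        if P[k] == P[q - 1]:
            k += 1              # extend the prefix-suffix by one character
        pi[q] = k
    return pi[1:]               # [pi[1], ..., pi[m]]
```

Running this on "ababab" reproduces the table 0, 0, 1, 2, 3, 4 derived by hand above, without enumerating prefixes and suffixes at every step.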
So, we continue this and we find that for pi[5] the length of the longest prefix, aba, which is a proper suffix of P[1..5], is 3. P[1..6] has a longest such prefix of length 4, so pi[6] = 4. So, given the table, let us try to do pattern matching on a text T. We will revisit the construction of that table in a more efficient manner once we have learned sufficiently from the KMP pattern matching process itself. So, the idea is that we compare each character of P with text T to obtain partial matches, and we keep track of the number of characters of pattern P that have currently been processed, or matched, in text T. NCM stands for number of characters matched. So, pi[NCM] tells you the length of the longest prefix of P that is a proper suffix of the matched portion of the pattern ending at NCM. This determines the number of characters that can be skipped in text T: the NCM minus pi[NCM] alignments that you know will certainly not match in text T can be skipped. Let us illustrate this. So, we start with A against C; there is no match, and we can simply proceed, since we are only interested in looking up pi when there is some partial match. So, yes, starting at the next character we do find a partial match, until we hit a C and an A which do not match. In all, we have matched four positions. What we are going to do now is decide if we need to skip, and by how much. So, we look at pi[4], which corresponds to the length of the longest prefix that is a proper suffix of the matched pattern P[1..4], ending at B, and it turns out that this prefix is AB. So, we know that this is the length we should be careful about. We start matching characters at the next position, and we find that we have matches until position 4; at position 5 of the pattern there is a mismatch. Now, how much should we move to the right? We look at pi[4], the length of the longest prefix of the pattern that happens to be a proper suffix of the matched text, and that is 2.
So, what we know is that we can move to the right and start matching at the next position following AB; we need to start matching this A against this C. We know that there is a chance of continuing the match starting there, because this prefix AB of the pattern lines up with the AB just matched in the text. So, the best we can do is move 2 positions to the right, which is 4 minus pi[4]. Only the pi[4] matched characters matter to us, the remaining do not; so we subtract pi[4] from 4 and move 2 positions to the right. We know AB matches AB, and we actually need to see whether the next character, A, will match C; unfortunately, we do not get a match between A and C. Well, what can we do? Can we still salvage something? Well, we know that 2 positions have already matched. So, what we can do is look at pi[2]. Now, what is pi[2], the length of the longest prefix of AB which is a proper suffix of AB? It is 0, as we know from the table. Since there is actually no prefix match at all, we compute 2 minus pi[2], which is 2 minus 0 = 2. So, we can confidently skip that many characters and restart the matching at the next position. So, we start by matching A against C: no match. We proceed, and starting at the next position we have 3 matches; again, what do we find? There is a mismatch at the fourth position. We look at pi[3]: the length of the longest prefix of A, B, A that happens to be a proper suffix of A, B, A is 1, and that is this single character A here, and I know that this A will match this A. So, what I can do is skip the characters that come in the middle, and what does that mean? It means shifting by 3 minus pi[3], which is 3 minus 1 = 2 characters. So, skip these 2 characters. You can see that a lot of unnecessary matching has been avoided simply by making use of pi. We continue: A matches A, then A does not match B, and now there is actually no scope for further matching, because we have exceeded the right end of T.
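The shift rule used repeatedly in this walkthrough can be sketched as a tiny Python snippet. The prefix table is hard-coded from the earlier discussion of the pattern ababab, and the function name is my own choice for illustration.

```python
# prefix table pi[q] for the pattern "ababab", as derived in the lecture
pi = {1: 0, 2: 0, 3: 1, 4: 2, 5: 3, 6: 4}

def shift_after_mismatch(q, pi):
    """After matching q pattern characters and then mismatching,
    the pattern may safely shift right by q - pi[q] positions."""
    return q - pi[q]
```

All three mismatch points in the walkthrough give a shift of 2: after matching 4 characters (4 - pi[4] = 4 - 2), after matching 2 (2 - 0), and after matching 3 (3 - 1).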
Note that there are only n - m + 1 possible shifts of the pattern in the text T. So, here is the overall KMP algorithm, which assumes that we have obtained the pi table already. You first compute the prefix function pi; we will soon see that the same process, that is, matching the pattern P against P itself, will get you pi. But for the time being, we will assume pi is already known. Then, for each position i of the text, here is what you do: you keep checking whether P[q + 1] equals T[i]. While q > 0 and P[q + 1] is not equal to T[i], you set q to pi[q]; you are basically looking back and seeing where you need to reset. Then you check whether P[q + 1] equals T[i]; if yes, you increment the value of q. If q reaches m, the end of pattern P, you have actually got a shift at i - m where there is a pattern match, and here too you set q to pi[q] to continue looking for further matches. So, the skip of NCM minus pi[NCM] is exactly what happens in the step q = pi[q]. Now, how do you compute the prefix function pi? How about running the same process on the pattern itself: replace the text T with P, start with the leftmost index of P, and construct higher values of pi based on the matches or mismatches at the lower indices. So, here is the compute-prefix function. What you note is that we maintain an index i into P, but treat P as if it were T, and we again have the same skip, where you set q = pi[q]; note that pi[1] is initialized to 0, which we already discussed. While P[q + 1] does not equal P[i], you keep looking up pi[q]; once you have exited this loop and P[q + 1] equals P[i], q is set to q + 1, and pi[i] is set to q: the longest prefix of P which is a proper suffix of P[1..i] happens to be P[1..q]. What about the running time?
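The two procedures just described, the matcher and the prefix computation it bootstraps from, can be sketched together in Python. The function name kmp_match, the 0-based shifts it returns, and the string indexing are my own choices; the structure follows the q = pi[q] fallback logic of the lecture.

```python
def kmp_match(T, P):
    """Return the 0-based shifts at which pattern P occurs in text T,
    following the KMP scheme described in the lecture."""
    n, m = len(T), len(P)
    # --- prefix function: run the matching process of P against itself ---
    pi = [0] * (m + 1)           # pi[1] = 0 by definition
    k = 0
    for q in range(2, m + 1):
        while k > 0 and P[k] != P[q - 1]:
            k = pi[k]            # fall back, reusing lower values of pi
        if P[k] == P[q - 1]:
            k += 1
        pi[q] = k
    # --- scan the text; the index i never moves backwards ---
    matches = []
    q = 0                        # NCM: number of pattern characters matched
    for i in range(n):
        while q > 0 and P[q] != T[i]:
            q = pi[q]            # skip q - pi[q] alignments of the pattern
        if P[q] == T[i]:
            q += 1
        if q == m:               # full match ending at text index i
            matches.append(i - m + 1)
            q = pi[q]            # keep going: overlapping matches allowed
    return matches
```

For example, kmp_match("abababab", "ababab") finds the two overlapping occurrences at shifts 0 and 2, precisely because after a full match q is reset to pi[m] = 4 rather than 0.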
So, first of all, in computing the prefix function we iterate over the pattern P m times; there is, of course, a bit of backtracking that happens when you set q = pi[q], but one can show that this cannot happen more than m times in total, since q is incremented at most m times and every backtracking step strictly decreases q. So, the overall computation of pi is O(m). Likewise, the pattern matching of P on T iterates over the index i of the text O(n) times. Again, the additional backtracking might seem to create a problem; however, one can show by the same amortized argument that the total number of backtracking steps is bounded by the number of increments of q. For the details, you can read the proof in CLRS, in the section on the Knuth-Morris-Pratt algorithm. So, the overall running time is O(n + m), which is O(n) since m <= n. Thank you.