Hi, and welcome to this next lecture on string matching. Recall our naive algorithm for string matching, which basically looked for matches of the pattern P in the text T starting at every position in the text. What concerned us about the naive algorithm was that there are potentially prefixes of the pattern P which also occur as suffixes of the text that has already been matched. So, imagine that we are matching the pattern P against a segment of T and we find a mismatch. In the process, some segments of P have already matched T. Would we not want to leverage them? That is precisely what we will discuss today, through the construction of a finite state automaton.
What is a finite state automaton? It is a simple machine that we will use for string matching. A finite state automaton has an associated alphabet, and given any symbol from the alphabet, the automaton makes a transition from one state to another. So, this is what a finite state automaton looks like. From state S1, on seeing a symbol A1 from an alphabet Sigma, you might make a transition to S2. However, on seeing another symbol A2 from Sigma, you might make a transition to S3. You can also have self-loops on certain characters. You cannot have outgoing transitions to two different states on the same symbol. So, it is not possible to also have S1 go to S3 on A1. We are interested in deterministic automata; a transition of that sort is allowed only in a non-deterministic automaton.
You also have something called an accept state. So, on a particular symbol A3, I might go from S2 to S4 and call S4 an accept state. What does this automaton do? It basically accepts strings of the form A1, A3. Of course, we have not completed the automaton. So, on S2, if I see A2, I might go to S3 and keep looping around there as long as I see A1. Unfortunately, I have not created an exit back from S3. So, imagine that I have a transition from S3 to S1 on A2 as well as on A3. From S2, I might loop back to S2 on A1.
So, now we can complete the set of strings that this automaton will accept. It accepts A1, A3, and I am assuming here that I begin in state S1. It also accepts A1, A1, A3 and A1, A1, A1, A3. So, it basically accepts one or more A1s followed by A3. It can also accept strings prefixed with A2, A1, and then A2 or A3, followed by A1, A3, and so on. So, there is a whole bunch of strings that it can accept.
Our goal, however, is to construct a finite state automaton for a fixed pattern P in order to find all occurrences of the pattern P in the input text T. The pattern P is assumed to be fixed, whereas the input text can vary. We will invest more time in constructing the automaton for the pattern P, and the benefit we will enjoy is that we can quickly scan the text T for instances of the pattern P.
So, formally, a finite state automaton M is a five-tuple which consists of a finite set of states Q, a start state q0, and accepting states A. Recall that we denoted an accepting state with two concentric circles. The start state was denoted by an incoming arrow. Sigma is a finite input alphabet; on the previous slide our Sigma was A1, A2, A3. Delta is a transition function which takes you from one state to another. So, delta at S1 with input A1 takes you to S2, and so on. You can list all the deltas associated with the finite state automaton illustrated on the previous slide.
Now, what you want is to construct an automaton that accepts exactly the strings corresponding to the pattern. So, let us illustrate with an example. We have the pattern P equals ABC, and we need to see what the matches of ABC are in the text ABABABC. I have constructed this finite state automaton corresponding to the pattern. It has four states: 0, 1, 2 and 3. So, there is a start state, an accept state and two intermediate states. The alphabet Sigma consists of A, B and C.
The delta as specified in this finite state automaton is as follows: delta on state 0 with input A takes you to state 1, delta on state 1 with input B takes you to state 2, delta on state 2 with input A takes you back to state 1, and so on. Now, does this automaton accept only instances of this pattern P? You can convince yourself that any sequence which is not ABC, for example C, B, A, will not take you to the accept state. On C there is nothing you can do at state 0, on B you keep looping back to 0, on A you move to 1, from where you move to 2 on B, and from 2 you move back to 1 on A. So, if I had C, B, A, B, C, I will start at 0, not move away from 0 until I see the A, and then after B and C I will be in the accept state 3. However, suppose the text is A, B, A, B, C. By the time I have read the first B I will be in state number 2. However, when I see the next A I will be back in state number 1, which means I have already registered the first A of a fresh match against this pattern. The rest is just routine: B followed by C.
How do you construct such an automaton, one that accepts matches corresponding only to ABC and not to any other character sequence, while preserving some memory of the characters that have already been matched so far? We will illustrate the algorithm for constructing the finite state automaton with a slightly more difficult example. The pattern is A, B, A, B, C and, as I pointed out, we need to keep track of character sequences in P that have already been matched in the text T. So, the goal is to keep track of the character sequence, and by sequence I actually mean a prefix of P, that has been matched so far. How do we do this? Well, first of all we assume that the alphabet Sigma is exactly what we see in the pattern: A, B, C. We will also assume that the set of states Q has one state for each position in this pattern.
So, we will have positions 0, 1, 2, 3, 4. We will also have a position 5 corresponding to the accept state. Now, before I write down my algorithm, let me specify what I need in order to keep track of the prefix of P that has been matched so far. Given any input sequence X, I am going to define a function sigma (the suffix function, not to be confused with the alphabet Sigma) that takes X as input and outputs the length of the longest prefix of P that is a suffix of X. Now, what is this X? X is the sequence of characters that the automaton has read so far. I would like to know where to hop back into the automaton so that the prefix of P that has already been matched, as a suffix of X, is registered or memorized. Let us complete this definition. More formally, sigma of X is the max over all indices k such that P k, which is the first k characters of P, happens to be a suffix of X. We are going to write this with a symbol that looks like the subset operation, but what we mean here is "suffix of".
Now, let us see what this means for our sample pattern a, b, a, b, c. I am going to start with my states 0, 1, 2, 3, 4 and 5, with 5 being the accept state and 0 being the start state, and I now need to identify where my arrows should go. On state 0, when I see an a, I go to 1; from 1 I go to 2 when I see a b; at 2, when I see another a, I go to 3; from 3, when I see a b, I go to 4; from 4, when I see a c, I go to 5. This is routine. So, suppose I have so far seen the string a. When I see a b next, I know that I move to state 2. How about c? When X is a, c, I ask this question: what is the length of the longest prefix of P that is a suffix of X? A suffix of X must include the last character c, and no prefix of P ends in c here, so the answer is 0. So, when I see a c, I move back to 0.
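As a small illustration of the definition above, here is a minimal Python sketch of the suffix function. The name `suffix_function` and its argument names are hypothetical, chosen only for this example; the lecture itself only defines the function verbally. It tries the longest candidate prefix first and shrinks it until it is a suffix of X.

```python
def suffix_function(pattern: str, x: str) -> int:
    """Length of the longest prefix of `pattern` that is a suffix of `x`.

    Hypothetical helper for illustration; brute force, not optimised.
    """
    # A prefix longer than x (or longer than the pattern) cannot qualify.
    k = min(len(pattern), len(x))
    # Shrink k until pattern[:k] is a suffix of x. The loop always stops,
    # because the empty prefix (k == 0) is a suffix of every string.
    while not x.endswith(pattern[:k]):
        k -= 1
    return k


pattern = "ababc"
print(suffix_function(pattern, "ab"))  # 2: "ab" is a prefix of P and a suffix of X
print(suffix_function(pattern, "ac"))  # 0: no non-empty prefix of P ends in "c" here
```

The two calls mirror the cases discussed so far: reading b after a advances to state 2, while reading c after a falls back to state 0.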
What if I see an a instead of c? Then X is a, a. What is the length of the longest prefix of P that is a suffix of a, a? Well, it is just 1. So, on seeing another a, I loop back to state 1. Please note that we are already provisioning for keeping track of the prefix of P that has been matched so far. So, let us continue our journey. What if I see a b next? Well, I know that if I see a b next I move on to state number 2, and since there are no other options left at state 1, I will now consider a, b. The string a, b is what I have when I have reached state number 2, and if I add an a to it I am in state number 3. If I see a b instead, what do I do? Well, there is no forward transition possible. What is the length of the longest prefix of P that is a suffix of X? It turns out that it is 0. So, when I see a b in state number 2, I have no option but to go back to 0. This is also intuitive: for the string a, b, b you basically have to start any subsequent match from scratch, from the start of P. Now, what if the next character I see is c, so that X is a, b, c? Well, the length of the longest prefix of P that is a suffix of X is still 0. So, I am going to mark this arrow from 2 back to 0 with two transitions, b and c.
The only other option at state 2 is a, and we know that if a is the next character I move on to state number 3. At state number 3, again, I have the option of seeing a b, which takes me to state number 4. If I see another a, what do I do? Note that I am at state number 3 and I have seen an a, so X is a, b, a, a. What is the length of the longest prefix of P that is a suffix of X? It turns out to be 1, which means I move from state number 3 to state 1 when I see an a. So, here again we are registering a character that has already been seen, and this could be handy. What if my next character was a b? Well, we already know we go to 4. And what if my next character was a c? What is the length of the longest prefix of P that is a suffix of X? Unfortunately, it is 0.
So, this means that at state number 3, when I see a c, I go back to 0. We now move on to state number 4 to complete the story. What are the possible next characters? A c will get me to the accept state. On an a, the new string is a, b, a, b, a, and the longest prefix of P that is a suffix of this new string is a, b, a, which has length 3. So, this takes me back to state number 3. A b will take me back to state number 0, and a c will take me to the accept state. So, this is the story of the longest prefix of P that is a suffix of X.
Now, what does this translate into? In summary, what we are saying is that delta, the transition from a particular state q on seeing a particular character alpha, is sigma applied to the prefix of the pattern P that led me to q, that is P q, appended with alpha. This is exactly what we see here. When you are at state 3, q is 3 and P q is a, b, a. Different values of alpha give different transitions, so an alpha of a takes you to sigma of a, b, a, a, for which the value of sigma is 1. So, you can see that delta(q, alpha) = sigma(P q alpha) gives you the destination state for every q. You can continue this process and convince yourself that this definition indeed suffices for us.
Now, how do you translate this observation into an algorithm? The intuition behind the algorithm is as follows. What is the desired invariant? When I have seen the first i characters of T, starting from position 1, I would like to be in the state which corresponds to the length of the longest prefix of P that is a suffix of the text read so far. This should be the state after I have seen position i. This is the basic idea, and here is the algorithm that will help you maintain this invariant.
So, let us call this algorithm Construct Finite Automaton, for a given pattern P and alphabet Sigma. We will assume that the pattern P has length m, and we know that our states range from 0 to m. So, we loop over the states q from 0 up to m. What we did in our informal yet illustrative process was to iterate over each character alpha in Sigma, so that is the inner loop. The next thing we did was to keep track of suffixes of X and try to find the longest prefix of P among them. This means that I keep scanning backwards from wherever I am, and wherever I am is basically P q followed by the next character. So, let us make that explicit: scan backwards from the end of the string P q suffixed with alpha to find the state to go to. And what is that state? It is sigma of P q alpha. But how do you get this sigma? Let us fill that in.
We keep track of the position in the backward scan using a variable k. k is initialised to the minimum of the last position m and q plus 1. This is because we are trying to look one step ahead of the current state and see where this alpha should lead. As we saw, when q was 3 and alpha was b, our desired destination was 4, and therefore q plus 1 is important. So, k is min of m and q plus 1; of course, if q plus 1 overflows past m, we can only go up to m. Now, what do we do with k? While P k, the prefix of P ending at position k, is not a suffix of P q alpha, we set k equal to k minus 1. Once we find that P k is indeed a suffix of P q alpha, and this will certainly be the case by the time k becomes 0, since the empty prefix is a suffix of every string, we set delta of q comma alpha to k. Of course, we need to close the two for loops. This gives us delta for every state and every alpha, and delta of q comma alpha is the length of the longest prefix of P that is a suffix of P q alpha.
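The construction just described can be sketched in Python as follows. This is only an illustrative sketch: the function name `compute_transition_function` and the choice to store delta as a list of dictionaries (one per state) are assumptions made for this example, not something specified in the lecture.

```python
from typing import Dict, List


def compute_transition_function(pattern: str, alphabet: str) -> List[Dict[str, int]]:
    """Build delta for `pattern`; delta[q][a] is the state reached from q on a.

    Hypothetical helper; the representation of delta is an assumption.
    """
    m = len(pattern)
    delta: List[Dict[str, int]] = [{} for _ in range(m + 1)]
    for q in range(m + 1):              # states 0 .. m
        for a in alphabet:              # every character alpha in Sigma
            candidate = pattern[:q] + a  # P_q suffixed with alpha
            k = min(m, q + 1)           # look one step ahead, capped at m
            # Scan backwards until the prefix P_k is a suffix of P_q alpha.
            # k == 0 always succeeds: the empty prefix is a suffix of anything.
            while not candidate.endswith(pattern[:k]):
                k -= 1
            delta[q][a] = k
    return delta


delta = compute_transition_function("ababc", "abc")
print(delta[3])  # {'a': 1, 'b': 4, 'c': 0}
```

The printed row for state 3 agrees with the transitions worked out by hand: a leads to state 1, b leads to state 4, and c leads back to state 0.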
So, what we will do now, having constructed a finite state automaton for A, B, C, is match it against the text T, and we will actually see how we are able to leverage the memory that the finite state automaton gives us. The state transitions have also been tabulated below. This is nothing but our delta, obtained by virtue of our algorithm, and now we can see the algorithm in action. Starting at position 1, when you see the input A, the delta for state 0 tells you to go to 1, and that is what we have done. When you see a B, you go to 2. However, when you see an A, you go back to 1. So, after A, B, the next A takes you back to 1, the next B takes you to 2, and then you see another A, for which you go back to 1, and a B, which takes you to 2, and finally you are at state number 3 when you see a C. So, the match was just A, B, C, and we avoided going all the way back to 0 when we saw these As on two occasions. We saved at least two computations, or matches, which the brute-force algorithm would have incurred.
So, here is a simple finite state automaton algorithm specifically for matching an input text T. You are given delta, m and T, and s denotes a shift index into T. You start with the start state q equals 0, and for i equals 1 to n, as you scan position i of the text T, you make a transition from q to delta of q comma T i. If you have reached the final state, that is, q equals m, then you accept and you also print out the position of the match, which is the shift s equals i minus m. Of course, you can continue this whole process. What we did not do in our construction was complete the transitions from the accept state, but please note that transitions will also exist at the accept state.
As far as complexity is concerned, this finite state automaton algorithm is simply a linear scan, visiting every position in T once. Therefore, the complexity is just order n. However, there is a hidden cost, and the hidden cost is computing the delta for the pattern P.
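The matching loop described above can be sketched in Python as follows. The function name, the hard-coded table `DELTA_ABC`, and the use of 0-based shifts are assumptions made for this example. The row for the accept state 3, which the lecture leaves incomplete, is filled in here using the same longest-prefix rule.

```python
from typing import Dict, List


def finite_automaton_matcher(text: str, delta: List[Dict[str, int]], m: int) -> List[int]:
    """Return the shifts s (0-based) at which the pattern occurs in `text`."""
    shifts = []
    q = 0  # start state
    for i, ch in enumerate(text, start=1):  # i = number of characters read
        # Assumption: a character outside the alphabet sends us back to state 0,
        # since no non-empty prefix of the pattern can end in it.
        q = delta[q].get(ch, 0)
        if q == m:                 # accept state reached
            shifts.append(i - m)   # the pattern occurs with shift s = i - m
    return shifts


# Transition table for the pattern "ABC" over the alphabet {A, B, C}.
DELTA_ABC = [
    {"A": 1, "B": 0, "C": 0},  # state 0
    {"A": 1, "B": 2, "C": 0},  # state 1
    {"A": 1, "B": 0, "C": 3},  # state 2
    {"A": 1, "B": 0, "C": 0},  # state 3 (accept), filled in by the same rule
]

print(finite_automaton_matcher("ABABABC", DELTA_ABC, 3))  # [4]
```

Each text character costs one table lookup, which is why the scan itself is linear in the length of the text.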
The delta for P is computed offline and therefore does not have to be recomputed for every input text. So, this is a one-time cost, and it makes sense to incur a significantly larger one-time cost if you want to be efficient over a large number of input texts. Let us, however, note what the complexity of computing delta is. In the algorithm for constructing the finite state automaton, we had one scan over the pattern P in the outer loop, which runs over m. For each state, I run the inner loop over every element of the alphabet Sigma. So, basically, you need to multiply m with the size of Sigma. Is that all? Well, I need to keep going back on k, and in the worst case I might have to go back all the way to the left. So, the third aspect is this while loop, which multiplies in another factor of m. Finally, even to check whether P k is a prefix that matches a suffix of P q alpha, I might incur, in the worst case, an order m operation. So, multiplying again by m, the overall complexity is order m cubed times the size of Sigma. Thank you.