Welcome everyone to the first Track A session this year, 2019. We'll start with a talk by Michal Koucký on approximate online pattern matching in sublinear time. Thank you. I would like to tell you about our results on pattern matching, and I will try to explain what it is about. This is joint work with Diptarka Chakraborty and Debarati Das. Diptarka was a post-doc in Prague; now he is at the National University of Singapore. Debarati was a PhD student in Prague; now she is in Copenhagen, Denmark. Neither of them could come, so I have the opportunity to be here. So I would like to tell you about pattern matching. What is the problem? Pattern matching is a problem people have considered for a long time. You are given a pattern, which is a string P, and you are given a text T, and you are supposed to find all occurrences of the pattern P in the text T. Like over here: there is this occurrence and that occurrence. That's what people call pattern matching. This problem was studied already in the 50s, 60s and 70s, and we know good algorithms for it, which run pretty much in linear time in the lengths of P and T. The problem is very useful, and nowadays it is even more useful than before, because there is a lot of data you have to search in. For example, in genomic data you are given a protein sequence P and a big genome, and you are trying to find where the sequence appears in the genome.
But in these scenarios you typically don't want just exact matches, because there may be small mutations in the protein, so you want to find all occurrences that are close. You should consider a generalized problem where you not only look for exact matches in the text, but also for anything that looks similar. How do you define similar? In this work, what we look at is the so-called edit distance. What's edit distance? You've got two strings and you want to compare them: how similar are they? Edit distance is a measure that tells you how close they are to each other. You've got two strings P and Q, and you ask how many operations you have to apply to the string P to get to the string Q. What operations do I allow? I allow you to delete symbols; for example, over here I am deleting a Z from the string. I allow you to change symbols, so here I am changing one symbol to another. And I allow you to insert symbols into the string, which makes it longer. These are the operations we allow, and these are the typical operations you see in practice, because that's how mutations work. So that's how we measure similarity of strings, and we would like to find all occurrences of P up to this similarity. This is what people call approximate pattern matching. In the approximate pattern matching scenario you have the pattern P and the text T, and you want to find everything that looks similar to P. But of course there can be a quadratic number of such substrings in the length of T, so you don't really want to output all of them explicitly.
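To make the definition concrete, here is a minimal sketch (my illustration, not from the talk) of the textbook dynamic program for edit distance with unit-cost insertions, deletions, and substitutions:

```python
def edit_distance(p, q):
    """Unit-cost insert/delete/substitute edit distance between strings p and q."""
    m, n = len(p), len(q)
    # prev[j] = edit distance between p[:i-1] and q[:j] (previous row of the DP table)
    prev = list(range(n + 1))  # distance from the empty prefix of p: j insertions
    for i in range(1, m + 1):
        cur = [i]              # distance to the empty prefix of q: i deletions
        for j in range(1, n + 1):
            cur.append(min(
                prev[j] + 1,                           # delete p[i-1]
                cur[j - 1] + 1,                        # insert q[j-1]
                prev[j - 1] + (p[i - 1] != q[j - 1]),  # match (free) or substitute
            ))
        prev = cur
    return prev[n]
```

For example, `edit_distance("kitten", "sitting")` is 3: two substitutions and one insertion.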
Rather, you want to simplify the job a little and only find the positions where there is a good match. How do you do that? You define a parameter K_t for each position t from 1 up to n: K_t is the minimal edit distance between P and any substring of T ending at position t. So I minimize over starting positions i, take the substring from i up to t, and check the minimum edit distance. I would like to output all positions where this value is small, or more generally, for each position t I would like to output the value K_t; that is, I want to output K_1 up to K_n. Throughout, the pattern will be of length W and the text of length n; these are the parameters I want to look at. So let me tell you our results for this problem. Our results are two algorithms. First, an algorithm that solves the problem of finding K_1 up to K_n in time n times W to the 3/4. We call this sublinear because, in an amortized sense, if you look just at the dependence on the length of the pattern after normalizing by the length of the text, it is sublinear in the length of the pattern. This is the first algorithm of this type with a sublinear running time. But there is a small catch: we don't find the exact values of K_1 up to K_n; we find an approximation of these values, a constant factor approximation.
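As a baseline, the whole sequence K_1, ..., K_n can be computed exactly in time n times W by the classic dynamic program (essentially the Sellers algorithm that comes up later in the talk); a hedged sketch, assuming unit-cost edit operations:

```python
def all_k_values(p, t):
    """ks[j-1] = min over i of edit_distance(p, t[i:j]), for j = 1..len(t).

    Runs in O(len(p) * len(t)) time: one DP column per text symbol.
    """
    w, n = len(p), len(t)
    # col[i] = min edit distance between p[:i] and any substring of t
    # ending at the current position (including the empty substring).
    col = list(range(w + 1))   # before reading any of t, the best substring is empty
    ks = []
    for j in range(1, n + 1):
        new = [0]              # a match may start right here, at cost 0
        for i in range(1, w + 1):
            new.append(min(
                col[i] + 1,                           # consume t[j-1] without matching
                new[i - 1] + 1,                       # delete p[i-1]
                col[i - 1] + (p[i - 1] != t[j - 1]),  # match (free) or substitute
            ))
        col = new
        ks.append(col[w])
    return ks
```

For instance, `all_k_values("abc", "xxabcxx")` has value 0 at the position where the exact occurrence of "abc" ends.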
For some constant, which I am not going to specify, but it's not enormous, we can find values which approximate the true sequence K_1 up to K_n within a factor of, say, 10. That's what we can do, and that's in the offline setting. In the offline setting the two strings are given to us, and we can work on them as we wish. Then there is the online setting. In the online setting the problem is essentially the same: you want to calculate the sequence K_1 up to K_n, or a good enough approximation. But the setup is that we are given P first, and then T arrives symbol by symbol. Whenever the next symbol of T arrives, we are supposed to output an estimate K'_t. That's the game in the online setting. Here we achieve an approximation which is not quite a constant factor: the output is a constant factor times K_t, plus an additive term W to the 8/9. I won't explain where this additive term comes from, but what it means is that if the edit distance is large, bigger than W to the 8/9, then we have a constant factor approximation; if the actual value of K_t is smaller than that, then we don't get a good approximation. In the online setting we spend, on average, W to the 1 minus 1/8 time to process a symbol. And what we want to minimize in this case is not only the time but also the memory footprint. We work in the model where P is stored, and we only count the additional memory beyond storing P. So we store P, plus roughly W to the 1 minus 1/54 bits of additional memory.
Those are our two main results. How does this compare to what was known before? People have looked at approximate pattern matching for a long time. The first algorithm giving interesting bounds was the algorithm by Sellers from 1980, which achieved complexity n times W; that's the typical complexity, the product of the length of the text and the length of the pattern. There was a small improvement by Masek and Paterson, who shaved a log-factor off this running time. And basically, this is it: there is nothing substantially faster if you want a generic algorithm that works for all edit distance thresholds. There are specific algorithms if you limit yourself by a threshold k on how large an edit distance you care about: if the edit distance is bigger than the threshold k you may just output infinity, but if it is smaller you should output the value. There is a sequence of such algorithms that perform well with respect to the time, and they typically run in time n times k. The maximum possible edit distance is W, so the parameter k is between 1 and W, and with running time n times k, if k is sufficiently small, say square root of W, you get something much better than Sellers. There is yet another algorithm whose running time is close to linear if k is really small, say the fourth root of n. This is just a small fraction of what's known, but these are the main things.
What is important to compare against is the Sellers and Masek-Paterson bound of n times W, whereas our result runs in n times W to the 3/4. That's the distinction. But our algorithm is approximate, while theirs give the exact value of K_t for each t; that's another difference, so the results are not completely comparable. This is for the offline setting, where you have full access to the strings. In the online setting there is also quite a lot of work, looking at various measures of similarity between strings: not only edit distance but also distances like Hamming distance or L2 norms. Those norms are easier to compute, because you just align the strings and go over them. There is a generic reduction, which works in a black-box way, by Clifford et al.: a general reduction that takes an offline algorithm and turns it into an online algorithm. But it doesn't work for edit distance, because edit distance can insert and delete symbols, which is a problem for these algorithms. However, there are some results for edit distance. There are two results by Clifford and Sach: the first achieved running time k times log n per symbol, and then they shaved off the log n factor. The space is proportional to P: they preprocess the pattern P and keep it in memory, and then they get this running time. And there is another result, by Starikovskaya, who designed an algorithm which again works well for small values of k.
Her algorithm has running time per symbol roughly k squared times square root of W, plus k to the 17, so for k up to about W to the 1/20 it works in really good time, and it uses space roughly k to the 8 times square root of W. Again, for small enough k this is sublinear in the size of the pattern. This is how it compares to our solution: our solution gives running time W to the 3/4 per symbol and space W to the 1 minus 1/54, plus the length of P, which we have to store. So this is what's known about pattern matching. As for the size of the alphabet: I don't think it matters for these results; I don't think it matters for any of these results, basically. Sometimes it does, but not here. Now, this approximate pattern matching problem is a generalization of the problem of just computing edit distance. You have two strings P and Q of length n and you want to compute their edit distance; if you instantiate the pattern to be one of the strings and the text to be the other, then this is essentially exactly the problem of computing their edit distance, so the results we have seen also apply to computing edit distance. Edit distance can be computed in time roughly n squared over log squared n, and again there is a sequence of algorithms: if you limit how large an edit distance you care about by a threshold k, there is an algorithm in time n times k, and an algorithm in time n plus k squared.
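To illustrate the threshold idea just mentioned, here is a hedged sketch (my illustration, not from the talk) of the standard band trick behind the n-times-k bounds: if you only care about edit distance up to k, the DP path cannot stray more than k cells from the diagonal, so you can restrict the dynamic program to a band of width O(k):

```python
def banded_edit_distance(p, q, k):
    """Return edit_distance(p, q) if it is <= k, else None.

    Only DP cells within k of the diagonal are computed, giving O((|p|+|q|)*k) time.
    """
    m, n = len(p), len(q)
    if abs(m - n) > k:
        return None                # the length difference alone already exceeds k
    INF = k + 1                    # any value > k behaves like infinity
    prev = {j: j for j in range(0, min(n, k) + 1)}   # DP row i = 0
    for i in range(1, m + 1):
        cur = {}
        for j in range(max(0, i - k), min(n, i + k) + 1):
            best = i if j == 0 else min(
                cur.get(j - 1, INF) + 1,                        # insert q[j-1]
                prev.get(j - 1, INF) + (p[i - 1] != q[j - 1]),  # match / substitute
            )
            best = min(best, prev.get(j, INF) + 1)              # delete p[i-1]
            cur[j] = min(best, INF)                             # clamp at "infinity"
        prev = cur
    d = prev.get(n, INF)
    return d if d <= k else None
```

Restricting the band can only overestimate distances, and everything above k is clamped, so the answer is exact whenever it is at most k. (The n + k squared algorithm is cleverer than this.)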
And this is known to be pretty much the best possible, because a sequence of recent results shows that you cannot really improve on this quadratic time unless something unlikely happens: either we prove circuit lower bounds, which perhaps are true but we don't know how to prove them, so that's unlikely to happen right now; or we improve algorithms for satisfiability, basically falsifying the Strong Exponential Time Hypothesis. Under those assumptions the n squared time is pretty much optimal and we cannot do better. So people also looked for approximations, and there is a huge body of work on approximation. Two recent results are the main ones for our work: constant factor approximation algorithms. There were two constant factor approximations for computing edit distance last year. First, there is a quantum algorithm by Boroujeni et al. which runs in subquadratic time, namely n to the 1.708. Then there was a result by us, together with Elazer Goldenberg and Michael Saks: a classical algorithm which gives a constant factor approximation and runs in time n to the 1.647. That algorithm is the starting point for our algorithm; that's why I am mentioning these. The two problems, computing edit distance and approximate pattern matching, are closely related, so this is all relevant. Let me now give you some hint of the ideas we use to solve the problem. The main tool people use is the so-called edit distance graph, or pattern matching graph, which I will introduce in a second. So what's the edit distance graph?
The edit distance graph is the following graph. You have the pattern P on one axis and the text T on the other, and you make a grid graph with diagonal edges and horizontal and vertical edges, each with an assigned cost. What costs do you assign? You assign cost 1 to the horizontal edges, cost 1 to the vertical edges, and the diagonal edges have a cost which corresponds to whether the symbols at the given positions match: if I look at position i in P and position j in T, then the cost of that diagonal edge is 0 if the two symbols are the same and 1 if they are different. If you think about what this measures: horizontal edges correspond to removing symbols from T, vertical edges correspond to removing symbols from P, and a diagonal edge means you are matching a symbol, or mismatching if you pay cost 1. What you can show is that the edit distance between P and T corresponds exactly to the shortest path from the bottom-left corner, the point (0,0), to the opposite corner, the point (n,W); that is exactly the cost of transforming P into T using edit operations. So if you calculate the shortest path between those two points, you know the edit distance, and that can be done in time proportional to the size of the graph, so in time n times W, because this is a very simple graph. That's what the standard algorithms do: they just calculate the shortest path in this graph. For the approximate pattern matching graph, Sellers had the following idea: take the edit distance graph and augment it with edges of cost 0 that go from the starting point to all points at the bottom of the graph, so you can jump for free from (0,0) to any of the bottom vertices. It turns out that if you now calculate the shortest path from the start to any point at the top, you get exactly the values K_t: the cost of reaching the top vertex in column t is K_t. So in time n times W you can solve this approximate pattern matching problem exactly. Now, if we only care about approximation, how can we solve it faster? What's our main technique? We approximate this pattern matching graph by a kind of subgraph with added shortcuts. We take boxes of this graph, calculate exactly the cost of crossing each box from corner to corner, remove all diagonal edges, and just put in these shortcut edges. So we cover the graph by boxes for which we know the exact crossing cost, and remove everything else. If you do this, it gives you a rough approximation of the original graph. The question then is how to calculate the costs of these shortcut edges for a large enough fraction of the boxes: you want to put in as many boxes as you can, cheaply. That's where the technique of the edit distance paper comes in, the technique we used before for computing edit distance. It is based on the following idea. Here is how we proceed to find the boxes. We fix some epsilon, and let's say we break the text T into blocks of size W to the 1/2; you can choose different parameters, but let's say we do it like this. So we break the text T into these blocks, and for each block we want to find all its matches in P which are at edit distance up to epsilon. If we found this for each epsilon of the form 2 to the minus i, and put in the corresponding shortcut edges, that would give a good approximation of the overall graph. So for each of these blocks we want to find the boxes representing that this block approximately matches some substring of P, for matches up to distance epsilon. How can you find them? If you do it naively, this is exactly a pattern matching problem: the block is your pattern, and you look for its matches in the original pattern P. We know how much that costs: the length of the block, W to the 1/2, times W, which is W to the 3/2; and since the number of blocks is n over W to the 1/2, the total is n times W. So if you do it the way I just described, sure, you can calculate it for all epsilon, since there are only logarithmically many choices, but it costs you basically the same as before, and the algorithm is slower and moreover only approximate. So you want to do it faster. How? Here is an observation: if this block matches that substring of P, then any sub-block of the block also matches the corresponding sub-block over there. So I take the block I care about, pick a random sub-block of some length, say W to the 1/4 (the sub-block length is again a parameter), and I will find all the
matches to that sub-block in P. What kind of matches will I be looking for? Matches up to edit distance, say, 2 times epsilon. If you go up to distance 2 epsilon, then just by Markov's inequality, for at least half of the big blocks you are looking for, the corresponding sub-block I' is within edit distance at most 2 epsilon. So if you find all these sub-block matches, you can expand them and check whether they really extend to good matches of the whole block. That's the idea: sample a small sub-block, search for its close matches up to distance 2 epsilon, and expand all the close matches. Of course, the problem could be that there are too many of these small matches, so we set a threshold d, say W to the 1/4, and if there are more than d of them we do something else. But if there are only a few of them, at most d, then we expand all of them and we are happy: we found all the blocks, and you can prove that this works. What is the running time? What we saved is that instead of dealing with the whole graph, we deal with just one narrow strip, and the saving is the ratio between the length of the big block and the length of the small sub-block, which is W to the 1/4. If you do the calculation of the cost of doing this for every big block, it comes to n times W to the 3/4. How much do the expansions cost? Since we put the threshold d on the number of expansions we make, that also evaluates to the same bound: we make at most d expansions, each expansion costs W to the 1/2 squared, which is the cost of the usual edit distance computation on a block, and the number of blocks is n over W to the 1/2, so in total you again end up with n times W to the 3/4. This is the case when the number of candidate small matches is small, at most d. If it is bigger, you do something more complicated: you don't expand, because you cannot afford to, but if that happens you learn a lot of information, because you have found many substrings of P which look similar to I'. So what you do is also look for matching blocks in T for this I', and then you can fill in a whole grid of boxes, because everything in this picture which is blue corresponds to good matches between substrings of P and T. You learn a lot of information cheaply, and if you calculate the cost, you again end up with the same total running time, n times W to the 3/4. If you leverage these two scenarios, as was done in the edit distance paper before, you get a covering of the whole matrix by good enough boxes, which approximates the pattern matching graph. That is the offline algorithm; you have to work out the details, but it follows this paradigm. For the online version, we do the same thing left to right: we don't do it all at once, but in a sequence, processing first this block, then the next one, and so on, storing only a small amount of information. At the same time we have to be putting the pieces together: there is a second stage where you take this approximate pattern matching graph, covered by shortcut edges, and calculate the answer from it, so we have to interleave these two phases to get the overall online algorithm. So much for our results; let me just say what happened afterwards. There is now an improvement on our algorithms: our pattern matching algorithm has been improved to n times W to the 1/2 plus epsilon, for any epsilon, and that can also be used to get a constant factor approximation algorithm for edit distance running in time n to the 3/2 plus epsilon. Then there is another improvement, which is again slightly orthogonal: a result by Brakensiek and Rubinstein, and one by me and Michael Saks, giving an edit distance algorithm which runs in time n to the 1 plus epsilon, for any epsilon, so almost linear; but it only works when the edit distance is large, say n to the 1 minus gamma. Neither of these results works for the smaller range, and we don't know how to do it for the whole range. This is the same issue we have with the online algorithm: the online algorithm has an additive error, these algorithms also have an additive error, and that's why they work only in the high edit distance regime. Let me conclude with open questions. I think the most tantalizing open question is: can we do approximate pattern matching in time n plus W, so in linear or quasi-linear time? I do believe there might be such an algorithm; I have no idea how to make it, but I am confident there must be one. Or, slightly weaker: can you at least do it in time n times W to the epsilon, for any epsilon you choose? Thank you.
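To make the sub-block sampling idea from the middle of the talk concrete, here is a heavily simplified sketch (my illustration; the function names, parameter choices, and interfaces are all mine, and the real algorithm does much more bookkeeping across epsilon scales and the dense case):

```python
import random

def _ed(a, b):
    """Plain quadratic edit distance, used here only to score candidate windows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def candidate_starts(block, p, eps, d, sub_len):
    """Sparse case of the sampling trick: pick one random sub-block of `block`,
    find its close matches in p, and return candidate start positions in p
    where the whole block might approximately match.

    Returns None when more than d candidates appear (the dense case, which the
    real algorithm handles by reusing the discovered matches instead of expanding).
    """
    off = random.randrange(len(block) - sub_len + 1)
    sub = block[off:off + sub_len]
    budget = int(2 * eps * sub_len)      # allow edit distance <= 2*eps*|sub-block|
    cands = []
    for i in range(len(p) - sub_len + 1):
        if _ed(sub, p[i:i + sub_len]) <= budget:
            cands.append(i - off)        # shift back to where the whole block would start
            if len(cands) > d:
                return None              # too dense: do not expand
    return cands
```

Each returned candidate would then be "expanded": one computes the edit distance between the whole block and the window of P around that start, which the threshold d keeps affordable.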