So welcome everyone, it's a pleasure to have Michal Koucký as our speaker today. Before I introduce Michal, let me first go around the table and say hi to everyone who's on time today. So we have a group led by Irfan from Indiana University, so hi everyone. Then we have Sam from UCSD, hi there. Then we have a group from Caltech, oh, Benny is there, hi, he's just upstairs from me. Then we have a nice big group from Toronto, so welcome guys. And finally there's a group led by Yi Jun from the University of Michigan, so welcome. Hi, Grant. Okay. So before we get started, let me acknowledge the other organizers who are working behind the scenes to bring you these talks: that's Clément Canonne and Anindya De, Gautam Kamath, Ilya Razenshteyn and Oded Regev. And let me also say that the next talk, two weeks from now, will be by Urmila Mahadev on classical verification of quantum computation. So today we're very happy to have Michal Koucký from Charles University in Prague to give the talk. Michal got his PhD from Rutgers in 2003, advised by Eric Allender. After that he spent a little time as a postdoc at McGill and then at CWI, and then he joined Charles University. He's done a lot of work in complexity in general: lower bounds in many models of computation, data structures, algorithms, lower bounds for streaming, communication complexity. But today he's going to tell us about an upper bound, a very nice algorithmic result: an approximation algorithm for edit distance, which got the best paper award at FOCS. So congratulations, welcome, and we look forward to the talk. Thank you for the invitation to give this talk, and thanks for showing up before the STOC deadline. So let me tell you something about our recent result, approximating edit distance within constant factor in truly subcubic time. This is joint work with Diptarka Chakraborty, who was a postdoc here and is now at Weizmann; with Debarati Das, who is my student; Elazar Goldenberg, who is in Tel Aviv-Yaffo; and Michael Saks from Rutgers. Maybe just before you start, to break the suspense: you said subcubic time, and I think you do mean sub-quadratic time. Sub-quadratic time, yes. Okay. So let me first tell you what we mean by edit distance and why we are looking at this distance. Edit distance is a measure of similarity of strings: we have two strings x and y and we want to somehow compare them to find out how similar they are. You probably know a lot of different measures, and I will mention some of them, but the one we are interested in today is the vanilla edit distance, which is defined as follows. You take a string x and a string y, and you ask how many elementary operations you have to apply to x to turn it into y. The elementary operations we allow are the following: you can remove letters, so for example I remove this letter z here; you can flip symbols, so here I flip w into g; and you can insert symbols, so here between i and k I inserted j. These are the three basic operations we look at. And what we want to know, given x and y, is the edit distance: the minimum number of such operations needed to change x into y. This defines a nice distance measure.
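To make the definition concrete, here is a minimal sketch in Python (an illustration for these notes, not the speaker's code) that transcribes the three operations directly into a recursion:

```python
from functools import lru_cache

def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance: the minimum number of deletions,
    substitutions, and insertions turning x into y."""
    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        if i == len(x):
            return len(y) - j            # insert the rest of y
        if j == len(y):
            return len(x) - i            # delete the rest of x
        if x[i] == y[j]:
            return d(i + 1, j + 1)       # matching symbols are free
        return 1 + min(d(i + 1, j),      # delete x[i]
                       d(i, j + 1),      # insert y[j]
                       d(i + 1, j + 1))  # substitute x[i] by y[j]
    return d(0, 0)

# Two substitutions and one insertion:
assert edit_distance("kitten", "sitting") == 3
```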
So edit distance satisfies various properties like symmetry, the triangle inequality, and so on; we will come back to that. But this is not the only measure people study, and there are various other variants of edit distance. The vanilla edit distance I just defined is sometimes also called Levenshtein distance. It's closely related to longest common subsequence: the longest common subsequence of two strings is kind of a dual measure to this, which is a simple exercise. People also consider Ulam distance: that's basically edit distance, but you assume the alphabet is really large and each symbol appears just once. If you put on that restriction, it's easier to work with algorithmically. Then there is edit distance with moves, where you have an additional operation: you can take a whole block and move it somewhere else, and people look at that too. And of course there is Hamming distance, which is kind of the simplest edit-distance-like measure, because you just count how many bit flips you have to do to go from one string to the other; you don't allow insertions or deletions, so it's very easy to work with. So these are various variants of edit distance that people look at and consider, and what we will focus on today is just the vanilla edit distance, the Levenshtein distance. Why do people look at these measures at all? Because they are useful in various contexts: in bioinformatics, pattern recognition, text processing, information retrieval. I guess the biggest place where they are used is bioinformatics, because people try to align sequences of DNA and compare how similar one DNA sequence is to another, or to compare proteins. That's why people want to compute this quickly, and the inputs on which you have to compute these measures are fairly large, so you really want fast algorithms. So let me tell you what our result is, and I will tell you more about it later. It's a constant factor approximation for edit distance: we have an algorithm which runs in time n^{2 - 2/7}, that is n^{12/7}, and 12/7 is something like 1.714. We also have somewhat faster algorithms which you can get by recursion, but this is the easiest one, the base algorithm we get. So this is truly sub-quadratic, with a saving of n^{2/7}, more than a quarter in the exponent, and we get a constant factor approximation; by constant factor I mean a multiplicative constant. So what is known about computing edit distance? People have looked at edit distance for quite some time, and various algorithms were invented for it; some of them appeared multiple times in different areas, for example bioinformatics independently invented some of these algorithms. But what I'm going to tell you about is what was most influential in theoretical computer science.
So I'm not looking at what happened in bioinformatics. The first algorithm, proposed by Wagner and Fischer in '74, is a quadratic algorithm, and it's an algorithm many of you probably know: a dynamic programming algorithm, kind of a typical example for dynamic programming. You take the two strings and look at how you can break the problem into pieces: you solve the edit distance problem on each piece separately and take the best cut. If you implement this properly, you get a quadratic algorithm. This algorithm was slightly sped up by Masek and Paterson in the '80s, who shaved off basically one log n factor from the running time, so we have n^2 / log n running time. And very recently there was a paper which shaves off two log factors, so the running time is like n^2 log log n / log^2 n. That's pretty much the best we know of today. These are worst-case complexities, and people also looked at these algorithms from the perspective of the running time as a function of the actual edit distance: if the edit distance is k, how quickly can I solve the problem? Ukkonen in '85 observed that you can fairly easily get an algorithm which runs in time O(kn). So if the edit distance is small, this runs fast; but if the edit distance k is close to the worst case, which is linear, this is still quadratic. This was improved in a sequence of works by Myers, Landau, Vishkin and others to an algorithm which runs in time n + k^2: the dependence on the length of the strings (n is always the length of the strings) is linear, plus there is a quadratic dependence on the actual edit distance. So if the edit distance is small, say square root of n, this runs in linear time, which is really nice; however, if the edit distance is large, close to n, this is again basically a quadratic algorithm. So this is the state of the art up until now. And there are many others; I just wanted to mention some. People have also looked at computing edit distance in various other setups: for example, we looked at it in the streaming model and have some algorithm there, and then Belazzougui and Zhang got an n + k^8 algorithm, and they can even get sketching. So there's a lot of work in various other contexts, not only computing when x and y are present in one location, but also streaming, communication complexity, and so on. But from the perspective of algorithms, the best algorithms we have are still essentially quadratic, except for these log factors. And people really tried hard to improve that.
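To make the O(kn) bound concrete, here is a small sketch (my own illustration of the standard banded view of the dynamic program): only table cells within k of the main diagonal can hold a value at most k, so it suffices to fill that band.

```python
def edit_distance_at_most(x: str, y: str, k: int):
    """Banded dynamic program in the spirit of the O(kn) bound: only
    cells within k of the main diagonal can hold a value <= k, so we
    fill just that band. Returns the edit distance if it is <= k,
    otherwise None."""
    n, m = len(x), len(y)
    if abs(n - m) > k:
        return None                       # lengths alone force > k edits
    INF = k + 1                           # any value above k acts as infinity
    row = {j: j for j in range(min(m, k) + 1)}
    for i in range(1, n + 1):
        new = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                new[j] = i
                continue
            sub = 0 if x[i - 1] == y[j - 1] else 1
            new[j] = min(row.get(j, INF) + 1,      # delete x[i-1]
                         new.get(j - 1, INF) + 1,  # insert y[j-1]
                         row.get(j - 1, INF) + sub)
        row = new
    d = row.get(m, INF)
    return d if d <= k else None
```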
But it turns out, like three years back, that Backurs and Indyk in their STOC paper showed you actually cannot do much better, unless you improve SAT algorithms. What they showed is that if you have an algorithm for edit distance which runs in sub-quadratic time, in time n^{2 - epsilon}, then you get an algorithm for SAT which runs in time 2^{(1 - delta)n}. And this is much faster than what we know of: for general SAT, our algorithms have running time pretty much close to 2^n, and such an algorithm would contradict the so-called Strong Exponential Time Hypothesis, which many people believe. So this is a strong indication that the algorithms we have seen on the previous slide, the quadratic ones, are basically the best possible, and we don't know how to do any faster. After this breakthrough result of Backurs and Indyk, there was a sequence of works showing that even if you improve on this running time n^2 just slightly, if you just shave some log factors, then you get some circuit lower bounds: lower bounds we believe are true but don't know how to prove. So improving this algorithm would imply proving something we don't know how to prove; it's probably going to be a hard problem. So this is the state of the art for computing edit distance exactly. And because people were stuck with these quadratic algorithms for quite a long time, they asked: what about approximating edit distance, how fast can you approximate it? There are a lot of results in this regard too. The n + k^2 algorithm I mentioned can readily be turned into an approximation algorithm which gives you a square root of n approximation in linear time: you just run the algorithm, and if it runs for too long, the distance is larger than square root of n; otherwise you know the distance exactly. So that gives a square root of n approximation in linear time. After that there were some improvements to this method: there was an n^{3/7} approximation running in basically quasi-linear time by Bar-Yossef, Jayram, Krauthgamer and Kumar in '04, and then a further improvement to n^{1/3}. All of these have a fairly sizable approximation factor; it's polynomial. It took until 2009, when Andoni and Onak showed an algorithm which gets a better approximation: a 2^{sqrt(log n)} approximation running in almost linear time. So they got almost linear time with a sub-polynomial approximation factor. And then in 2010, Andoni, Krauthgamer and Onak improved this to even a polylogarithmic approximation.
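Going back to the square root of n trick mentioned a moment ago, it can be sketched in a few lines; `edit_distance_at_most` is the banded routine from the earlier sketch, standing in here for the O(n + k^2) algorithm:

```python
import math

def sqrt_n_approximation(x: str, y: str) -> int:
    """Run an exact bounded-distance routine with budget k ~ sqrt(n).
    If it succeeds we know the distance exactly; if it gives up, the
    distance exceeds sqrt(n), so reporting n is off by at most a
    sqrt(n) factor."""
    n = max(len(x), len(y), 1)
    k = math.isqrt(n)
    d = edit_distance_at_most(x, y, k)
    return d if d is not None else n
```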
Andoni, Krauthgamer and Onak got an algorithm which runs in almost linear time, n^{1 + epsilon}, and gets a (log n)^{O(1/epsilon)} approximation. So if you push epsilon toward zero, you get a worse and worse approximation, but this was very good, because up until then the approximation factors were really huge. Moreover, they also showed that a large class of techniques is not going to work: certain sampling techniques, where you sub-sample the strings and then compare them, cannot achieve a better approximation factor than logarithmic. So they exhibit a very specific class of algorithms that cannot beat a logarithmic factor. This was the state of the art until this year. This year at SODA, Boroujeni et al. presented a quantum algorithm for approximating edit distance. That algorithm runs in time n^{1.708} or so, and it gives a constant factor approximation. This was a major achievement, because up until then we didn't know how to do that; the only drawback was that there was a quantum part to the algorithm. Let me say that our algorithm shares many similarities with their algorithm; we just do some things differently, and the techniques overall are quite similar. So this is the positive side. Just one quick question: this 1.708 is a tiny bit smaller than the number you gave us earlier. Is that okay? Well, we can actually do better than the number I gave you, we can get like n^{1.6}, so I think we can actually be slightly faster than they are, classically. They stop here; we can go a little bit further. There is also a trade-off between the approximation factor and the exponent: there is an iterative procedure to improve the running time, and you lose on the approximation, because you apply the related algorithm multiple times and the constant factor then gets large. So this is regarding the upper bounds. For lower bounds, there is a result by Abboud and Backurs which showed that if you could get a 1 + 1/polylog approximation in sub-quadratic time, then again you would prove some lower bounds which we don't know how to prove. So this again indicates that pushing the algorithms beyond a constant factor may be hard. We don't know where exactly the hardness starts and where it stops. So this is what was known, and let me now go back to our result and tell you a little bit about the main ideas in our algorithm. I would like to give you a rough sketch and a high-level overview. The basic thing we do is to look at it as a verification problem.
So we look at what we could call the gap edit distance problem. In the gap edit distance problem, you are given two strings x and y and a parameter theta, and you just want to distinguish whether the edit distance is below theta n, or whether it's bigger than theta n times some constant c > 1. You don't care about inputs x, y that fall in the gray area in between: your algorithm doesn't have to be correct on those instances. Once you can solve gap edit distance, it's fairly easy, by searching over thresholds, to find the right one: if you have a fast algorithm for gap edit distance, you can quickly find a good approximation of the edit distance just by trying various thresholds theta. What we have is an algorithm which solves the gap problem in time roughly theta^{4/7} n^{12/7}. Here theta is, well, I will talk about relative edit distance throughout: I always normalize by n, or if I look at shorter strings, by the length of those strings, so in our algorithm theta is a number below one. So there's a fairly nice dependence on the parameter theta too, similar to the previous algorithms we've seen, which do better if the edit distance is small. And we get a (3 + epsilon) approximation in this time. So what is the main idea behind the algorithm? It's actually very simple. You get these two strings x and y, and you want to verify whether they are close, whether they have edit distance at most theta n, or whether they are far away. Here is our approach. Imagine we want to verify that they are close, that they have small edit distance below theta n. If they have small edit distance, we would like to find some good way to transform one into the other: a good matching between x and y, so that whatever is not matched counts toward the edit distance. How can we do it? The very simple idea is the following. I have here my x, and in x I pick some interval I of some nice length, call it l; you should think of l as some small power of n, like n^delta. And here's what I'm going to do: I'm going to find a good match for I in the string y. Say I matches over here in y, at some interval J. If that's the case, then it's very likely that the neighborhood of I matches the neighborhood of J. So I can expand the interval I into a bigger one, and I can just compute the edit distance between this whole large interval containing I and the large interval containing this good match J. And it turns out this is cheaper than searching for a match of the whole large interval in y.
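As an aside, the reduction from approximation to the gap problem described above can be made concrete. This is a sketch, not the paper's procedure; `gap_solver` is a hypothetical black box with exactly the promise just stated:

```python
def approximate_via_gap(x: str, y: str, gap_solver, c: float) -> float:
    """gap_solver(x, y, theta) must return True whenever ed(x, y) <= theta*n
    and False whenever ed(x, y) > c*theta*n; in the gray zone in between it
    may answer either way. Halving theta until it answers 'far' brackets
    the distance to within a factor of about 2*c."""
    n = max(len(x), len(y), 1)
    theta = 1.0                            # ed <= n, so 'close' holds trivially
    while theta > 1.0 / n:
        if not gap_solver(x, y, theta / 2):
            # 'far' at theta/2 means ed > theta*n/2, while 'close' at theta
            # gave ed <= c*theta*n, so this is a 2c-approximation.
            return c * theta * n
        theta /= 2.0
    return theta * n                       # the strings are essentially equal
```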
Searching for the whole large interval directly in y would be much more costly than first taking this small anchor I and then expanding it to the surrounding area. And you can repeat this process: I picked some I at random and expanded a little bit, and now I repeat. You pick another interval, you find a good match, you expand; you find a good match, you expand. Once you cover the whole of x, you can look at what the matching down here looks like and try to extract a good overall match. How do you do that? Well, you can discard the pieces that don't contribute to the match. Over here there is some overlap, so you also have to add that to the edit distance; here you are missing some pieces, so you add all of that to the edit distance too. Basically, you can bound the edit distance by the sum of the edit distances of the small pieces plus this extra error term. Now, the reason this works fairly decently is that if the strings x and y are close to each other, say they have edit distance theta n, then a random interval I at the top is going to be matched to some interval J in y with edit distance like 2 theta l: it should be proportional to the length of the interval I. If the original distance was theta n, you expect the edit distance of these two intervals I and J to be about 2 theta l. This is just by averaging, because the edit operations have to occur somewhere, and on average you expect a proportional number of them to land in an interval of the right size. So by Markov, with probability at least a half you hit an interval whose match is fairly close, and you can expand on that. This works really great if the strings were random, say if x were random and y were obtained from x by some edit operations. But if x and y are not random, then there is a problem: when you pick an I, there may be multiple matches for that I. You don't know which one to expand, and maybe you have to expand all of them. So that's what we are going to do. Let me try to explain this more carefully on the next slide. I'll draw the whole picture slightly differently: now I'm going to draw it in the plane. At the bottom I align my string x, and vertically I put y. And I'm going to break x into chunks of size n^{3/7}; I picked this n^{3/7} because it works out nicely for us. My goal will be to find, for each of these intervals, for each of the chunks, a good match in y. How do I do it? The way I explained: say I'm trying to find the match for the second chunk; I pick inside it an interval I of length n^{1/7}.
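Here is a rough sketch of one round of the anchor-and-extend idea, reusing the routines from the earlier sketches. The centered window placement and the coarse scan of y in steps of l are simplifications of what the actual algorithm does:

```python
import random

def anchor_and_extend(x: str, y: str, l: int, L: int, budget: int):
    """One round of the anchor idea, schematically. Pick a random anchor
    interval I of small length l in x, scan y for windows that match I
    within `budget` edits (think 2*theta*l), then extend each surviving
    match diagonally to an L-window and certify that window with one
    exact computation. Cheaper than matching the L-window against all
    of y directly."""
    n = len(x)
    start = random.randrange(max(1, n - l + 1))
    anchor = x[start:start + l]
    certified = []
    for j in range(0, max(1, len(y) - l + 1), l):      # coarse scan of y
        if edit_distance_at_most(anchor, y[j:j + l], budget) is None:
            continue
        # Diagonal extension: center the small match inside an L-window.
        xi = max(0, start - (L - l) // 2)
        yj = max(0, j - (L - l) // 2)
        cost = edit_distance(x[xi:xi + L], y[yj:yj + L])
        certified.append((xi, yj, L, cost))
    return certified
```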
So I find good matches for that anchor string. If the matches are not too many, and there's a threshold parameter d here, which we set to n^{2/7}, so if you didn't get too many matches, then we expand each match to the full-size interval, to the whole chunk. This is the basic idea, and it's the algorithm we've seen before: I pick a small I, I find the matches, and if there are not too many of them, I expand on them. What this picture signifies is that I matched interval I in x to interval J1 in y, or to J2 in y, and I annotated each box: a box represents that there is a good match, a match of relative cost at most 2 theta, which is exactly what we were looking for before. We call such an annotated box a certified box: it is annotated with the actual edit distance. Once we find these small matches, we expand them to the full chunk, and you do it proportionally, along the diagonal: the small box should sit on the diagonal of the big box. We call this the diagonal extension, because you extend the small box along the diagonal, and that's what we compute the edit distance for. So our algorithm has two parts: first you find these small matches, and then you expand them into bigger ones. What is this going to cost? Imagine we always land in the situation where there are not too many matches, at most d small matches for each chunk. Let's calculate the work we have to do. First, if you are looking for the matches of the small interval I, the amount of work to find J1 and J2 is proportional to the area above I. That's why this picture is so convenient to look at: the work is the area of the strip above I. Then computing the edit distance for the large boxes is again proportional to the area of those boxes. So the red part is the amount of work to process this chunk of length n^{3/7}. What does it cost? We have to process the strip, and the strip has one side n^{1/7} and height n, so it costs us n^{8/7}. And we picked the parameters so that they match up nicely: if d is n^{2/7}, then the cost of the big boxes, each of which costs (n^{3/7})^2 because the naive edit distance algorithm takes that much time on such a box, multiplied by d, again comes to n^{8/7}. So both operations, finding the good matches and then expanding, cost n^{8/7} in this case. And what's the saving? The saving is a factor of n^{2/7}.
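The exponent bookkeeping can be checked mechanically; this little sanity check (mine, not the speaker's) redoes the arithmetic with exact fractions:

```python
from fractions import Fraction as F

# Exponent bookkeeping for one chunk (everything is a power of n):
l, d, chunk = F(1, 7), F(2, 7), F(3, 7)  # anchor length, match threshold, chunk length
strip_cost  = l + 1                      # scan the n-by-n^{1/7} strip above I
extend_cost = d + 2 * chunk              # <= n^{2/7} boxes, each (n^{3/7})^2 naive DP
assert strip_cost == extend_cost == F(8, 7)

naive_cost = chunk + 1                   # matching the whole chunk against all of y
saving     = naive_cost - strip_cost     # n^{2/7} saved per chunk
assert saving == F(2, 7)
assert 2 - saving == F(12, 7)            # n^2 total work drops to n^{12/7}
```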
Why n^{2/7}? Because if you just used the naive algorithm and matched the whole chunk against the whole height of y, you would pay n^{3/7} times n, which is n^{10/7}. So on this particular chunk we are faster by a factor of n^{2/7}, and if you do this on each of the chunks, then overall you are faster by this factor n^{2/7}: that's how you get the bound n^{12/7}. So this is the ideal situation, where there are not too many good matches for the small intervals. But that may not happen. Here's the other case: imagine we go into a chunk and pick I there, and now there are a lot of matches, a lot of candidate J's which could perhaps be matched to I. That's a problem, because we cannot expand all of them; that would be too costly. So we do something else, the following. The fact that we found a lot of matches for I reveals a lot of information about y, because it means all these pieces are essentially the same: if they are all close to I, they are close to each other. So you learn a lot of information. How do we process this information, to get as much mileage out of it as possible? What we do is look for intervals in x which are close to I: say I1, I2, I3 and I4 also look like I. If those look like I, then their strips look like the strip of I, so starting from I, I can immediately say that these strips all look basically the same. This follows by the triangle inequality: if I3 is close to I and J2 is close to I, then J2 and I3 are also very much the same. We do this for some closeness parameter epsilon; before, this epsilon was 2 theta, but we actually have to do it for various epsilon, and I will expand on that later. Let's look at this particular situation. Once I find an interval I which has more matches than the threshold d = n^{2/7}, I find all these I1, I2, I3, I4 which are near I, up to distance 2 epsilon, so I3 can be up to distance 2 epsilon from I. And I revisit this column, the strip above I, and take all matches, all J's, up to distance 3 epsilon; that can only increase the number of boxes I see, it doesn't decrease it. If I do it like this, then first I have a guarantee that each box over here corresponds to a match of edit distance at most 5 epsilon, just by the triangle inequality: this J is at distance at most 3 epsilon from I, and this I4 is at distance at most 2 epsilon from I, so together they are at distance at most 5 epsilon.
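A sketch of the dense-case marking follows. Here `matcher(a, b, t)` is a hypothetical predicate deciding whether two length-l strings are within relative edit distance t, and the bookkeeping that crosses out already-used J's is omitted:

```python
def dense_case_mark(x: str, y: str, i_start: int, l: int, eps: float, matcher):
    """Dense case: the anchor I has many matches in y, so exploit the
    triangle inequality instead of extending every match. Every interval
    I' of x within 2*eps of I, paired with every J of y within 3*eps of I,
    is certified at relative distance <= 5*eps without ever comparing
    I' and J directly."""
    I = x[i_start:i_start + l]
    near_x = [i for i in range(0, len(x) - l + 1, l)
              if matcher(x[i:i + l], I, 2 * eps)]
    near_y = [j for j in range(0, len(y) - l + 1, l)
              if matcher(y[j:j + l], I, 3 * eps)]
    # ed(I', J) <= ed(I', I) + ed(I, J) <= (2 + 3) * eps * l.
    return [(i, j, 5 * eps) for i in near_x for j in near_y]
```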
Moreover, with the parameters chosen like this, it also means that for each I1, the boxes I mark as nearby definitely contain everything at distance at most epsilon from I1, because of the choice of these parameters. So for this I1, within this strip, I mark everything I should have marked. This shows that very cheaply, just by processing this one strip, I get a quadratic amount of information. That gives you a speedup. And again, what is the speedup going to be? Let's calculate the cost. For processing this strip, I pay n times the width of the strip, so n times the size of I. And how many times can this repeat? One can show it cannot repeat many times: at most (n/l)/d times, where n/l is the number of small strips in x, and you gain this factor d; without the factor d you would just get the trivial bound. What the factor d means is that whenever we hit this dense case, we cross out some J's which can never be used again: each I is associated with the intervals nearby to it, and those will never appear in this situation again. You can arrange this by the choice of the parameters 2 epsilon and 3 epsilon. So each time you make progress, you save a factor of d, and if you put these numbers together, in total, over all repetitions of this situation, you pay n^{12/7}. That's again the bound on the running time. So how do you piece this together? Let me explain. First, you treat the dense columns. You go left to right in this matrix, and you test the first strip, by sampling: is it dense? If it is, you do what I explained, so you spread the information across the matrix. Then you take the next strip and check whether it is dense, and if it is, you expand again, and then test all the other strips; say at some point there are no more dense strips. Let me just remark that testing by sampling, the time you need to test each individual narrow strip, turns out to cost again exactly within the bound we are shooting for. There is a trade-off between various parameters and constraints, and this is one of the constraints: we can afford to test every single strip for density. Once we have processed the dense strips, we have this system of boxes labeled by their edit distance; for example, this interval matches that interval at some specific cost. That's the output of the first stage. In the second stage, we process the big chunks.
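Putting the two cases together, the first stage might be organized like the sketch below. This is schematic: in the real algorithm density is estimated by sampling rather than computed exactly, and J's used up by the dense case are removed from further consideration.

```python
def stage_one(x: str, y: str, l: int, d: int, eps: float, matcher):
    """First stage, schematically: sweep the width-l strips of x left to
    right. A strip whose anchor has more than d matches in y is 'dense'
    and is handled by the triangle-inequality marking above; sparse
    strips are left for the diagonal extension of the second stage."""
    boxes, sparse = [], []
    for i in range(0, len(x) - l + 1, l):
        anchor = x[i:i + l]
        hits = [j for j in range(0, len(y) - l + 1, l)
                if matcher(y[j:j + l], anchor, eps)]
        if len(hits) > d:
            boxes += dense_case_mark(x, y, i, l, eps, matcher)
        else:
            sparse.append((i, hits))
    return boxes, sparse
```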
The big chunks are processed as follows: in each big chunk, if the chunk contains some sparse intervals, we sample one of them at random and expand it diagonally, the way I explained in the sparse case. So the first three big chunks here are covered; let's look at the last one. In the last one there are some sparse columns, so we pick one of them at random, we get this match, and we expand it into the bigger chunk. This is the information we collect, and we then process it in the next stage. We do this for various closeness parameters: we have the closeness parameter epsilon, and we range epsilon between theta and 1. So we collect a lot of information about boxes, labeled by various edit distances, and that is what we feed into the second stage of the algorithm. Since we are looking for an approximation, we can afford to look only at powers of two, so we can choose epsilon to be 2^{-i}; that's what we do. So to review: the first stage of the algorithm serves to collect information about matches in the matrix, in the hope that this will let us approximate a good match between x and y. Let's see how we can do that. Say this is the true match, the best match between x and y. What do I mean by this curve? It means that if I pick this point here, then it is matched over there: the curve is the graph of which point of x is matched to which point of y. So this is how the curve goes. Our goal in the second stage, when we are trying to recover a good match, is to find a good cover of this path by the matches we found. We found various matches, and our goal is to assemble these boxes into an overall match with a good approximation factor. That's the second stage of the algorithm: it combines the boxes together, and it turns out to be a fairly simple algorithm, basically a kind of dynamic programming. You do one sweep over the whole set of boxes from left to right and compute the best match; I will tell you more details in a second. The running time of the second stage depends on the number of boxes we gathered, and what I have swept under the rug so far is how many boxes we actually deal with. Since we are looking for an approximation, we don't have to consider all possible boxes. For example, over here I have a box of size W by W: I don't have to try all possible vertical placements in y, I can consider only placements on a grid with spacing theta W. So I take only boxes in this strip aligned to multiples of theta W, which reduces the number of boxes by a factor of theta W. And to reduce it even more, we use the idea of Ukkonen, the one which gave the O(kn) algorithm in the first place.
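A toy version of the second-stage sweep is sketched below. It pays uncovered stretches in full, which is enough for a constant-factor bound; it assumes vertically non-overlapping boxes (compare the shrinking step described next) and is quadratic in the number of boxes, where the real algorithm is near-linear:

```python
def combine_boxes(n: int, boxes) -> int:
    """Chain certified boxes into one global alignment, left to right.
    A box (x0, y0, w, cost) says x[x0:x0+w] matches y[y0:y0+w] with edit
    cost `cost`. Gaps between chained boxes are paid in full (delete the
    uncovered part of x, insert the uncovered part of y), which upper
    bounds the true distance. Assumes len(x) == len(y) == n."""
    ends = [(0, 0, 0)]                 # (x reached, y reached, cost so far)
    best = 2 * n                       # trivial alignment: delete x, insert y
    for x0, y0, w, cost in sorted(boxes):
        here = cost + min(c + (x0 - xe) + (y0 - ye)
                          for xe, ye, c in ends if xe <= x0 and ye <= y0)
        ends.append((x0 + w, y0 + w, here))
        best = min(best, here + (n - x0 - w) + (n - y0 - w))
    return best
```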
Ukkonen's observation is that you don't have to look at the whole matrix, at every possible match between X and Y: for example, if you look at this interval, and the distance is at most theta n, then the interval to which it matches in Y cannot be shifted by more than theta n. So we only look at shifts of this interval of width W in steps of theta W. If you do the calculation, this gives you that you need about (n/W)^2 boxes in total, and since we set W to n^{1/7}, we end up with n^{12/7} boxes in total, roughly matching the first stage. This is up to polylogarithmic factors; you actually get slightly more, but this is roughly what we get. There is one more issue which I didn't discuss: how can you combine the boxes, given that the boxes you want to combine may overlap vertically? That means there is a match between I and this interval, and we would like to combine it with a match for the next interval, but there is a vertical overlap here. What we can do, and what we actually do, is shrink each box vertically, proportionally to its edit distance: if this box is labeled epsilon W, we shrink it vertically by epsilon W. So we cross out the blue parts and just keep the red part, and we do this for each box. Because we are already losing epsilon W on such a box anyway, we can do this without losing much. As we are shooting for a constant factor approximation, this is fine, and we do it with every box. Once we do this, we can look for one continuous, well, not quite continuous, there may be some jumps, but a sequence of non-overlapping, disjoint boxes which are aligned. Here is how we proceed: we go left to right, and by a very simple, almost linear-time algorithm we find the best match given these boxes. That's pretty much the whole algorithm. Eventually the algorithm produces boxes which approximate the best path in the grid, and if the match we found is not too much bigger than theta n, we say the strings are close; if it's far away from theta n, we say they are far. So let me recap what our result is; here is a more precise statement. We can iterate this construction: the scheme, if you look at the algorithm we use, basically breaks the problem into computing edit distance on smaller boxes. But once we have an algorithm which runs faster on any box, we can apply it to the smaller boxes on which we had to use the exact edit distance computation. So we can recurse: once we get a fast algorithm, we can plug it back into the algorithm and get an even faster algorithm, and so on, repeatedly. If you do it twice, you get an n^{1.64} algorithm which works for the full range of edit distance. That is, I think, better than the quantum algorithm we have seen.
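The counting behind the (n/W)^2 bound can be seen in a sketch that enumerates the candidate placements: each of the n/W chunks of x admits about 2 theta n / (theta W) = 2n/W vertical positions.

```python
def candidate_boxes(n: int, w: int, theta: float):
    """Enumerate the (x-window, y-window) pairs that can matter. If
    ed(x, y) <= theta*n, a width-w window of x matches a window of y
    displaced by at most ~theta*n (Ukkonen's observation), and for a
    constant-factor answer a vertical grid of resolution theta*w is
    enough. This yields O((n/w)^2) candidate boxes."""
    step = max(1, int(theta * w))      # vertical grid resolution
    shift = int(theta * n)             # maximal useful displacement
    out = []
    for x0 in range(0, n - w + 1, w):  # disjoint chunks of x
        lo, hi = max(0, x0 - shift), min(n - w, x0 + shift)
        for y0 in range(lo, hi + 1, step):
            out.append((x0, y0, w))
    return out
```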
Coming back to the iteration: you can iterate even more, but then the problem is that the quality of the algorithm deteriorates as theta gets smaller. As theta gets smaller you run into some technical issues, so we can get the running time down to n^{1.618}, but only for edit distance which is large: if the distance is close to n, we can do even slightly faster than this. Let me also comment on the quantum algorithm, since I touched on it. The quantum algorithm also uses this dense and sparse case distinction, but they handle the sparse case differently: they use Grover search to resolve the sparse situation, whereas we use the expansion you have seen. So these are our results, and let me finish, or close, with open problems. There is a bunch of obvious ones. One of them is: what is the best possible approximation factor? We get a (3 + epsilon) approximation in time n^{1.714}, and we can get somewhat faster algorithms, but we pay in the approximation factor. And it seems the technique is limited to an approximation factor of 2: it's not obvious how to go beyond, say, a 2-approximation, just because we are using the triangle inequality. We approximate the edit distances via the triangle inequality, and that immediately costs you a factor of 2, and we don't know how to go below 2. Well, we don't know how to go below 3, but I think the real bottleneck is the factor 2. The other question is whether you can get a faster algorithm, and there is a partial answer to that: Alexandr recently communicated to us that he has improved the algorithm and gets n^{3/2 + epsilon}, again with constant factor approximation. So the answer is yes, you can. And with Mike we believe you can do even better: we think we can do n^{1 + epsilon}, at least for the high range. We don't know how to extend it to the full range of edit distance, but we believe we can get almost linear time at least for edit distance close to n. This question, how to do it for small edit distance, brings up a very nice open problem, and that's the following question: can you reduce the low edit distance case to the high edit distance case? What do I mean by that? Imagine you have two strings of length n with a really small edit distance k. Can you somehow compress these two strings so that the edit distance is preserved, but the strings now have length roughly proportional to k? We don't know how to do that, and it's not a given that it can be done, but it would be really nice if it could, because then it would be very easy to extend our result to a full-range edit distance algorithm, provided such a reduction runs in substantially sub-quadratic time. So that's a very nice open problem I would like to see solved, but I don't know how to solve it. So let me finish here. Thanks. So we have time for questions, so if anyone wants to speak up or type a question, go for it.
I'll pause for a second so that you have time to collect your thoughts and get the mic. Maybe while we pause, let me ask one, just to clarify one of the last things you said: in terms of getting an approximation factor anything lower than three, does it jump to n^2 in the runtime, is that correct? As far as we know, yes, at this point. Another question: can you say again why, in the low edit distance regime, because you're looking for constant approximation factors, so you're more demanding of yourself, what's the issue? Yeah, so why does the algorithm work, why is it easier for high edit distance and not for low edit distance: it depends on the precision of the algorithm. As you go down with the edit distance, you have to increase the precision of the approximation by boxes; that's one of the things. So you basically have to find more granular boxes. Well, the number of boxes stays the same, but it kind of blows up somewhere else, so there is a trade-off between a lot of parameters, and it's not clear how to set them optimally; and even if you set them optimally, there is this 1/theta factor coming into the running time somewhere, and as you go down with the edit distance it grows. It's funny, because for really small edit distance you do have these linear time algorithms, right? Right, so the challenge is, say, edit distance like n^{3/4}. Okay. Yeah, we don't know how to handle that one actually. Okay, maybe one last question regarding the quantum algorithm: do you know what that group is doing? I mean, can they take your ideas, go one more pass over it, and say, okay, now that you have these extra ideas, we can speed those up as well using quantum tools? That's a good question, I don't know. I guess it may not be easy, or maybe you can combine the ideas; I don't know. Do we have questions from the audience? If not, I'll take it that there are no more. Okay, I think there's no question, so let me thank Michal Koucký for the talk once more. Thanks everyone for joining us in the hangout, and also the people who are watching on YouTube. A couple of weeks from now we'll have Urmila Mahadev give the next TCS+ talk. Okay, so I'll take the thing offline; if anyone wants to hang out for a little bit more, you can just stick around. Bye.