 Well, thank you all for coming to the CSE Distinguished Lecture and the keynote address for the data mining workshop. Our speaker today, Professor Michael Jordan, I first met him in 1991 when he was a professor of brain and cognitive science at MIT after having done a PhD in cognitive science at UCSD. He's now the Pihong Chen Distinguished Professor of Electrical Engineering and Computer Science and of statistics at UC Berkeley. It's been quite an amazing journey from psychology and cognitive science to hardcore statistics and mathematics and probability and computation. And in fact, if you've been part of the machine learning and statistics community for any length of time, you've witnessed this journey because the evolution of Michael's ideas and contributions and the broad teams of his work as they've evolved have in fact been the arc, have determined the arc of the statistical machine learning community. I think it's fair to say that the statistical machine learning community has gone where Michael has led it. In fact, this has been widely recognized by the science and engineering and the academic community. So Michael's now a, has been a member of the National Academy of Engineering, National Academy of Sciences, the American Academy of Arts and Sciences, fellow of the AAAI, ACM, IEEE and a host of alphabet soup of acronyms that I don't know or that I can't remember. And you know, so this is, but in addition to all the great work that he's been doing, I want to highlight one aspect of Michael that has personally impacted me very positively, which is the great mentorship he's had over the years of a large number of graduate students in post-docs. In fact, my first memory of Michael is I had just given this talk as a 30 or a second year graduate student in front of 500 people at a large machine learning conference. And I encountered Michael walking down from the podium and he said, you know, great job or great talk or something to that effect. And I remember being on a high for four days or five days after that. And you know, so Michael's graduated about, I think, 50 to 60 graduate students and post-docs over the last two decades and we've all ended up in really great places. And so many thanks to Michael on that. And please join me in welcoming Michael, Professor Michael Jordan. Thanks very much for the nice introduction. Really lovely. I'm proud of many of my past students in post-docs and particularly of the tender who's done really great work over many years. He's one of the people whose papers I continue to read, which to me is kind of the highest compliment of one academic to another. All right, so I'm pleased to be here. I assume it's a pretty heterogeneous audience, so I'll do my best to bring up some themes that many of you may be interested in. Somebody that'll be inscrutable to some of you, but that's okay. Hopefully I'll just spend a little bit of time on that. The title is kind of meaningless. The first few words are just trying to give an umbrella of the things I mentioned today and really bringing statistics and computation together. And then I did promise to say something about the bootstrap, which I will do in about matrix completion. But I saw on my abstract out there, I didn't promise to talk about Stein's method, I promise to talk about phylogenetic analysis instead. So I will do that. So there is another line of work that I'm really excited about. It's Stein's method. If you know what that is and are interested, go see my website. Okay, so let's get going. 
So I live close to Silicon Valley and work in statistics and often have gotten all excited about some new methodology and I've gone down and given talks in the South Bay and say Google or whatever. And they'll say, great, you have this new methodology, but I have a petabyte of data. Will your methodology run on my data? And I him and ha basically say, no. And then I say something like, well, if you were to subsample your data, throw away most of your data. Then I can run on the remaining part of the data and probably do well. And I say, well, all the guarantees though I could give before are gone now. And so they'd say, well, that doesn't sound like a very impressive field you're in. And then they say things like, I also need to have a time constraint. I need a fast answer. Can you, and I have a lot of data and you like data, right? Can you give me a guarantee you'll give me a good answer? And I'll say, no, I can't do that at all. That's really beyond what we can promise. So then they say, well, what field do you in? This is really not, it's not like a reasonable engineering field. So yeah, that kind of conversations happen several times. And after enough of those conversations, you get pretty unhappy and dissatisfied and I want to work on it, okay? So there are a lot of pretty deep intellectual issues here, really statistical and computational together. So let me just sort of say a little bit about this. First of all, let's be clear on the fact that just because we're in the asymptotic regime, that means there are no more problems left. So in classical statistical education, you were told there's a number of data points gets large, everything gets simple, the airbars go to nothing, and there's no real statistical issues left. It's just computation at that point. And that's definitely not the truth, the case. So let's be clear on a couple of reasons. The first is really just statistical. If you're a statistician, this is meat potatoes for the rest of you. It may not be a little more subtle. So I got, think of my data as rows and columns. Rows, like a database person. Rows are my entities and columns are my features or covariates or descriptors or attributes. If I have a thousand rows, I usually don't collect too many columns. If I have rows or people, I may be interested in height, weight, income, a few things that describe the things I'm interested in and the consequences I'd like to pull out of the database or the queries I'm interested in answering. If I have three billion rows, so I've collected data on most of the people in the world, I'm gonna be interested in a lot more descriptors. I'm gonna be interested in what kind of books do you like to read? What meal you had yesterday? What city you live in? What's your genome? All kinds of descriptors, so I can make predictions about you at a very fine level of granularity. Like, will you click on my ad or will you buy life insurance or can I provide some service that you're actually really interested in? I'm not just gonna show you an ad. And that's where the world is going. But there's a gotcha here, it's a real problem, which is that if the number of rows is growing up to three billion, the number of columns is probably growing linear in that. I probably need a linear number of descriptors. But I'm interested in my hypothesis space is all the combinations of all the columns. 
I'm interested, you live in Beijing and your last book you read was Jonathan Franzen and you, like punk rock music, what's the probability you want life insurance or whatever. And I can make up facetious ones, but a lot of them are really interesting. Like, you have this mark here in genome and you smoke and so on. What's the probability you're gonna have a bad disease and so on. So you're interested in an exponential number of hypotheses. So if I get more and more data, I'm not, completely opposite of the asymptotic, I'm always in a regime where I have way too little data for the ambitions I have for the data. All right, so if you start looking for data, looking for patterns in this exponential space, you're gonna find all kinds of patterns that look perfectly true on the data, which are just totally bogus. Just by chance alone they occurred, okay? That's a theorem. And it's what happens in real life. So you start working with companies that have this kind of data, it's just start rolling out simple methodology. They get all kinds of things which are just total white noise and false positives just swamp them and they turn off the machine. So that's actually what it really happens. It's not they have computational problems and they can't scale to the large data. It's that they get false positives and it kills them. All right, the second part of the problem is more of the classical computer science one, which is that an inference algorithm runs in some amount of time, n cubed, p cubed, n log n, whatever. And if I have 10,000 data points and I have an hour and I wanna make the decision, that's probably pretty good. If I've got now a billion data points, I still wanna make a decision in an hour, then that algorithm is not gonna run in an hour. So then I have to run an approximation in the algorithm, maybe. But now my inferential guarantees are less good or I have to sub-sample, but now again my inferential guarantees maybe are gone because I took way too much data. And so you gave me more data and I may have given you back an answer, which is worse. All right, and that seems paradoxical. That doesn't seem right to me as an engineering science. My resource, classical computer science is the game of what are my resources and how do I play them in an algorithm so I get increase of performance as a function of my resources, time and space and energy and so on. So I'm giving you a resource now, which is data. And I'm giving you more and more of your resource and your performance is getting worse and it gets better for a while, then it gets worse and worse and worse. Okay, that's where that is the state we're in. And so I think that's just totally wrong. We're just totally immature in this field. All right, so here's one way to state a goal. I don't think we could solve this, but given an inferential goal and a fixed confidential budget, say an hour, provide a guarantee that the quality of inference will increase monotonically as data accrue without bound. So kind of the both the statisticians perspective of talking about risks and guarantees and the computer science statement of scalability. And moreover, I want to do this without bound. I don't have to face this problem again every ten years. I don't want to think principles that allow me to scale since this one. So we're far from there. Okay, so I can't solve that problem. I'm kind of working and thinking about that to some degree without much, too much success. There are aspects of the problem you can formalize and make progress on. 
But the overall problem I think is still really unposed and unsolved. So what I'm gonna talk about today is a different kind of more bottom up way of thinking. Let's take a particular computational principle, particular divide and conquer. Great principle. Lots of algorithms are divide and conquer algorithms and bring it more fully into contact with statistical inference. And I kind of like doing this and for kind of one reason is that statistical inference is somehow about aggregating things. When you pull things together, laws of large numbers and central limiters start to come into play and you can make inferences. You can talk about quantify things and error bars go get smaller. And somehow divide and conquer is the opposite of trying to pull things apart and so it fights in some way. But I believe you have to do divide and conquer to scale or something like that, so we need to face this issue. All right, so here's the first little piece. There's gonna be three little pieces here. And this is perhaps the most significant one for the computer science audience. So I'm gonna tell you a little about the bootstrap and why it can't be used on big data. And why there's another procedure, which I'll describe, which can be used on big data and can give you bootstrap interface intervals on big data. How many of you know what the bootstrap is already? All right, so that's about a third, it's pretty good. So if you don't know, it's time to learn. It's a really great idea, and if you're a statistician, you know what it is. And a lot of statisticians don't know what it is, I think that's a big problem. Okay, so here's the idea. The problem, the bootstrap is not just about making predictions and inferences and running an algorithm and get down on a number. It's about putting a confidence interval around it, assessing the quality of the inference, how good an inference it is. So if you ran some algorithm and it gave the number of 10.5 out, and I'm gonna make a life determining decision if it's bigger than 10 or not. I'd really like to know what the error bar is before I really feel comfortable about making that decision. It's a huge error bar, I'm not so comfortable with the tiny one, I'm comfortable. So I gotta get error bars on things. Everything, not just occasionally, I gotta get them on every inference I ever make. So the idea is you observe data, you calculate some estimate, let's call it a parameter estimate, but it can be a prediction of any kind. It could be a function, it doesn't have to be a parametric quantity. Let's call it a parameter estimate for now, it's a functional on the data. And I'm not interested in the parameter so much as I'm interested in the quality of the estimate. So let's say a confidence region. Okay, so the goal here is gonna be to get a procedure that estimates estimator quality. We wanna talk about how good an estimator is after you've seen data. And we want it to be accurate, meaning that this is the classical statistician's goal, which is that I told you that this confidence interval covers the truth 95% of the time. Well, if I do it a bunch of much at times, 95% of the time, it actually covered the truth. That's what it means to be accurate, and that's hard to achieve. That's what the field statistics has done pretty successfully. We want it to be automatic, meaning that for every brand new problem, rethink the problem and build new principles and all that. I want this thing just to work. And again, statisticians have done that. 
Lots and lots of situations or some procedures that work out of the box. What they haven't faced is the scalability issues. So that's what we wanna talk about today. Okay, so here's how a frequentist thinks about the world. There's kind of two branches of statistics. Then the frequentist is more about analyzing procedures versus the Bayesian. And here's how a frequentist thinks about analyzing procedures. So ideally, as a statistician, you would have not just one data set. You would have multiple data sets. And you would get your estimate or your prediction on each data set. And then you look at the spread of those predictions. And that would give you an estimate of the error bar. The fluctuations in your estimator. And if there's not much fluctuation, you're pretty confident in the individual estimate. So that's what a frequentist would love to do is have multiple data sets. But you don't have multiple data sets, you only have one data set. So it seems like you can't really think of frequentism as an actual implementable principle. But you can and that's the idea of the bootstrap. So here's the idea, which is that the data came from somewhere. I have a sample of that thing and the underlying generating mechanism. It's a distribution of some kind. And so let's draw it here as a continuous distribution. And let's call that the population. So data arrives at the population. We don't get to see the population, but the Supreme Being gets to see the population. The Supreme Being then generated data out of that population by sampling from that. And that gave us some data. So Supreme Being did that. Now ideally then, we would do this Supreme Being would do this multiple times. The first time, the second time, and the nth time. So we'd get m data sets out of this underlying population. And then you would get your estimator, m times. And then you put that into something like a confidence interval calculation. After you've got all these m values, you can just calculate a confidence interval. All right, so we can't do that. But instead, what we did is we got a sample out of the population. And then you think of the sample not as a list of numbers. Think of it as a histogram. Okay, so a histogram itself is a distribution. It's called the empirical distribution, all right? And in a well-quantifiable sense, it's an approximation to the true underlying distribution. In a uniform sense, in various senses, it's an approximation to the underlying object. So here's the kind of beautiful idea, which is that, you forget that there ever was a population. You take that histogram and you pretend that that's the population. You live in a world where that's the population. You're the Supreme Being in that world. So in that world, you can generate multiple data sets. Because you have a population that histogram, all right? So you generate a data set from that histogram. What does that mean? Well, you take the original data and you sample them again. If there was in data points, you sample them with replacement in times. You get out a new data set in which some of the original data points were copied several times and some of them didn't occur at all. That's what happens when you sample with replacement. On average, 0.632 of the original data points occurred. The guy in the rest of them didn't occur. All right, so that's a data set that came from that population. All right, and then you can now apply the estimator to that and get a number. But now you can do that again. You can resample from this guy again. 
And then you get another resampled data set. You apply the estimator, you apply get a different value. And now you do that a whole bunch of times. And how many times should you do that to get good estimates of error? About 200 is kind of the usual rule of thumb for point estimates. For other kind of problems, it could be a little bit more. Now, the beauty of this is that it's totally generalizable. You can do this for the mean, the median, the support vector machine, the whatever decision tree. You just resample the replacement 200 times on each training set. You apply your estimator, whatever it is, and you get a spread. Moreover, it's totally paralyzable. You do those 200 things in parallel. You send one resampled, one processor, another processor. You send it to your 200 processors. They all in parallel compute their estimators. And then you send them back to get the conference general. So when I would start a couple of years ago thinking about cloud computing and what it means for our world and how we do inference at scale and so on, that's just obviously jumped up in front of my eyes. Well, cloud computing is the perfect match to the bootstrap. So let's think about how we can think about using the bootstrap generically across all kinds of database applications as part of the database. And I still believe that's kind of true to a certain, but there's a gotcha. So let me just first of all summarize. This is a very famous piece of work due to Efron. And this is just a picture of what I've already said. So you take the original data and that replaces the population. You think of it as a histogram and you generate multiple resamplings. The star means a resampling. It's a bootstrap sample and you get your multiple of your estimator and you plug them in. So that's just what I said before in a flow diagram. As I said, you get 0.632 of the original data set. So if I have 1,000 data points, each sub-sample is 600. That's no big deal. What if I have a terabyte of data and I want to get error bars on quantities in a terabyte of data? That's what we need to do to these days. Well, each sub-sample is 632 gigabytes. And yeah, I can do those in parallel, but I've got to send 632 gigabytes out to each of my 200 processors. And that's just no good. That's going to ruin my network. All right, so we can't do that and it's not going to scale to a petabyte and so on, just no hope. All right, so that seems like a real problem. We have this beautiful procedure and it's not many such procedures that are automatic and accurate like the bootstrap. We can't do it on big data. So that makes you kind of unhappy. There's another idea that came out a little bit after the bootstrap, which is called sub-sampling, which seems on the surface to solve the problem, which is the following. You take that original data set and you take a small sub-sample of it. And b is maybe like square root of n. So it's a big reduction. And now you apply your estimator to that sub-sample. And now that's just one sub-sample. You could get another sub-sample. And you could do that again and again and again, say 200 times. You could apply your estimator to each one of those and now you get some notion of spread. The problem is that that's on the wrong scale. The error bars you want on data of size n, you're getting error bars on data of size b and the scale is wrong. And you don't know in general how to correct. If the estimator is a square root of n estimator, you would correct by multiplying by square root of n over b. 
But it's not necessarily a square root of n estimator. You don't know that a priori. So it becomes a lot less automatic. You would have to do some analysis to figure out how to do that, rescaling. That's one problem. Another problem is that it's hard to actually choose the right value of b. This thing is pretty sensitive to the choice of b. So much so that I don't think you'd want to view this as an automatic procedure. And I'll show you that examples of experiment here in just a second. In fact, I think right now, yes. We started doing some work on this because we thought this might be the way to go. So here's a little experiment with 100 dimensional covariate space, 50,000 data points. We sampled one of the ground shoes so we could actually calibrate the error bars. Are they correct? So we sampled from a student t distribution. We do least squares, estimate a parameter vector, and then calculate confidence intervals and evaluate them. And here's the main point of this slide, which is that b is chosen to be n to some power, say, 1 half up to 1, somewhere in that range. We're sub-sampling at some rate, and gamma quantifies that. Here's the results. What do you do this experiment? It's a function of the processing time that you're running this algorithm that's sub-sampling and computing error bars. Hopefully the bars will start to come down. They won't go to zero because you have a finite amount of data. And so here's what happens with the bootstrap. This is relative error. You want it to be small. The bootstrap kind of comes down and starts to stabilize. Sub-sampling, when gamma is 0.5, is way up there. It's just not converging. It's terrible. So the square root of it was too aggressive. If you bring it down to 0.6, you start to do better, but still not good. That's a big difference in the bootstrap. Below that, it actually beats the bootstrap. But then, if you go up to 0.9, it's worse again. So there was a little range at which it worked, but outside of the range, it really failed. And that range will change. It's different on different problems. So that's a problem. That's not a generic solution to this problem. OK, so we still are kind of left without a procedure. That'll do bootstrap style error bars on large data. All right, so we have a new procedure that we think does solve this problem. It's called bag of little bootstraps BLB. And I can describe it. It's pretty simple. So go back to this picture here where I took the underline population, sample endpoints, and then sample a subsample of that. And now, conceptually, what's happened is you actually got this thing directly from the population. It went through an intermediate stage, but this is a random sample from the population also. It's a little smaller, but it's still a random sample from the population. And it's also, therefore, an approximation of the population. And my drawing doesn't make it look like a very good approximation, but think we're B as a million. It's not going to be too bad. OK. And now it's a histogram. Forget that it was composed of B points. It's a distribution. And it's an approximation of the truth. So apply this exact same bootstrap principle as before and sample from that thing with replacement. But you used the bootstrap on that subsample. But now, when you resample, you don't do it B times. That would give you error bars on the wrong scale. That was probably the previous procedure. You resample the placement in times, because that's what the bootstrap should be doing. 
You're trying to get error bars on the scale of n. You've got a distribution. You sample from it n times. What does it mean to sample a histogram which has support on B points in times? Well, it just means you resample the replacement. A lot of those points will occur many, many times. Some of them won't occur at all, but most of them will occur a lot of times. So you record those B points plus the count. And that's one of your 200 data sets. OK, so I'm bootstrapping a distribution that's support of size B, but resampling n times. I'm doing the correct bootstrap at the right scale. I have to rescale in this procedure. So if I do that on one particular subsample, I'm actually getting correct bootstrap error bars. It's actually the correct scale. No rescaling needed, just automatic. It could be noisy though, because it's a small subsample. But why not do it now 200 times? Bootstrap 200 different subsamples. Every one of them is correct. So you might imagine that if you average their results, that would be the bagging part of the procedure then. You'll still get an estimator, which is correct. And that turns out to be true. OK, so that's the argument. These next couple of slides just fill that out. You pretend the subsample is the population. You resample from it with replacement. And critically, the big difference is you resample n times not B. So there's the summary, but I think that I'll skip that slide, just show you the picture. It's a two-nested loop, two-nested stages. Take the original data, you subsample them, and then you get a number of subsamples. On each subsample, think of it as a processor now, you take that subsample, you run the bootstrap as before. And that'll get you finally an estimator of the accuracy, the quality estimator for each subsample. And then you average the quality estimators to get the overall quality. So that's the new idea. And I don't know if I acknowledge my collaborators or colleagues on this. Ariel Kleiner, the first author, a student worked with me at Berkeley. Perna Sarkar and Amit Talwakar are all working with me at Berkeley on this project. Let me show you that this works. This is the same experiment as before. There's the bootstrap, same curve as before. And this is the new procedure for all the different values of gamma. So you see it really is beating the bootstrap actually for all values of gamma. So we think this is much more automatic kind of procedure. So this was on sort of 50,000 data points, modest scale problem. This has since been done in a real distributed architecture on a half a terabyte of data. So this can scale. All right, so there's some theorems here. I'm going to skip them in the interest of time. But we have a paper on all this. If you're interested in, one of the beautiful things about the bootstrap here, a theoretician, is that it actually beats the central limit theorem. It has convergence rate, which goes not as 1 over square root of n, but 1 over n. It's faster than the central limit theorem. It's another amazing, one of the reasons the bootstrap is so popular, so important part of statistics. And this procedure also happens to retain the higher correctness of the bootstrap. We have a consistency result, and I'm going to kind of just skip these slides. But here are some slides which give you a little bit of the outline of the final statement of the 1 over n accuracy of the new procedure. OK, that was the first part of the talk. I'm going to move on now. That was, again, kind of divide and conquer would do parallelism. 
And not just to take a machine learning algorithm, make it parallel. I think that's interesting too, but this was to how do you evaluate procedures and do that in parallel. I think that's even more interesting in some ways, because that's what we have, all these multiple hypotheses. And we're trying multiple models, and we're trying to get their individual notions of errors so we can calibrate and make decisions. OK, so that was one talk. Now, it turns out these slides are on separate files, because this work is also new. It hasn't yet been put in one file. OK, now that's the third part of the talk, the phylogeny part. There was another thing right here. There we go. OK, here we go. So let me now find the full screen mode. OK, good. OK, so I'm going to only tell you about the first part of this on matrix completion, and that's about the stuff on Stein's method. There are a list of collaborators here, several who were collaborating on the Stein's method part. The collaborators on the part, I'll tell you about a Lester Mackey and a Meet Tall Walker. And this is a very, very simple idea, and I'm just going to kind of mention it briefly, because it's a divide and conquer idea, and it's really simple. And valuable in practice, and I just think it's kind of worth chatting about it a little bit. So this is a approach to matrix completion that we call divide factor combine. So the matrix completion problem, very popular recent years, it's that you take a matrix in which many of the entries are unobserved, and you want to fill them in. So the Netflix prize is an example of this, where this would be the users, and there's the movies. These are ratings, and people have rated only a small subset of the movies, and you like to predict their ratings on movies they haven't seen yet. And there are many other examples of this problem. There are really nice algorithms that have some very strong guarantees, that if the matrix has certain properties, which I'll briefly mention here in a second, then you can recover, with high probability, the exact matrix. The problem for this talk is that they all rely on singular value decompositions, and in particular a truncated SVD, which doesn't scale, to the really big problems that we're interested in. It's a cubic algorithm, it just doesn't scale. So we've got to deal with that. So we're going to do a very simple divide and conquer, really just take the matrix and break it into pieces kind of algorithm, and then we're going to prove something about that, that it also provably correct algorithm under the same conditions as before. OK, so I'm going to skip really quick through some slides here, because I want to get to the third part, and there's kind of a lot of notation that I don't want you to absorb. I just want to give a high-level picture here to tell you a little about the theorem, and there's again a paper on this, if you want to dig into some details. So the basic story is that you have to make some assumptions about the matrix. You've got n times m degrees of freedom, and you have a very small subset of the observed entries, so without assumptions you can't fill in the matrix. So what do you assume, one of the more popular assumptions that does turn to work in practice is that it has low rank, very low rank. So you assume that can be factorized as a thin column matrix times a thin row matrix. 
Now that's not enough, you just have low rank, you have to have some more properties, particularly you can't just have like whole columns missing, you wouldn't be able to fill in the column. So you have to kind of assume some form of a uniform at random or some other sampling model. This is a particularly common one. It's not the only one. And another thing you have to assume is somehow the information is spread out about the structure, the rank. So you can't have, you don't want to have matrices like that, where if you just didn't sample the one you would learn nothing about the matrix. It would look like all zeroes. So you don't allow that kind of thing happening by having some kind of a coherence assumption. This is just one of the ways that it's been formalized to have a matrix be incoherent with a standard basis. It's a spread of information you guarantee you have spread. And if you make those kind of assumptions, then you solve the following optimization problem. You'd like to solve the one at the top there, minimize the rank, subject to match the entries that are observed kind of constraint. You can't, that's an NP hard problem. You relax it to a minimize the trace norm or nuclear norm, again subject to the same constraints. And that's a convex problem that you can solve in polynomial time, but unfortunately a high degree polynomial time. Anyway, there's a theorem that says if you solve this problem, then with high probability you get the actual matrix as if you'd solve the original problem up there. So here's an example of such a theorem. It says if the matrix has some properties, if you sample at a certain rate which is n log squared n, very nice, not nm, then you actually get an answer which has quality, was close with high probability of the true answer. OK, now, again the problem here is it's going to have to, it runs a truncated single divided composition so it doesn't scale. So what can you do? So we're going to do a very simple divide and conquer algorithm here. We're just going to take a big matrix, I think in the next page I show you. Yes, here I just want to show you the algorithm. Take a big matrix and you divide it c1, c2, these pieces. And each one of those is now a thinner column matrix. And then you take the existing matrix completion algorithms on each column matrix independently. So you do this in parallel. And you do matrix completion on the column matrix. And that gives me c1 hat, c2 hat, blah, blah, blah. Now I need to aggregate all that information. And how do I aggregate that information? Well, I project that onto one of the column spaces. So in this case c1. So I take all the matrices and I project that onto the column matrix, column space of c1 and that aggregates everybody together in the same way. So it's a map reduce kind of thing. You map out, everybody gets a little small matrix, a new matrix completion on. You get a factorized answer and you take the factors and then put them all back together on a particular column space. And now you can do this not just on c1, you can do that on multiple column spaces and get kind of an ensemble method, which is what we actually do in practice to get the best results. OK, so there's a very simple algorithm, very natural. Does it work? First of all, there's a theorem that says that it works. You get basically the same kind of rates as you get for the full matrix completion, sampling and advantageously for a small fraction of the columns of the matrix. So nice theorem. 
And if I were to give the third part of the talk, the more of the theory part of the talk on Stein's method, it would be how do you prove theorems like that? So there's a nice general idea based on something called Stein's method that allows you to talk about large deviations of random matrices that applies to problems like this and lots of others. Anyway, I'm not going to give that part of the talk. That was, again, just publicity. So it does work. And let me just show you that it works in practice. So here is an example. If I reveal 2% of the entries, very, very, very sparse data, I'm revealing very little of the actual overall matrix, the base matrix completion algorithm comes down does really well by 2%. It gets eventually better and better as you reveal up to 10%. But amazingly good, only two. Here's a bunch of versions of this new algorithm. I've told you about basically projection ensemble. And that's these blue curves here. These are some others which work less well. But at about my 4% of the data, it's performing as well as the base algorithm. At 2%, it's already starting to get, one of them is actually just as good as the base algorithm. So it really, working on the kind of classic problems that we would hope it would work on. And then here's the main take home, which is that there's your base matrix completion, and this is now time. And so this is complexity. And you're seeing the cubic growth that you would expect out of that. And here's all the new algorithms. And so they're growing, inevitably. This won't go forever. But that gives you quite a significant, makes a real practical algorithm. So I view this actually as the existing best method for large scale matrix completion. We did this on the Netflix data set. And so this is actually a really good, large scale problem, the full data set. It's 100 million ratings, 17,000 columns, and 480,000 rows. And we compare to the best single method, which is an algorithm known as APG, which is one of these SVD-based algorithms. And it gets an answer, a root mean squared error of 8.8433. And here's some versions of this new procedure, which do achieve the same error rate at an order of magnitude faster, shorter amount of time. OK, so that's part two of the talk. And I didn't connect part three that much, but just to summarize them briefly is that I believe this is not a pretty common paradigm. You take a fairly simple procedure. You paralyze it in a pretty naive way. And then you have to do some theory to make sure that you haven't lost something deeply along the way. And I actually think a lot of these theory all involve random matrices. And that's why I like Stein's method. It's a very nice method for doing it. OK, so I'm going to skip that part of the talk then, and now return to the third part, which is over here. Yes, OK. And again, a knowledge collaborator here. This is Alexandra Bouchard-Cotet, who is up in UBC. A professor of statistics up there. And again, I need to find a view. Oops, I'm not in the wrongs. Now, here it is. Oops, that was a mistake. I'm not good at PowerPoint. How do you, yeah, I remember how you do this. You go scroll through and you do that. I think you do this thing, right? Yeah, good. OK, so third part of the talk. Wash your brain or everything I said before. There's another divide and conquer idea, but a different divide and conquer idea. And I like this one probably the most because this is the most subtle one. This uses probability theory, not just kind of naively splitting things up. 
So this splits things up in a probabilistic way. And what does that mean to split things up in a probabilistic way? Well, basically that means the Poisson process. The Poisson process is even more beautiful in a Gaussian in some ways if you're a statistician or probabilist. It has all these combinatorial properties. And you can do a thinning of it. You can take a Poisson process, which is a bunch of random points, and you can sub-sample them randomly. And you get a thing object which is itself a Poisson process. And so it just leads itself to divide and conquer thinking. So you're going to see a problem here, which I think is a pretty interesting problem of all of your phylogenetics, which it doesn't seem to have any divide and conquer strategy. It just seems to be all tangled up and leads to dynamic programming algorithms, which are hopelessly complex. But if you think about it, for what we want to rearrange things, you see a Poisson process emerging. And then you can divide and conquer, and you can solve the problem. So that's the message of this part of the talk. And the other reason I talk about it is just that phylogenetic analysis is a pretty interesting compelling problem. It has to do with finding trees like this, but also aligning data. And alignment problems come up in all kinds of fields. And my field doesn't tend to address them nearly as much as we should. We assume the data are already aligned by somebody else. We have a design matrix in which the columns have a meaning, and all the data all come in, all aligned and everything. And that's just rarely the case in real life. So you want to both align and do your inference. And so phylogenetics kind of requires you to think about that issue. All right, so with that as background, what is the phylogenetic analysis problem? I've got, in this case, just four species. And I want to find a tree. And I want to find links of the branches that reflects the evolutionary distance among the species. And so this is often modeled as there's a random tree, uncertain tree that I'd like to infer. And the branch links are parameters of something like a continuous time Markov chain that represents the evolutionary path of some procedure as things mutate and insertions and lesions occur. All right, so now this problem is really hard. And most of the literature on it has made some real simplifications. And here's one common simplification. Let's suppose that for every one of my species, I only had one character, i.e. the DNA, I only took one letter, one site of the genome. Not a very good model. But this is what you did with dinosaurs. You maybe measured a small handful. They were called characters. Maybe the size of the cranium and the size of the foot or something, like two or three things. That was enough. Nowadays we have genomic data. We don't think this way anymore. But this is kind of the heritage of this field, was to think about a small number of characters. So let's think about the case where we have just one character per species, like ACGT. Well, then I have a so-called graphical model. I have a probabilistic model here in which the existing species are the leaves of this tree. Those nodes represent multinomial nodes. They could be in one of, say, for DNA, in one of four states, ACG or T. And there's an ancestral organism that was in some state. It's not shaded, meaning it's unknown. And then there was a mutation that occurred. And now two new species arose. And they, at some moment in time, had ACG or T. 
Those sort of mutations occurred along the path there. And they're still unshaded because they were not observed either. Then there was another, there were two more splittings, speciation events, that led to ACG and T in four existing species, which we did observe. So we'd like to infer the tree, where were the splittings, and also here I have branch links. Here they're buried in the formalism, which there's a transition probability there that's just parameterized by a branch link that's just buried in the actual formula used for the edge probability. Whereas in the phylogenic literature, you draw it out. If you did that, then that's, this is kind of what you learn about in graphical models 101. You learn how to do the EM algorithm on this and estimate the parameters and estimate the tree. And it's kind of an easy problem. It's a tree. Now, you don't have just one character. You have string-valued characters. So, well, how can you treat string-valued characters? Well, if all the strings were the same length and they already came in pre-aligned, and if you assumed that every, let's do it this way, they came in align, every column of the alignment was independent of every other column, if you made both those assumptions, then it's just in independent graphical models. And you usually put a box around that, or a k. k independent graphicals, you put a box around that to talk about replicates of a basic graphical model. And now the probability structure of this is the product over the individual probabilities. And you take the logarithm that becomes the sum, and EM algorithm, and everything goes through like before. No, nothing new happens there. Then this is what's really done in the literature. This is if you pick up a book on phylogenetics, you learn everything I just said. You learn, you know, it takes a few hundred pages to describe it all, but that's what you learn how to do. You write down the likelihood for that model, run the EM algorithm, and estimate parameters and tree. Anyway, I hope you agree that this is a way too simple. The real problem is that we need to do multiple sequence alignment. We have these strings coming in, and they don't have the same length, and more they're not aligned a priori. We've got to find how they align as part of the problem. So there's two representations of alignment. This is kind of nicer in some ways, and that's the one you often will see. OK, so we want to find what are called homologous nucleotides, which were actually had an ancestral nucleotide they both arose from. So the holy grail of the field has now been, for quite some time, about three decades, find the tree and find the alignment jointly. And so kind of one data structure to think about that, and it's kind of a tree where I have these paths, and those paths are the homologous edges. So I need to find all those paths, and do the alignment together with finding the tree. OK, so there has been nice work on this problem. Pretty sophisticated people have worked on this problem. And it was a beautiful paper by Thorn Eddall in 1991, truly got this started as a formal subject. And so they had a little continuous time Markov chain along paths in a tree. So here's just to take one edge of a tree. I have ATC at the top. And I want to evolve that forward in time, allowing for insertions and deletions so I can get alignment issues. If I had no insertions and deletions, I wouldn't have alignment issues. I need to allow that. So they have a little model that gives you insertions and deletions. 
So how does it work? Well, it's just a little Markov chain, continuous time Markov chain. So you have these alarm clocks at every site. And they're running independently in some exponential amount of time. And the first one that rings determines what kind of event you have, and then you make that event occur. So you either have an insertion where you put a new nucleotide between the existing ones, or a mutation or substitution, just change the nucleotide to something else, or you delete a nucleotide. So in this case, the insertion clock was the first one, so you insert something to the left of the A. And what do you insert there? Well, you sample from some distribution, usually the stationary distribution, and say we've got a C. And so now the new string is C-A-T-C. And so you evolve forward in time, and you get a string-valued Markov chain. Pretty simple, nice string-valued Markov chain. All right. Now, that's one path, that's one branch of the tree. And you think you could put that together on all the branches and take the product over the whole thing, and it's just a tree. It should be easy and nice and tractable. But if you thought, that's wrong. And the reason is, it's all this homology. So I've kind of figured out, as I walked on that branch of the tree, there's homologous who linked to who. And now as I go down some other branch, I've got to remember all that homology structure in making homology links for this part of the tree as well. And in general, I have to remember everything about this part of the tree to set up the homology for the new branch of the tree. So the state you need for any given branch is the rest of the tree. And in fact, you can write this out formally. You can turn this whole thing into a hidden Markov model, or the state is exponential in the size of the tree. It's all the stuff you have to remember to make homology decisions along any given branch of the tree. So it all couples into a nasty, tangled mess. And this has been realized, and there was a paper just in 2005 that really made this form, what gave a lower bound kind of argument that showed that exact computation of the total probability of the tree and the alignment, they use m for alignment, is exponential in the number of observed tacks of the leaves of the tree. So that's a killer, exponential in the number of data points, really. That's just no, that's a killer. So people can run this algorithm on about 10 species, far below what we really want to do for lots of biological problems. So why do you care about that probability? Well, when you're actually trying to figure out which tree, you have some kind of a Markov chain money car algorithm that starts at a given tree and then looks around for other trees nearby, decides whether to jump to them or not, and moves around the space of trees. And to evaluate whether you want to jump or not, you calculate the probability of the tree alignment here, the probability of the tree alignment here, and go uphill or not, depending on the ratio of the probabilities. So you need to calculate that probability. OK, so I'm nearing the end here, not much more to say. I'm going to show you one last algorithm here. OK, so that's a problem. And so both kind of intuitively, this is a dynamic programming procedure, a hidden Markov model, is needed here, and it has states exponential. It just seems like a nasty problem. And so what people have done in the intervening three decades is kind of, I guess that's only two decades, is think about approximation procedures. 
It's a dynamic program. Maybe I can nearly approximate it in this way or this way. But that's not always the right way to think. This is a cartoon model of biology. Real biology isn't that. And I can think of some other cartoon model of biology, which has maybe a nicer property, that might also be useful in the same way this model is potentially useful, but not really practicable. So that's a pretty common statisticians point of view. You don't take some God-given problem you have to solve. Dick Harp said you have to solve that problem, and let's find all the best approximations to it possible. No, the problem is to understand the biology. So you make a model of the biology, and there are many models. You don't have to work with that model. So this model's been around. It's been the canonical model for a while. You don't have to really work with it. OK, so anyway, long story short, there is another model, which is kind of close to this one, but in a way simpler, in some ways dumber, one could argue, but has a divide and conquer solution. And it's based on the Poisson process. So here's this other model. You're not going to see the Poisson process yet here, but there is one. So this model says there's only one insertion clock. You pull it outside. It's got a global variable. And it competes against all the others. But when it runs, it'll run for some amount of time. If it's the first clock to go, then you're going to make an insertion uniformly random on the current string. So you can still get insertion events anywhere on the string, and they happen at some rate, and you still can determine that rate, because that's a free parameter. But you don't get a length-dependent insertion process. So in that sense, it's a dumber model than TKF, where the deletions and the substitutions occur just like before. OK, so why would you go out of this model? Well, if you go to this model, it turns out there's a Poisson process. So there's another description of that process. So now this takes, there's a little jump here, where you have to write down some mathematics and prove some things. What you can show is there's another description of that model, which is really simple, is that you take the tree and you randomly sprinkle down onto it, according to a Poisson process, uniformly random on the topology of the tree, some insertion events. In this case, we've got three of them. So that's just a Poisson process. Can't be anything simpler than that. And then you treat each one of the insertion events independently. And what does that mean? So in this next picture, this is the main picture here. So we have three insertion events that were thrown down on this tree. So I'm re-describing in a completely different language the process I told you about. You can't see the connection. It's not an obvious connection. So I think this is a brand new model I'm describing to you. I threw down some insertion events. I picked one of them. Let's say I picked X2. All right, so starting up at X2, I'm going to now go down the tree and there are no more insertion events on that tree, because I'm treating them independently. There's only one insertion on that tree. So what can happen on that tree if there's no more insertions? Well, I can have a substitution in which the color changes, say, from red to green. Or I can have a deletion in which the thing goes black. And if I have a deletion, because I have no more insertions, that thing stays black forever. So this is a death process. It's a mutation and death process on a tree. 
All right, so that was for an insertion event, X2. And now I conceptually do the same thing for the other insertion events. So X1 is over there. I take X1 out of the tree, and it's subtree below it. And then I run forward in evolution to have a mutation process and a death process. And I similarly do that for the other. All right, now it's a plus on process, so I can do these things independently and put them back together. They don't interact with each other in any way. That's the beauty of this. And now a death process on a tree is a really simple calculation. It's just a matrix exponential. It just decays away according to the eigenvalues of a matrix exponential. So long story short, this can be all glued together. And you can do the inferential computation of probability of t, com, m. And so the next three slides have that argument mathematically. And I think I'm going to just put it real quickly, but I hope you get kind of trust me that there is divide and conquer here because of the Poisson process. And therefore, if you figure out what you're dividing and conquering, you can kind of get the complexity of this, and it turns out to be pretty simple. And so here is the argument for the more mathematically inclined is that this Poisson process is a property of exchangeability, which is kind of uniform at random. And so the ordering of the leaves doesn't matter. If I swap them around, I get the same probability distribution. That's called exchangeability. And so the probability of a total alignment, all the columns are exchangeable. Therefore, it suffices to calculate the probability of a single column in the alignment by exchangeability. So that's one part of the argument. The other part of the argument is the following, which is that as I'm running an algorithm to compute this probability, I'm going to do something like the following. It's a sampling algorithm. I'm going to take this current state of the tree, and I have a bunch of insertions somewhere. I'm trying to decide whether to put in a new insertion or not. So I'm going to look at some part of the tree and decide where to put an insertion there. So think about a Poisson process on the real line, which is where you probably learned about it on. And I suppose I'm doing inference under a Poisson process. So I might say, between 1.5 and 2.5, I want you to put on an insertion there. And so you have to pick out where. I've told you, between those two points. And because of the property of the Poisson process, you have to put it uniformly at random. That's just the fact about the Poisson. Once I've told you where, then it's uniform in that interval. Same thing happens here. If I told you I'm looking at some branch of the tree, then you put it uniformly at random along that branch. Anyway, and then there's a third property which I already alluded to, which is that this is a death process that, therefore, the likelihood for a single insertion as you go down the tree, you calculate the probability of all paths as a matrix exponential. So if you put those last three slides together and calculate the total number of operations you have to do, you find that this over algorithm is only not exponential. It's actually linear in the number of observed taxes and the number of columns in alignment. So bang, really big difference just by making a simple modeling choice change. So we've done some experiments on this. 
These are some early experiments comparing to some existing phylogenetic inference procedure, something called phi ML, and an existing sequence alignment procedure, kind of a standard cluster. These are not the best things in the world, but they're pretty standard baselines. And we generated some artificial data and evaluated according to some standard metrics. And the improvements in the new procedure over the baseline algorithms, or 27%, on the tree side and 43% on the alignment side. So we're getting interestingly, tantalizingly interesting improvements for both tree and alignment inference, and we're doing it jointly, not solving one problem or the other. I think I'm going to skip that slide. All right, so last slide, which is, again, kind of returning the status to this point of view, OK, you went to a dumber model, but convinced me you didn't lose the throw the baby out the bath water. And here's a kind of surprising fact, which is that if you think about the stationary distribution, if you run this thing forward a lot, a long period of time, it'll arrive at some stationary distribution on strings. It turns out the stationary distribution of the new process is exactly the same as the TKF model, surprisingly. So somehow the TKF model is not using all of its parameters. It's having insertions and deletions occurring in some way. They're somehow balancing and canceling. So the over probability, you don't have to have all those parameters to capture the stationary distribution. So this new thing is somehow, in some way, more attuned to that stationary distribution. Do you think that's a reasonable stationary distribution, which may or may not be true? Anyway, that's a very interesting point that kind of makes this seem less crazy than you might have thought. And then, moreover, it's a possible process. You can superimpose other probability processes on top of it and make it random in various ways. You can go the ways that, as statisticians, we often go. Make this more interesting and elaborate. All right, so I'm looking at the time there, and I think I'm finished. And I think I'm now going to just kind of bounce back to the top level. That slide was written a few months ago, in which there were no technical reports. And there are now technical reports on all the pieces of this on my website if you're interested. OK, so big data, this is kind of a talk about that in some ways. It's kind of principles of approaching big data problems. And you've probably read a lot about big data. There was an announcement from the White House about large amounts of money being given to certain universities in California to do big data, and lots of more money available for all the rest of you. And I believe it's a real problem. I think that if I spend more time now consulting than I used to, because when I go into real companies, they all have big data. And all their new business model is built on doing things with that data. And they have no idea how to do that thing that they want to do. They try simple things that at a dudden scale, they don't have off-the-shelf stuff. And more, if they do it, they get big error rates, and no one likes it, and et cetera, et cetera, et cetera. They just have huge problems. So they say, well, what do you really want me to do? They say, well, give us some more students who are trained in computation and statistics. That's the first thing they want to do is hire our students who have a little training in computer science and a little training in statistics. 
All right, so I'm looking at the time, and I think I'm finished, so I'm now going to just bounce back to the top level. That slide was written a few months ago, when there were no technical reports; there are now technical reports on all the pieces of this on my website, if you're interested. OK, so big data. This has been, in some ways, a talk about that: about principles for approaching big data problems. And you've probably read a lot about big data. There was an announcement from the White House about large amounts of money being given to certain universities in California to do big data, and lots more money available for all the rest of you. And I believe it's a real problem. I spend more time now consulting than I used to, because when I go into real companies, they all have big data, and all their new business models are built on doing things with that data. And they have no idea how to do the thing they want to do. They try simple things that don't scale. They don't have off-the-shelf tools, and when they do manage something, they get big error rates, no one likes it, et cetera, et cetera. They just have huge problems. So I ask, what do you really want me to do? And they say, give us more students who are trained in computation and statistics. That's the first thing they want: to hire our students who have a little training in computer science and a little training in statistics. They think those people can come in, look at the problem, solve it on its own merits, and develop scalable solutions to inference problems. So that's actually the main thing that drives me towards working on this. I believe this is where the world is heading, and I believe we need to do a good job at it. I think that data can be a boon and a bane. It can give us wonderful new personalized services, and it can also lead to lots of really bad decisions that are going to ruin our lives. So we've got to find good engineers and socially conscious people to think of ways to make it less of a bane and more of a boon. But intellectually, I just think it's really fascinating, because it brings together the treatment of uncertainty and computation. I mean, what AI has always tried to do is talk about knowledge and how you get knowledge, and most of that has been about representation. Well, I can have a representation of nonsense: a rich language in which all the stuff inside is nonsense, not calibrated to the real world. And the learning people have said the calibration is important: I've got to get it to make good predictions against the world. But then their data structures are often not that interesting. So we really have to combine the two. We've got to think about statistical principles that give us knowledge out of really large databases, knowledge in the statistical sense, which really means error bars that are calibrated to the real world, and we have to do that at scale. I think we're far from doing that. There's going to be decades more work on this topic. Thank you.

Do we have time for some quick questions? Yes.

I just want to ask a quick clarification on that portion. The Netflix test, was that on their final data? And did you have the answer set available to compare against?

This is on the published data; I don't remember if it was the final, final data. The person who actually ran the experiment is named Lester Mackey. He was on the team that came in second in the Netflix Prize, so he's a real expert on it. This is the standard public Netflix data set, and we held out data to evaluate on.

My question then is: your presentation gave some examples of how to handle big data, that is, look at it, apply some divide-and-conquer approach, or otherwise make it computationally feasible to come up with a solution. Are there general principles that you employ when you're looking at big data problems, to figure out how to divide and conquer a particular data set, or is it more complicated, each one so nuanced that you really have to look at it?

More of the latter. I don't have any general principles other than 400 years of statistical background and 50 years of computer science background that lead you to approach problems in new ways. But for any given really big problem, if you want to get a lot out of it, you've got to spend a lot of time on it, and you've got to use these tools to attack it. Most data sets are heterogeneous: they were sampled in different ways, at different times, by different people with different goals, and there's missing data, all kinds of issues.
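As one concrete instance of that divide-and-conquer pattern (my own toy illustration, not the algorithm from the talk): split a matrix into column blocks, solve a cheap low-rank problem on each block independently, and stitch the results back together. Real divide-and-conquer matrix completion methods work with missing entries and add a combining or projection step; this sketch only shows the splitting pattern on a fully observed matrix.

```python
# Toy divide-and-conquer sketch (my own, not the talk's algorithm): split by
# columns, solve each block cheaply, then recombine the blocks.
import numpy as np

def lowrank_block(block, rank):
    # Cheap per-block solver: truncated SVD of one column block.
    U, s, Vt = np.linalg.svd(block, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def divide_and_conquer_lowrank(M, rank, n_blocks=4):
    blocks = np.array_split(M, n_blocks, axis=1)  # divide along columns
    return np.hstack([lowrank_block(b, rank) for b in blocks])  # recombine

rng = np.random.default_rng(0)
M = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 400))  # exactly rank 5
approx = divide_and_conquer_lowrank(M, rank=5)
print("relative error:", np.linalg.norm(M - approx) / np.linalg.norm(M))
```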
And many of the people who work with really big data, and I really don't, I'm an academic who nibbles at it, but the people who really do: they spend a huge amount of time looking at their data and thinking it through, looking at different pieces of it, thinking about what models would be appropriate, and then testing and validating and all that. It's inglorious work, to some degree, but it's what you have to do to really get the kind of results you would hope for. And at Google, behind most of the lovely things you see on the web there's a lot of data analysis, big data stuff: all the spam work, making sure the web pages aren't spammy, et cetera, et cetera. There are lots of data-based decisions, and careful data analysis is what all those MapReduce jobs are doing. So yes, there are principles. We're training another generation of good engineers who go out with some rough-and-ready principles in their brains and some tools. But they go out, and they have to solve each problem on its own merits. I'm not the kind of person who believes there's one big box you throw the data into and out comes glory.

In the what part? The last part? Yeah, so it's called the thinning theorem of Poisson processes: if you take a Poisson process realization and you sample it randomly, the resulting object is itself a Poisson process. If you want to learn about the Poisson process, and I do like to educate myself and the people around me, there's a beautiful book by Kingman called Poisson Processes, a little thin book. It's not elementary; by the end of it, you get into some pretty impressive, deep theory. If you like discrete math, it's really a beautiful thing to read, and it gives you a lot of tools. The thinning theorem, the superposition theorem, and the marking theorem are all part of that. I consider them part of the core vocabulary of a current working probabilist and statistician.

Yes. About which one? BLB? Yeah, good question. If sample sizes are small, I would probably just use the bootstrap. It's vetted and simple. But I think this can actually be faster than the bootstrap in terms of computational speed, not just on a distributed architecture but even on a single machine. The bootstrap, in some sense, is at one end of the spectrum, where it takes the entire data set: if you think of the bootstrap as a special case of this algorithm, it's the case where gamma is one. We're getting more diversity through our subsampling procedure, and diversity can be good, especially if you do the averaging. So in allocating your resources, this algorithm gives you more degrees of freedom. Plain subsampling, why does it not work so well? Really, it's because when you take little tiny subsamples, they're diverse, and you get a lot of them, but they're all really noisy, so you don't do so well. Then, as they grow, you start to do better. And when you get to really big subsamples, they're not noisy anymore, but they're not diverse; you don't have enough of them. The bootstrap in that regime does fine, because it's doing the resampling with replacement. So our thing is getting the best of both. The bootstrap sits at one end of that spectrum, and it can't be right to be all the way at that end. So that's the intuitive argument.
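For readers who want the gamma knob spelled out, here is a minimal sketch of the Bag of Little Bootstraps recipe being described. It is my own illustration with invented parameter choices: draw subsamples of size b = n^gamma with gamma < 1, resample n points within each subsample via multinomial weights (so an n-point resampled array is never materialized), and average the per-subsample error bars.

```python
# Minimal Bag of Little Bootstraps sketch (my own illustration; gamma and
# the replicate counts are invented, not tuned values from the talk).
import numpy as np

def blb_stderr(data, stat, gamma=0.7, n_subsamples=20, n_boot=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(data)
    b = int(n ** gamma)                    # little-subsample size, b << n
    errorbars = []
    for _ in range(n_subsamples):
        sub = rng.choice(data, size=b, replace=False)   # one subsample
        reps = []
        for _ in range(n_boot):
            # Resample n points from the b-point subsample as counts, so we
            # only ever touch b numbers per replicate.
            w = rng.multinomial(n, np.full(b, 1.0 / b))
            reps.append(stat(sub, w))
        errorbars.append(np.std(reps))     # error bar from this subsample
    return float(np.mean(errorbars))       # average the error bars

def weighted_mean(x, w):
    return np.sum(w * x) / np.sum(w)

data = np.random.default_rng(42).normal(size=100_000)
print("BLB stderr of mean:", blb_stderr(data, weighted_mean))
print("theory (1/sqrt(n)):", 1 / np.sqrt(len(data)))
```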
Yeah, this was not so much a theory talk today, on lower bounds and all that. For matrix completion, there are lower bounds, and they match the upper bounds, so this was really about the constants in all those bounds. These are all going up at cubic rates, but I don't want a cubic rate that's so bad that when I get to real problem sizes, I'm hopeless. So we know the character of this problem; it's theoretically kind of handled. But I want to bring the constants down to the real world. That's what this talk has been about. BLB is a bootstrap-type algorithm, and there's long-standing, beautiful theory on that, so we're kind of just borrowing it. And those aren't bounds; those are based on Edgeworth expansions. They're good asymptotic approximations, and they tell you about the convergence rate, 1 over n, which is a fantastic convergence rate. Again, I'm now asking the computational question: how can I get those error bars in real time?

Yeah, those experiments were all on single cores. We're actually kind of hurting ourselves: we're not getting any of the advantages of the distributed implementation that we could use. In a distributed implementation, we're going to way beat the bootstrap in that sense. You can't run the bootstrap on a terabyte of data; you just can't do it. We can, and we've done it on more modest problems, and if we do it in a distributed way, we're going to do really well. So maybe I take that back: even on smaller data, if I had this as a black box, and we're trying to provide that now, I might use it on essentially any problem. By the way, I should mention a caveat: the bootstrap does not work on all problems. There's theory, which is partly why some of the theory here was developed, showing it fails on a certain class of problems. In particular, if you're trying to estimate a distribution with a finite range, a finite support, you might be using the max of the observed data to estimate the upper end of that support. The bootstrap will not give you correct error bars for the max. It'll work for quantiles and all kinds of other things that are close to the max, but it won't work for the max itself. So our procedure won't work for that either.
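That failure mode is easy to see numerically. A minimal sketch (my own, with made-up data): bootstrap the sample maximum of uniform data and notice that the bootstrap distribution piles up on a single point.

```python
# Minimal sketch (my own illustration) of why the bootstrap fails for the max.
import numpy as np

rng = np.random.default_rng(0)
n, n_boot = 1000, 2000
data = rng.uniform(0.0, 1.0, size=n)     # true support is [0, 1]

boot_max = np.array([rng.choice(data, size=n, replace=True).max()
                     for _ in range(n_boot)])

# A resample misses the original max with probability (1 - 1/n)**n ~ 1/e,
# so about 63% of bootstrap maxima equal the sample max exactly: the bootstrap
# "distribution" is degenerate, while the true sampling law is continuous.
print("sample max:", data.max())
print("fraction of bootstrap maxima equal to it:",
      np.mean(boot_max == data.max()))
```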
What if the set of columns is much larger than the set of rows, or the other way around? Right. If you look at the rates, they have m and n appearing in them, and if those are equal, everything is symmetric. If they're not, you can look at the form of the rates. In our algorithm, I told you about columns because I'm thinking about wide matrices; if the matrix is tall instead, I'm going to divide it up the other way. Or you could do both. All the math is agnostic to that, really, but in the algorithm you should prefer the division that gives you the simple components in the divide and conquer.

Yeah. In the last part of the talk, you didn't appeal to biology to justify the Poisson; you didn't mention anything about biology. I suspect you would be less pleased with the model if the stationary distribution didn't match up with the existing one. There are so many possible simplifications. Now, obviously, you've got great computational advantages, and your performance didn't seem to suffer; that's a great positive. But as a biologist, a scientist, somebody who cares about the science aspects of things, what do you tell them?

Well, this is a cooperative endeavor over many, many years; this is not the end of the story. What we're trying to do now is provide much better phylogenetic inference. The way phylogenetic inference has been done so far is the one I mentioned: you assume the alignment's already given by some other ad hoc procedure, you throw it into this algorithm, you get out some answers, and then the biologists look at these trees and try to do biological inference. We want to make that chain a better inferential thing, and so we are contributing to that. But then you also want to make it more biologically realistic. Eventually, we want the whole thing to really be the biological model, and then to invert that for the purposes of inference. So what's good and bad about this one from a biological point of view? Well, the insertion rate being independent of the string length, I think that's kind of bad for lots of processes: if the length gets really, really long, I ought to see more insertions, at least if insertion is endogenous to the DNA. That's part of what the TKF model gets right: it doesn't allow some strings to get really long while others stay short; it tends to favor things of roughly similar lengths. On the other hand, there are things like retroviruses, where the insertions come from the outside, so the insertion rate depends on that outside source and doesn't have anything to do with the string length. And there our model is actually the better biological model. So you start to think about those issues as you make modeling efforts, and you realize, aha, here's some biology, here's some non-biology. No one model is going to be optimal for all situations, but you work in tandem with biologists on these kinds of things in the long run. And biologists love these tools. They love to download a piece of software that will do phylogenetic inference, and then they run with it. So we've also got to provide tools; we can't just sit looking over their shoulder on every single problem. Thank you.