Almost everybody is back on time. That's great. And yeah, so here we go with the second half of today's lecture by Miles. OK, great. Welcome back, everybody. So thanks for enduring the first lecture with a lot of technical parts, like I mentioned. But I hope it was interesting. I think, based on the questions that I got in the break, a lot of you found it interesting. So I was glad to see that. So let me get my pointer connected back here. OK, great. OK, so in part two of today's talk, we're going to switch gears. So the first part was this algorithm I called tensor train recursive sketching. That's the name of that algorithm. That algorithm, as you'll remember from just a few minutes ago, starts from a pile of data. So you have this data from somewhere, maybe you sampled it from the true distribution, or you just found this data, and the algorithm processes it down and tries to estimate the true distribution. So that's an example of machine learning from data, and you'd call it generative modeling, because afterward you have a probability distribution, and you can sample from it, generate new samples from that distribution. So that's that setting. This talk is going to be about a different tensor network algorithm. So the point of today is to show you something new, unlike yesterday, where I focused a lot more on just building an architecture, building representations of data, but then at the end kind of punted a little and said, how would we train them? Well, you'd just use gradient descent. The idea of today is to say we can do these really new and kind of radical algorithms that are just super different from what you would traditionally think of, and they might bring new capabilities. So we already saw these sketches, which let you get some insight into the properties of the data and use that to use the data better. This talk will be about something different even from that, a totally different algorithm. So, you know, wipe the slate clean. We're starting over with another one, called tensor cross interpolation. And this algorithm has a different job. Instead of starting from data, here you have kind of a higher bar: you actually have to start from a function. So on the one hand, you don't have to have data to start with. You could start with zero data. But on the other hand, you have to have something a little more demanding, which is a black box function. By black box, I mean that the code here below doesn't know what's inside the box. It just has to be able to reach into the box and pull out values on the fly, as it needs them. And you don't know a priori where it's going to ask for values. It's like a smart algorithm that explores, and it figures out, as it runs: do I need more data here, or do I already have this part of the space figured out? And if it already has that part figured out, it leaves it alone, and it goes somewhere else and figures out another part of the space. So you have to have a function like that. But then, once you have the function, it does a process you could call active learning, where it starts by getting some values from the function. It starts building an approximation of the function that way, and then, based on its own approximation and some more calls to the function, it discovers new important inputs to query, finds out more data, and keeps running this loop until it converges.
And it's a really nice algorithm because it's got this active learning property. You don't have to have a way of sampling the function. Also, what's nice is that it doesn't have to be a probability distribution. It could be, but it doesn't have to be. So the function could really just be anything. It could just be a curve you want to fit. Later, I'll show you many examples. One of them will actually be a function that's some kind of integrand from perturbation theory that is equal to the sum of a bunch of Feynman diagrams. So there are all these various things this function could be. And this algorithm was actually invented in 2010, so it's a pretty old algorithm by now. However, my impression is that, for a while, it was a bit underused. I mean, the people who invented it were using it a lot, and they were using it very creatively, but there were only so many of them. And I think only in the last few years has it really picked up steam. Even in the applied math community, just a couple years ago, there was this kind of explosion of papers using it that I'll show later, where people are using this as a strategy for optimization. And I'll show you a picture later where they're actually controlling a robot arm using this algorithm, some people from applied math and robotics. And then it's gotten popular in some corners of physics recently, too, after this Feynman diagram paper that I'll tell you about came out. Some of the people behind this are Ivan Oseledets, who's also the person who coined the term tensor train, and Dmitry Savostyanov, who has these nice papers about a parallel implementation of this TCI, tensor cross interpolation, algorithm. And then, more recently, there's this collaboration led by Yuriel Núñez Fernández with people like Xavier Waintal, Olivier Parcollet, Jason Kaye, and others. They didn't invent this method, but they've been refining some details of it. And some of them have a new paper coming out soon where they describe the mathematics in some new ways and connect it to some concepts from linear algebra. So there's work going on from these people. Okay, so I'm gonna spend a lot of the first part of the talk just discussing the core linear algebra behind the method. Then we'll see later how we lift it to high dimensions. And what's neat is that at first it's just linear algebra, but when you put it in high dimensions, it sort of turns into machine learning automatically, which I find quite interesting. Then we'll show some applications of it in some demos. And one question maybe to keep in mind through this talk is that this stuff is all still a bit new in a way. I showed you this one algorithm from before that looks really different. This one's gonna look totally different from that algorithm. And yet they're both machine learning and they both involve tensor networks. Could they be connected somehow? I think in the end maybe yes, but right now they just look really different and they solve different problems. So a future question could be: can these start to move closer and closer together? Okay, but let's start back with the basics, the motivation of this algorithm. The motivation is actually this really neat fact from linear algebra that I certainly didn't know until rather recently, which is called the interpolative decomposition of a matrix. And the statement is this.
The statement is: if you're given some matrix M, and it has some structure, like it can't be a random matrix, it needs to have some patterns in the numbers, then you can actually reconstruct the whole matrix on the left here from a subset of the columns. So I'm highlighting certain columns that I'm imagining are somehow the best columns, or most representative or most typical columns. They could be any of the columns, but they're some kind of typical columns. You can reconstruct the whole matrix just from a handful of columns. And we'll get into how many, and where that number comes from. Why three, why not eight? And then what you do is you write it as exactly these original columns from the matrix. And importantly, they're in the same basis. So they're not all scrambled up like when you do an SVD and you have a U and a V. Those numbers are hard to interpret. They're all scrambled up, just some weird numbers. These are the actual entries from the original matrix, so they're highly interpretable. You can look at it and say, yes, that's column one, I know what these numbers mean. These might be movie ratings, or these might be coordinates of some particle, and I know what space the particle is in. So then the thing on the right is like an interpolation matrix, and we'll kind of unpack what it does. For example, some of the entries of this matrix are very simple. This column is just saying: take only the first column here and put it back as the first column of the original matrix, right? Whereas this column, the zero, zero, one, is saying: take the third column here and make it the third column there. This one's saying: take the second column here and make it the fourth column there, right? So those columns, the zero, zero, one and all that stuff, are just saying: put those columns back where they belong in their original places. And all these other dots are just some numbers that you have to work out. What they're doing is saying: take some linear combination of these three and that'll give me column two, or take some linear combination of these three and that'll give me column five, and so on and so on. So that's already a pretty neat idea. We can see that column two, for example, is just some arbitrary mixture of columns one, three and four of the original matrix. That's what this is saying, right? So that's why it's called interpolative: we're interpolating all the other columns. Imagine this original matrix having thousands of columns. It's like we could interpolate all those thousands of columns from just three of the columns, something like that, all right? That's the claim. I mean, it might not be three, it might be a different amount, but that's the idea. So what does this mean? Let's try to get some intuition for why this is even possible. Let's say we have a matrix of movie ratings. People rate these movies. On the rows here are supposed to be the movies: some racing movie, a romance movie, a Robin Hood movie, an action movie, Walt, and so on and so on. And people watch these movies and give them a rating out of five, but sometimes they haven't seen the movie. So maybe we put a zero when they haven't seen the movie, right? Like only one person saw this movie, for example. This person didn't see this movie.
And we take all these people, and imagine hundreds of movies and millions of people, and they all enter their ratings, right? And now your job is to reconstruct this matrix, but you don't want to store the whole matrix. Or maybe you want to predict how new people would rate things, or you see a new movie and you want to predict the ratings a person would likely give that movie. So you want to do these kinds of reconstruction tasks. So the interpolative decomposition is saying, if it holds, that if there are K genres, basically, mathematically, if the rank of the matrix is about K, and I'll tell you why I'm saying about K and not exactly K later, that means there are sort of K explaining factors. It's like if you do a PCA analysis of this matrix, there are K important directions in the space. That means there are K representative people, and that's enough to reconstruct all the other people's ratings. So that's kind of profound, right? It's a little more obvious, in a way, with the K genres that there might be K representative movies, because there might be a typical romance movie where the plot's always the same as every romance movie, or a typical action movie where there's no plot at all and everyone's just fighting, you know? Right, there are these kinds of things. But with people, it's less obvious. We all think we're unique, right? We all think we're totally different from everyone else. But in a way, it's saying that just the fact that there are genres of movies means that, sorry, we're not unique; actually there are only as many tastes as there are genres, and that's it. And that's just a mathematical fact. It's kind of a weird fact, but it's true, right? So in the end, we just need as many people as there are genres, and we need those people to be typical enough. They can't be strange people; they have to be average people in some sense. That's what the tilde is about. That's why the number is not exactly the rank: sometimes you need a few more people, because a few of them might be strange. So you need a few more people to wash out the strange ones. We'll see what I mean. But once you have enough people and they're normal enough, you can reconstruct all the other people. That's the idea. So this sounds funny, but it's actually true. And these ideas, like the interpolative decomposition, actually come from machine learning, from things like recommender systems, as the jargon goes. Ideas like the Netflix Prize, where they asked: could you come up with better algorithms to predict what ratings people would give to movies they haven't seen yet? This comes from that field. So how is this mathematically possible? It seems counterintuitive. So let's see why. This is not a rigorous argument, just a rough one: if the matrix has an approximate rank of r, then by the SVD we can decompose it as U times S times V, where this alpha runs from one to r. What I mean by approximate rank is that the singular values may not strictly go to zero. They may just get very small, and we have some error threshold, and we say: below this threshold we will treat these as zero, and we're gonna cut the rank at a certain point. Okay, we call that the approximate SVD rank. And the SVD finds the optimal rank.
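Written out, the decomposition being used here is just the truncated SVD from the slide:

\[
M_{ij} \;\approx\; \sum_{\alpha=1}^{r} U_{i\alpha}\, s_{\alpha}\, V_{j\alpha},
\]

and fixing the column index \(j\) gives

\[
M_{:,j} \;\approx\; \sum_{\alpha=1}^{r} c_{\alpha}\, U_{:,\alpha},
\qquad
c_{\alpha} \equiv s_{\alpha} V_{j\alpha},
\]

so every column of \(M\) is some linear combination of the same \(r\) columns of \(U\). That's the statement the next argument unpacks.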
So you have this error in mind, you do the SVD, and that's the best possible rank-r approximation for that matrix in several norms, meaning that's the smallest error you can achieve in any basis to reconstruct the matrix. So from that setting, you can prove some simple propositions. You can say that each column of the original matrix is made, up to that threshold you're working at, of exactly r columns of the U matrix. So there's U, and then what you do is you basically say: if I want a certain column, column j, you fix the j on the V to a fixed value, you just mush the S over into V, and you call that coefficient c alpha. And then what it's saying is that I can now run down the rows in the jth column, and I can reconstruct that column by summing over all these coefficients times these columns of U. So the point is that U has r columns, and there's some linear combination of them that gives each of the columns of the original matrix. That's all I'm saying. So then the argument is that if we go back to the original matrix, flipping this logic around, every column of the original matrix contains information about U, if we flip this equation around, unless one of the c's is accidentally zero or too small or something. But that's what I meant about a normal person. For a normal person, all these c's would not be zero. Somehow they've seen enough movies, or they have broad taste and they like all kinds of movies, so for them the c's are not all zero. But in practice, some of the columns we pick out might be deficient, i.e. missing certain singular vectors. So to reconstruct the whole matrix, we might need a few more than r, a few more than the true rank. So if the true rank is r, for this interpolative thing to work we might need r plus p, where p is called the oversampling parameter. So you say: I believe there are really 10 genres, so I'm gonna go pick 15 people or 20 people, and I'm gonna use them as these typical people, and then I can reconstruct the whole matrix. So this interpolative decomposition is not as tight as the SVD. You don't get away with matrices quite as small, but it's close to as good as the SVD, up to this problem that some of the columns might be a bit deficient. So that's the background. Now, for the rest of the talk, and this part may be a little unsatisfying to you, I'm not gonna go through how you actually compute the ID, the interpolative decomposition. Let's just assume we have a routine that does it. So basically you can think of this as like a LAPACK routine: there's a package that will go through a matrix, do some SVD analysis and things, and then go back to your original basis and say, based on the matrix you gave me, I've determined that the best columns to take are columns one, three and four, and then I'm gonna build this other matrix that will reconstruct your original one. So any questions about that? Hopefully that's okay: we just have a routine that'll do that. Okay, great.
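Since we're treating the ID as a black box, here is a minimal sketch of how such a routine can be built from column-pivoted QR, which is one standard construction. The name `interp_decomp` and its interface are mine, not any particular package's API:

```julia
using LinearAlgebra

# Minimal interpolative decomposition sketch: given M and a target rank k,
# return the chosen column indices, those columns C, and the interpolation
# matrix Z, so that M ≈ C * Z.
function interp_decomp(M::AbstractMatrix, k::Int)
    F = qr(M, ColumnNorm())                  # M[:, F.p] = F.Q * F.R, best columns first
    T = F.R[1:k, 1:k] \ F.R[1:k, k+1:end]    # coefficients for the remaining columns
    Z = zeros(float(eltype(M)), k, size(M, 2))
    Z[:, F.p] = [I(k) T]                     # scatter back to the original column order
    cols = F.p[1:k]                          # which original columns were kept
    return cols, M[:, cols], Z
end
```

For the movie matrix, `cols` would be the indices of the representative people, and `Z` is the interpolation matrix with the embedded zeros and ones.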
Now, why is the algorithm I'm gonna talk about called tensor cross interpolation, when I just talked about the interpolative decomposition? So this is just one slide to say why people talk about "cross". The reason is that in the traditional formulation of this algorithm, and in all the papers I'm gonna reference, they like this double-sided interpolative decomposition. So the one they like, and it is very cool actually, is where these are some original columns of the matrix. This is a very nice figure from this Feynman diagrams paper. See these columns that are highlighted in this different color? It's like column maybe six and eight, and then this next-to-last column here. Those are original columns, and then there are also special rows, these rows marked in these special colors here, here and here. And then you have these things called pivots, which are where the chosen columns and rows cross. You take those elements, you fill up a small matrix with them, you invert it, and you put it in the middle. This is called the cross approximation. So how was this algorithm invented? My understanding is basically that Ivan Oseledets and his collaborators knew this algorithm from linear algebra, which I think was only discovered in the early 2000s, if I'm not mistaken. It's not a very old algorithm. And they knew tensor trains, because they had rediscovered, or discovered, tensor trains, and they thought there has to be some way to mix these together, and that's how they came up with this idea. So this is a two-sided, or two-way, interpolative decomposition called the cross approximation. It's a very elegant idea. I'm just gonna stick with the one-sided interpolative decomposition because I like that a little better. If you read all the papers, they use the cross one because it's more flexible, but I just find it a little more complicated to talk about. So I'm gonna stick with this interpolative form.
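In formulas, with \(\mathcal{I}\) the set of chosen rows and \(\mathcal{J}\) the chosen columns, the cross approximation reads

\[
M \;\approx\; M[:,\,\mathcal{J}]\;\bigl(M[\mathcal{I},\mathcal{J}]\bigr)^{-1}\;M[\mathcal{I},\,:],
\]

where \(M[\mathcal{I},\mathcal{J}]\) is the small pivot matrix built from the crossing points, and it sits, inverted, in the middle.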
So now I'm gonna introduce the algorithm in kind of two passes. The first pass is not really gonna be an algorithm yet. It's just gonna be a funny form you can put a tensor into. And then, once we have this strange form of a tensor, this form of an MPS, then as we do the backwards pass it turns into a machine learning algorithm. So in the first pass it's just a way of chopping up a tensor in a funny way. Then when you go backwards it starts doing machine learning sort of almost automatically, which I find kind of remarkable. So tensor cross interpolation is a high-dimensional lifting of these two ideas into a tensor. You basically say: what if we had a tensor and we matricized it? So we treat this index as the row index, and all these other indices, these N minus one indices collectively, as this enormous column space. In principle the ID, the interpolative decomposition, will exist for this. And what it'll say is that out of this exponentially big column space, we can pick just a few of those columns and reconstruct the whole tensor. Then this first matrix will just be those columns. It'll be like scanning down the rows of those columns, and which columns we picked is labeled by this bond going through. And all that other stuff, all those other purple tensors, they're just these special entries that reconstruct the rest of the columns, that interpolate through all of them. And to make this efficient, we have to do it through this matrix product state, or tensor train, formalism. And the algorithm for this part only requires two things. One is the idea of matricizing a tensor, which, say if you're using Python or Julia, is just where you say reshape. You just say: here's this array, please interpret it as a matrix with these indices on the left and these indices on the right. It's just looking at the data a different way; there's not really much going on there, just reorganizing your data. And the other is applying the interpolative decomposition to that matrix. So this would be a good place to pause. Do you have any questions about what I mean by matricizing a tensor, and how you call a matrix routine on a reshaped tensor? If anyone has a question about that, please ask. Maybe the question is how do you do that? Or just, you know, is that allowed? Mm-hmm. Yeah. Oh yeah, so this one is not, there are no details yet. So this slide is just motivation, but it's a good place to ask a question. I agree, and that's what the next few slides will be about: walking you through how we get all those tensors. But the idea would just be: what if we first had a matrix with a small number of rows, like maybe four rows, and then a giant number of columns, say a thousand? That's still just a matrix, and we can still ask for its interpolative decomposition, and we might find that we get, here's four, this might be like two, and then this will still be a thousand. And we could call this matrix C and this one Z. That's a common name I've seen in the math literature, and C means it's original columns of the original matrix. So out of this thousand columns, you pick two, and you say: somehow these two columns out of the thousand are enough to reconstruct all the other 998 columns. And then this Z matrix will be this funny matrix where it's like some numbers, and then somewhere in here we'll have a one, zero, and then more numbers, and then somewhere else we'll have a zero, one. And that's just remembering that the columns we picked were maybe column 18 and the last column, or it could be the other way around. It's just some permutation-like structure plus a bunch of other numbers. I think, if I'm not mistaken, there's some weird theorem in this field where you can even bound these numbers. You can say they're always less than two in magnitude, or something weird like that. So that's the idea. But then the idea is: what if this thousand was really 1024, so really two to the ten? What if that was actually a bunch of smaller indices? This would still work, but could we keep recursively doing this? Could we break it again here and here and here and here? Could we recursively break this thing down? What would we get? How would we do that? That's what these slides will show. But the main ingredient is just that each time we fold the tensor into a big matrix and call this ID factorization on it. So to be really concrete, let's consider a tensor with six indices, each one of dimension two. It just helps with showing the details. You could think of this as being like that Ising partition function, sorry, that Ising probability distribution I showed in the last talk, where you have six Ising spins, each one of dimension two because they're spins, something like that. And these could be these positive numbers inside, for example. So the first step is: let's separate the last index from the rest, and all I mean by separate is just make this into a two-to-the-five by two matrix.
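To make the reshape-then-factorize step concrete, here's a sketch in Julia, reusing the `interp_decomp` sketch from before; the tensor here is a toy stand-in:

```julia
# A 6-index tensor, every index of dimension 2, e.g. 6 Ising spins.
A = rand(2, 2, 2, 2, 2, 2)

# Matricize: treat (s1,...,s5) collectively as the row index and s6 as the
# column index. reshape copies no data; it just reinterprets the same numbers.
M = reshape(A, 2^5, 2)

# Now call the matrix routine on it.
cols, C, Z = interp_decomp(M, 2)   # here we just keep both columns
```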
So it's just a very rectangular matrix. But it's the same data; you just take it out of the tensor, put it into a matrix, and reorganize the numbers into that shape. So nothing has really happened. It's just making a big matrix out of these numbers. And this is not a really efficient algorithm yet; this is more of a motivating algorithm that you think about doing, and later I'll show you the efficient way of doing it. So you compute the interpolative decomposition of that matrix. And in this case, because it's so rectangular, you just keep both columns. You say: there are only two columns, let's just keep both of them. And you store a little array on the bond that tells you which columns you kept: you keep one and you keep two. So it's just saying keep both, one and two. Nothing's really happened; it's just kind of a decomposition. But now what you do is you take this guy, the one that has S1 through S5 and then this new index that you introduced, the index saying which columns. You keep only that tensor. So this last one here, you keep it to the side. You put it aside and store it for later. It's gone out of the picture. Okay, so now you take this one, and you see there's that new index we introduced. Now you fold S5 over to the right with it, and you keep S1 through S4 on the left. We do a different matricization, right? And you see, it's gonna start being a little bit like renormalization in physics, where this is gonna be some latent or internal index. It's gonna carry some information, like a coarse-graining. This is just an analogy, but it's something like some minimal information we keep as we go. And then we bring in a new variable, kind of like when you do an RG, right? So this new variable we're bringing over to participate in the process is variable five now. So now we do the interpolative decomposition of that. And now, out of all four columns we could have gotten, the four columns being (1,1), (1,2), (2,1), (2,2), which label the four different settings of variables S5 and S6, we just keep two of them. So we're gonna keep the (2,1) column and the (2,2) column of this matrix, right? What I mean is: the first number is the setting of S5, and the second number is the setting of S6, but kind of processed back through this new index that we introduced, right? So out of the four, we only keep two. Why two? It could be three, it could be one. It depends on the numbers inside. I'm just giving a made-up example, right? But let's say this matrix has a lot of structure. The ID can exploit that structure and figure out that actually, out of the four possible columns, only two are needed, because there's a low-rank property in the data. That could be because the original tensor was one of those Markov models, and those only have a bond dimension of two. So it'll figure that out automatically, okay? You record the columns that the ID keeps; it's doing this dynamically. It's just doing linear algebra, figuring out the best columns to keep. It can figure out how many; it can do all that for you. Okay, great. Then you take this left tensor. I think this one you keep to the side; that's gonna be the second tensor from the right of the MPS. You take this left tensor, and it has the same kind of structure. It's got the previous variables and this new internal index that's now smaller than it would otherwise be. It's automatically thinning things down for you.
You bring over, and I think there's a little error here, I probably needed an S4. You bring over S4 to the right; there should be an S4 on the left as well. You make this matrix, you do the ID, you split the variables again. Now these indices, these labels, are the ones from before. They're saying: out of those four settings of S5 and S6, we only kept these two. Now we're bringing S4 into the picture. So we could have had eight settings of S4, S5, S6, but we already dropped some of them between S5 and S6, and we're gonna drop even more. So we're only gonna have settings like S4 equals one paired with both of these, and S4 equals two paired with only that one. So out of these four possible columns, we're only gonna keep three this time. I just made up that number three; it depends on the data you have. It's an adaptive method, right? So you keep doing this and doing this, scanning from right to left, saving all those interpolating matrices you got going from right to left. And when you're done, you get a matrix product state. So all we're doing is reshaping matrices, calling ID, reshaping, calling ID, like that. And when you put all these pieces back together, you just look at these purple tensors I made. You see this one has S6 and an internal bond, and that internal bond is exactly the right one that connects here. And here's S5 and the next internal bond. So each of these has three indices every time: the old internal bond, the next site, and then the new internal bond. Three indices, right? So each one of these is a three-index tensor. It's a matrix product state. And then what do these columns mean? Let's unpack what these labels mean. These sets are called column pivots, and they record which columns the interpolative decomposition decided were the important ones at every step. The linear algebra just figures that out. So one of the interesting payoffs of this is that the entries of the first tensor are also entries of the original high-dimensional tensor. It's a slice of the original tensor. You see, there's this huge high-dimensional space, and somehow you've rolled it all up into the first tensor, and these four numbers in here are four actual numbers of the original tensor. And then the rest are all these strange numbers that were figured out by the ID. Some of them are zeros and ones; some are these other small numbers. And somehow those can reconstruct the whole tensor for you. This exponential number of entries can be reconstructed from this very small, polynomial number of parameters. And it's very interpretable. What do I mean by that? You know exactly which entries they are. You have that information. So you can say which four numbers in here are the ones in the original tensor; it's these four. It's these four settings of the outer indices. Those are the four entries that you have. But somehow you can reconstruct all the other entries too. So for example, you can ask: do I have perfect access to the entry (2,2,1,1,2,2)? And the answer is yes, you do. It's an exact equality. So you just look at these sets. You say two, and then I can put either of these pivots on the rest of the outer indices. So the next one could be two, that's okay. And then I can pick from one of these. So I pick one, one. That could be either of these two. Right, one, one.
And then here I have the choice at the end of either putting two, one or two, two. So I choose two, two. So what I'm doing is I'm kind of reading from left to right, and I'm unrolling, following through one of these sets. So basically I picked, let's see, I picked this one here, and then here I picked that one, and then here I picked one, two, two. So that's this one, and then two, two, and then two. So I just have to always pick from those sets on the bonds. And as long as I do that, then if I contract those tensors, and now they're matrices because I've fixed all their external indices, if I contract them together, the number I get will exactly equal this entry of the original tensor. I haven't lost that information. I still have that information. Okay, any questions there? I know that's probably a lot, but that's the idea. So this is this backwards decomposition of a tensor, kind of a recursive decomposition where you keep all this column information. It's also exact for this value and that value. Yeah, mm-hmm, mm-hmm. Oh, sorry, so this two here is that one here. Yeah, so I'm peeling them back, if you like. Right, right. Yeah, there are only five entries here, and so that's these five. It's a good question. It's good to sit with this a minute. Yeah, no, it's confusing. I probably should have made those ones, because having them be twos kind of makes you think of a second index. But here the two just means spin down, yeah, exactly. So here there's five, here there's four, here there's three. Exactly the right number to fill out the rest, exactly. So what those are saying is: these are the settings that we still know will be exact. All the other ones are interpolated, and that's the next slide. You can also ask for the other ones; it's not like they're zero. So you could say: how do I reconstruct the (1,1,1,1,1,1) entry of the original tensor? You fix all these outer indices. Those tensors are now matrices, because I clamped those third indices. I contract all those matrices together along the top, and I get a number. That number will be something; it's just no longer guaranteed to be exactly the original one. It'll be close if this is done correctly. If you did these IDs accurately, it'll be close. It's just gonna be interpolated, meaning it's gonna be made out of all these other columns that you saved along the way. It's gonna be an approximation. And that approximation is the usual one of a matrix product state, like when we say the bond dimension is small and we're making an approximation to the true wave function, this kind of thing, and we didn't choose the largest possible bond dimension. It's that same approximation, but now with an extra layer of interpretability put on top. And it's similar for all the other non-pivot entries. So you can still ask for all the exponentially many settings of the original tensor; it's just gonna interpolate them through the special set of columns that you made along the way. Okay. So this form is very interesting, because it's actually a new type of gauge of an MPS. So, what I mean by gauge is that, for any MPS, you can always take some MPS that someone gave you, maybe you ran DMRG and got a tensor network, and you can put on a bond some matrix M and its inverse. And that doesn't change the original tensor that you're approximating. It changes the numbers inside the MPS.
But the original tensor that it approximates is rigorously the same, right? Because M and M inverse cancel out if you sum over these bond indices. So this interpolative form is just some fancy choice of M and M inverse on every bond that somehow tells you that this is a chain of IDs, of interpolative decompositions. So it gives you this interpretability of which points are exact, which points are interpolated, and these kinds of things, right? I find that already quite interesting. And so I'm calling this the interpolative gauge. I don't know if that's the best name, but I think it's a nice name. And the point is that it has nearly the same bond dimension as the usual SVD-based orthogonal or canonical gauges, but it could be a little bigger, and that's because of that oversampling thing you have to do for the interpolative decomposition to be accurate. But it has much better interpretability. So you're trading off a little bit of bond dimension, maybe a bit larger, for more interpretability: we know how the entries correspond back to the original tensor. But remember, tensor doesn't mean that we actually store the two-to-the-N numbers. Tensor just means the true function that we're trying to approximate. That's really what we mean. So now we're gonna use that picture to turn this into a machine learning algorithm. How do we go from interpolation to machine learning? The way we do it is we now sweep backwards, and as we do, we can do machine learning. So let's say we're in this gauge on site one. Let's say somehow we just made up an approximation for this MPS. Maybe we just started with some kind of bad MPS that we got. Or sometimes what you can do is just choose a single column going backwards, and that corresponds to picking out a single entry of the original tensor, plus a few others to kind of fill out the interpolation. So you can initialize this in various ways that are cheap. I didn't make a slide about how, but there are ways to shortcut the initialization problem. So let's say we start with some kind of guessed or made-up column pivot sets. But they're not really optimal yet. Maybe we just guessed some random columns and said, I don't know, let's guess these to get things started. But we're not really confident these are the best ones. So this is a bad interpolation of the original function or tensor that we're trying to make. How do we improve it? Well, we can actually turn this into a machine learning method in the following way. First of all, wherever I said original tensor, instead of thinking of it as a huge array of numbers, we think of it as a callable function. That array of numbers is really just all the possible outputs of a function. And this is where it turns into machine learning: this function could be literally anything you want. Any function you can code, that's what it could be, right? It just needs to be something that's kind of learnable. It can't be a function that always returns zero and then somehow returns one when you've guessed the right password for someone's computer, right? That's a function that's not learnable, because there's no structure, nothing to give the algorithm a hint of where to look next. It has to have some structure. But as long as it has structure, this can work. And so it could really just be a piece of code.
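For instance, in Julia such a black box could be as small as this; the particular formula is my own made-up example, just to show the interface the algorithm needs:

```julia
# Any code that maps N discrete indices to a number can play the role of
# the "tensor". The algorithm may query it at arbitrary index settings,
# on the fly, as it explores.
blackbox(s...) = cos(sum(s)) * exp(-0.1 * sum(abs2, s))

blackbox(1, 2, 1, 1, 2, 1)   # one entry of the implicit 6-index tensor
```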
It takes these indices, it runs the code, it computes something, I just put a made-up function here, and it returns a number. And you must be able to ask this function for any element this algorithm wants. So it's kind of a demanding algorithm that way. It's a little more demanding than just handing it a pile of data, right? Because with data, you can only ask about the data you have; here, you have to be able to ask about any point in the space. But there are many problems where this is an okay thing to have, in fact a good thing to have. So how do you now do machine learning? You go backwards. You say: let's contract the first two tensors together. And you'll notice that I erased these pivots here, these columns, because now they're kind of covered up and we're actually gonna improve them. We're gonna throw them out. Maybe they were bad, maybe they were just guesses. We're gonna try to find better ones. That's why it's like learning. And now, if everything is going well, this local tensor, which has two of the original indices and this other bond here, should approximate the full original tensor or function on all these entries: every setting of S1 and S2 and any of these j's. And I'm using this kind of splatting notation that might be familiar from some programming languages. Like in Julia, you can take these tuples and splat them into a function. And when I splat it, it means plug S1, S2, and then, for example, 1, 1, 2, 1 into the function. So that's saying: put that back into the original function. That number should be one of the numbers in this tensor. But at first, it won't be true. It'll be a poor approximation, because when I covered up that index, what I did is I summed over it. It was running that interpolation. All those entries in that purple tensor that I covered up, many of those were just these interpolating numbers, and maybe I didn't pick the best columns yet on that backward pass, and maybe those bonds were too small. So really, those interpolations are very approximate, very crude, and these entries might be very far from the real ones of the function. So at first, it'll be a poor approximation. So a very simple thing you can do to start, and this is where the learning starts to happen, is to just fill up the tensor with the correct values. You just run the function over this small number of settings. There'll be four settings here times whatever number you have here, which might be kind of small, like four. So maybe it's just 16 values that you call from the function. You call the function 16 times, you get all those values. It's not that many. Then you redo the interpolative decomposition, and now it's gonna go better, because you actually filled it up with true data from the function again. So you kind of refresh the data, right? But now you do it the other way around. You just transpose the matrix and do the ID, but now columns become rows. So now you're learning which are the best rows to keep. So now you have a row pivot set. You have this I1, and in this case, since we're at the edge of the system, maybe we just keep both rows. But later, as we go further in, we'll only keep a subset of the possible rows. So now your MPS has moved over. You're in this interesting state where these are the columns you kept going from the right and compressing, and these are the rows you kept coming from the left.
And it's telling you information like: if I put a one here, and then a one here, and then I pick one of these that has a one, and then I keep filling out to the right, it tells you which entries are exact and which are interpolated. So you can keep interpreting things that way. Now you keep going: you merge tensors two and three. And the point is that keeping this row pivot and column pivot information tells you how these elements should relate to the original function. That's the interpretability that you keep. And so you can tell how well you're doing. That's one thing: you get an error estimate. You can look at this tensor that you made and compare it to entries from the function, and some of them will be very close and some will be very far, because you're not done yet. So you can say, first of all, how well am I doing? You can calculate all those errors. And what's nice is you can actually work in what's called the infinity norm in this algorithm: you can actually ask, what's the maximum error that I'm encountering? So that's one thing you can do. And then you can fill this tensor up. You can overwrite it with actual calls to the original function. And then you re-decompose it, but now we're moving left to right, so we decompose on rows. So we get this row-interpolating tensor here that interpolates from previous rows. The next variable, I think that should be an S2. Maybe I'll just fix it on the fly right now. Let me just do that. It's an S2. That should be S3. Okay, great. And then we keep going. Great. That should be J3. I missed a bunch here. Great. And so then you do these sweeps back and forth. Each time, you can imagine, you merge two sites. That reminds you of two-site DMRG, if you know that algorithm. But then instead of calling an eigensolver, you compare to your function, you fill the block up with new values, and then you decompose, but with this interpolative decomposition, this ID, and you keep going. So it's pretty neat. It's actually very short code in the end. I've been going slowly, showing you all the steps, but it's very short code. It's like: call the function a bunch, call the ID, go on to the next step. That's it. So it's kind of like a function DMRG or something like that. Another thing that I glossed over, but I put a note on this slide, is that there's actually a more efficient way to do this. I put a star here to say: instead of filling up the tensor with all fresh values, you can also just fill up certain rows and columns with fresh values. There's a thing called rook pivoting, where you move around like a rook in chess, and you keep hunting for the place where you have the biggest error. And once you find it, and you're not guaranteed to find it, but you just find some place where you think the error is the biggest, you fill that column and row with new values from the function. This can be a somewhat more efficient way to do it. So there are little variations like that. So you do this back and forth a handful of times, maybe three or four, and it converges very fast in practice. And I'll show you some demos. Why is it active machine learning? Well, the input is just this code, just this function. And each step, there are about d-squared chi-squared calls to the function, which is not a lot, right? Chi-squared is a very low number; compare DMRG, where the complexity is chi-cubed, a whole power of chi more. So it's a very smart algorithm.
It uses information from these pivot sets in the bonds, these row sets and column sets, to know: here are the important places to look at the function in this huge, high-dimensional space. We're just gonna do these low-dimensional slices and then re-decompose based on the new information we gained. So compare d to the N, where d is the size of the outer indices: compared to that exponential number of possible function calls we could have made, this only makes something like N times d-squared times chi-squared calls to the function. So it takes that N from being in the exponent all the way down to the front. From exponential scaling to polynomial scaling, right? So we crack an exponential wall completely, if this succeeds. It's something very powerful. And afterward we have basically the whole function, this huge exponential object with all the exponentially many values it could take, stored in an MPS. All possible outputs of the function are well approximated when this is done. So it's something very powerful, right? We could sum them all up, we could sample from the function. Anything else we know how to do with tensor networks, we can now do on this function, because we've managed to learn it, right? So some very powerful things can be done. So now let's talk about applications of that. But any questions about that idea? I know it's kind of a lot of technicalities, so, yeah. Well, the tensor train decomposition itself is just the idea of this structure. It's just saying we can represent a tensor as a chain of these tensors. So it's just the idea of the MPS format; tensor train is just an alternative name. But then there's also a specific algorithm called TT-SVD. And that's the algorithm where you say: give me a big tensor, and I'll actually do a sequence of SVDs to break it down into this train, right? That's actually an algorithm. So that's the one where you do an SVD here, and I get a U, and then this is like SV, right? And then do another SVD here, and I get U1, U2, and this is like S2, V2, and so on. And then another SVD. And when all the dust settles, you get all your U's and then your last SV here. In the math community, that algorithm is called TT-SVD. And someone asked me one time, what do we call it in the physics community? And as far as I know, we just call it the thing where you do a bunch of SVDs and get an MPS, or something like that. I think we're just not very good at naming things. But so tensor train just means this format, and TT-SVD is this particular way of getting an MPS or a tensor train. Is that helpful? Yeah, and then this TCI is just a different way of getting a tensor train: the cross interpolation way. Oh, well yeah, so how to relate them: the question there would be, what information do you have, and what scaling can you tolerate? TT-SVD scales exponentially with the number of indices, but that's okay if there aren't that many. So it's very useful, for example, if you have a small tensor. Say you're doing a PEPS calculation in 2D; those tensors have five indices on the square lattice. One thing that, for example, Philippe Corboz does in some of his work is he'll take those PEPS tensors and unroll them as MPS in order to do certain operations more efficiently. Then once he's done, he puts the MPS back together and puts it back in.
So that can be useful at a small scale, when you have the whole tensor already. But this tensor cross is for when you don't have the whole tensor, and not only don't have it, but couldn't, because there are way too many indices. What if there are a thousand indices? You could never store the whole tensor, so you could never run that algorithm, and you wouldn't want to, because you'd have to somehow evaluate all the entries anyway. What the tensor cross is doing is like a sparse sampling. It's jumping in and grabbing values at points, and somehow skipping all the steps where you'd need the whole tensor, jumping right to the answer. So it's very powerful in that sense. Yeah, yeah, it's done. Well, okay, great. So you mean in this pass where I'm going from left to right? Or right to left? Yeah, that part I was intentionally a little loose about, because that's the how-do-you-initialize kind of question. How do you get the whole thing started from the beginning? There are ways to do it, but it's just a little technical. Let me see. I worked it out one time, and honestly I don't even remember all the details now, but you can just sit down in a quiet room sometime and work out little formulas for these tensors on the forward pass to initialize the whole thing. And you end up with formulas that are like little ratios. Basically what you do is you think of the original tensor as a giant matrix. You say: what if the original tensor were this enormous matrix, and we split it left and right, okay? And then we look at this matrix, and we could just pick some cross out of it. So maybe we pick row eight and column 16, or something like that. And out of that cross, we can approximate the whole matrix as a big outer product. This will be column 16, this will be row eight. But then we're over-counting, in a sense, because if we just take the product of those, we're not gonna get the right thing. What you have to do is take this entry where they cross, which would be M(8,16), and divide by it. So you say: times one over M(8,16). And this is that cross approximation formula, but with one cross. And then you can think about doing this all the different ways, where I fold off the first index from the rest, the first two from the rest, the first three from the rest. You can actually sit down and work it out. If I just guess one entry, I just say somehow there's a certain entry here that I'm gonna base my initialization on. Say it's a function setting where you already have a pretty good guess of an important value of the function. Maybe it's a curve and you say, I think the maximum is sort of around here. You pick that value, or maybe you just pick one randomly. You can do all these different crosses of all the different unfoldings on paper. It's pretty easy to do. You just write formulas like this down, and then you can fill up all those matrices going back. And in that case, the J's are just single entries: all these pivot sets on the screen have just one element in them. So it's highly technical, but very simple. It's kind of like a little mean-field approximation for the MPS.
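In symbols, that single-cross approximation, with the row-8, column-16 pivot from the example, is

\[
M_{ij} \;\approx\; \frac{M_{i,16}\, M_{8,j}}{M_{8,16}},
\]

the outer product of the chosen column and row, divided by the value where they cross; the initialization just applies this with one pivot at every unfolding.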
It's technical, but it's something you can write in like 10 lines of code, and it'll work for anything. And it's just an initialization, just a cold-start thing. So I didn't have a totally good answer prepared, but that should give you an idea. It's just a little cold-starting trick, and then you run the algorithm, and the algorithm corrects and fixes everything from there. Thanks, yeah. So the simplest thing you can do is just take the S's, the physical indices, to be bigger. You could take them to be like 100, but I mentioned that the scaling might go like the outer dimension squared, so you might not be very happy after a while if you take it much beyond 100. The really clever thing you can do, and this was actually something Steve White did in 1998, I think with Eric Jeckelmann, for bosons, and there's also a paper by José Latorre from 2005 doing this for images, it's been kind of rediscovered, also by Oseledets, who did all this tensor train work; he and Khoromskij rediscovered it, is this idea I've mentioned a little bit. I'm gonna talk about it quite a bit later, but it's to use groups of indices. Because you'll notice that the algorithm scales linearly in the number of indices, right? I think I put that on here too, yes, it's only linear in N. So having more indices is never a problem. The only problem is if they're big. So you just say: if I have an index of size eight, what if I instead have three indices of size two? And it seems like you're cheating, but it's actually quite deep why this works in some cases, because what you're doing when you do that is exposing a kind of hierarchical structure in that space, and I'll unpack why a little bit later. So that really works in many cases; it's very neat. Okay, so let's see how I'm doing on time. I think I'm doing okay. So that was just to give you some appreciation of what's going on under the hood. The nice thing, though, is that someone can code this up and you can just have the code, and I'll show you an open source code that's already been made for this that you can download today. It's actually made by a PhD student of Jan von Delft who's been working on it. So you can just use this today and try it out on many things. It's extremely flexible, right? All you need is a little bit of code that you write, and then you can point this at anything you want. So it's kind of fun to play with. Xavier Waintal, who's one of the people working on that with Marc and Jan and some others, has just been playing around with using this everywhere. He's told me he's taken a Hamiltonian of the kind you see in chemistry, where you have these T_ij and V_ijkl coefficient tensors, and he's too lazy, he says, to write code to turn it into an MPO. So he just takes the legs of the MPO, folds them up so it becomes an MPS with twice as many sites, calls his TCI code, and it figures out this chemistry Hamiltonian in a minute or two, and he says it's as good as all these fancy MPO-making codes. So he's just using it everywhere, for everything, and finding all kinds of stuff. It's kind of fun to see how quickly it works for everything. So, okay, let's talk about some applications. One of the neatest ones, and this is one of the original ones that Khoromskij and Oseledets were trying around 2010, 2012, was loading functions for the purpose of integrating them.
So this is exactly to the point about the multiple indices. We discussed briefly yesterday this idea of using bits to input continuous variables into tensors, and that idea is called quantics tensor train. So I thought I would take a minute, since the lecture is on the long side, to go through that idea again and set up this application of doing continuous integrals with TCI; it's pretty neat. So the idea is this. This is now sort of orthogonal to the cross interpolation; it's a separate idea, but it turns out to be a very nice match for this toolbox of being able to learn a function. This idea is just a certain way of using tensor networks to handle continuous variables. Let's say we want to do the simplest thing: a function of one continuous variable. How do we input that continuous variable into a tensor so we can use these tools, right? Well, we can imagine discretizing on a fine grid, but that's gonna be exponential cost. If we want the spacing to be very small, one over two to the N, that's gonna cost us two to the N memory, because we just store all the points, right? But we can imagine doing that. And then the grid points are labeled by what are called binary fractions. So these b's take the value zero or one, zero or one, zero or one, and they tell me: am I on the first half or the second half, and then am I on the first quarter or the second quarter of that half, and so on and so on. And this is nothing more than the usual system we use for numbers, just in base two, right? And we can shorthand this and call it zero point b1 b2 up to bN. And then we can ask: what are the values that F takes on that grid? And we just put that expression for the grid point into F. This is just a formal thing. And that immediately defines a tensor. It already is, explicitly, a tensor, just for the reason that it's a set of numbers labeled by these integers b1, b2, up to bN. And just to be clear, I'll add a note to say that b sub j equals zero or one, right? That's what I mean by these b's: they take two values. I wanna say that here too, okay. So this tensor is a huge, high-dimensional, exponentially big tensor with all the values of F rolled up inside. It's like a big trunk full of numbers that are the values of F on that grid, okay? So far we haven't gained anything; it's just exponential. So just to get an idea of what this big F is like: if I ask for F of zero, I just set all the indices to zero, and that gives me this first value of F, down here on the interval, right? If I set the first index to one, that's the number one half in base two, so that's F of one half. There's that number, right? If I set the second index to one, that's one quarter in base two, so that's F of one quarter, right? And if I toggle that last index, that's the least significant bit, so it moves me the smallest amount on the grid. It moves me over by one over two to the N, right? So I can toggle that one to take very fine steps. I have another little error on my slide; I should have that last bit set there. Let me fix that too. Okay, you can tell these are newer slides. Okay, great. Cool. So then the key question. That's just a tensorization story; it doesn't buy you anything by itself. It's just trading one exponential for a different exponential.
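Just to make the bookkeeping concrete, here's the relabeling in a few lines of Julia; the function `f` is a toy example of my choosing:

```julia
# Quantics encoding: the grid point is x = 0.b1 b2 ... bN in base 2, so the
# values of f on a 2^N-point grid form an N-index tensor with dimension-2 legs.
f(x) = exp(-x) * cos(10x)                      # any 1D function, illustrative
N = 20                                         # 2^20, about a million grid points
x_of_bits(b) = sum(b[j] / 2.0^j for j in 1:N)  # the binary fraction
F(b...) = f(x_of_bits(b))                      # the "tensor" as a callable

F(1, zeros(Int, N - 1)...)   # = f(1/2): setting b1 = 1 jumps half the interval
```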
But it's kind of neat to think about, because it looks like a many-body exponential, yet it's really a continuum; it's the exponential of a continuum. So the key question is: can we use tensor networks to break this exponential? If we roll all these exponentially many values on this exponentially resolved grid into this tensor, is it a low-rank tensor network? Does it have a small bond dimension? Does it have low entanglement if I put it on a quantum computer? And surprisingly, the answer is yes in many cases. This has only really been appreciated over the last decade, I would say, and even in recent years. It turns out it's actually low entanglement. So if your plan was to put this function on a quantum computer, it would be the kind of state that's classically preparable, or that we could just simulate classically. And you can prove this for all smooth functions in 1D, also for functions with a finite number of cusps or discontinuities (the bond dimension, the rank, is still low), and for any Fourier transform of these. That relates to the recent result that the quantum Fourier transform also has a small bond dimension. So you can rigorously prove that a huge class of functions is actually an MPS in 1D, in fact in the continuum, with exponential precision in the grid spacing and very high accuracy. Some concrete examples: a single cosine has exactly bond dimension two, and you can actually show that an exponential has bond dimension one. It's a neat result; maybe I'll do that one too, because I think we're doing okay on time. So how do you write an exponential as a tensor network? You say: if I have e^x, I can write that as e to the (b1/2 + b2/4 + b3/8 + ...). And because an exponential of a sum is a product of exponentials, I get e^(b1/2) times e^(b2/4), up to e^(bn/2^n). And that is just (e^(1/2))^b1 times (e^(1/4))^b2, up to (e^(1/2^n))^bn. You can see some neat things here. For instance, 1/2^n is almost zero, so that last factor is almost one, and toggling bn barely changes the function at all. The smoothness of the function shows up in the fact that that factor is hardly doing anything, while the first one changes you a lot. And this is manifestly a tensor network of bond dimension one: the b's are just indices running over zero and one, so as I wrote it, it's a product of 2-vectors. There's your tensor network, bond dimension one, okay? And every exponential, no matter what base it has, even complex, works this way: bond dimension one. A cosine is a sum of two exponentials, so it's bond dimension two. So that's that result. And you can kind of see why this works for smooth functions: if I sum up a lot of cosines, I can do Fourier series, I can represent smooth functions, and the bond dimensions just add. But then a neat thing happens. If you sum 20 cosines, you'd think the bond dimension should be 40, but it's actually only 18. There's some kind of feature sharing going on, where you can recompress, because there's a lot of structure in the coefficients.
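Here's a quick numerical check of that bond-dimension-one claim, a sketch of my own rather than code from the talk: each bit contributes an independent 2-vector, and their product reproduces e^x exactly.

```julia
# Each site k contributes the vector (1, e^{1/2^k}), i.e. (e^{1/2^k})^{b_k},
# so the whole "MPS" is a plain product of 2-vectors: bond dimension one.
site_vector(k) = [1.0, exp(1.0 / 2^k)]

exp_from_bits(bits) =
    prod(site_vector(k)[b + 1] for (k, b) in enumerate(bits))

bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
x = sum(b / 2.0^k for (k, b) in enumerate(bits))
@assert exp_from_bits(bits) ≈ exp(x)   # true up to roundoff
```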
And then you start adding sharp cusps, and you'd think things are going to blow up, but it actually goes okay: you add a few cusps and the bond dimension is still low, so it can still represent that. And this trick works in 1D but also in 2D: in 2D you just give each tensor two indices, an x bit and a y bit, alternating. 3D would be three indices, so you can do 3D functions, things like that. And we already saw yesterday the demo where I used TCI, although I hadn't told you what it was yet, to learn those 40 Gaussians. So I can run that one again. This is 40 Gaussians with random locations, widths, and heights, plus a sharp step, and we learn it into a tensor network. So that was this demo, which I need to rerun. The idea is that TCI works in the formalism I just set up, using this bit encoding of a smooth function, and then it does that forward and backward sweeping, calling the function over and over again (which I explicitly coded, just adding up all those Gaussians) at a handful of points, and it can reconstruct the whole function. So that's the idea. It's compiling a little code, so it'll go a bit slow the first time. But there it goes. That was the whole TCI algorithm; it ran that fast. And now it's plotting. There you go: there's the reconstruction. The random function is underneath this black curve, and the points are the points where it called the function on the last pass. We can call it again, faster that time because there's no more code to compile. And there's another one: a different random function, with a sharp step at 0.4, and these are the points it sampled. And it reports, I showed you this yesterday, that it only had to call about 1,000 points to reconstruct the function on a couple hundred thousand, almost 300,000, grid points. So it's a very small percentage used. That's a neat result, and that's just a little homemade TCI code doing it. Okay, now a more applied example: how do we estimate pi? Let's try to use TCI to estimate pi. What I'm trying to show is that you can get creative with these examples, right? That was just some random function; what if we actually want to do a task? So we want to estimate pi. We can look up that if you take the unnormalized Cauchy distribution parameterized by this number c, then as c goes to zero, this integral gets closer and closer to pi. So that's something we can use: we can learn this Cauchy shape and then integrate it. And how do we do the integral? We'll use TCI to learn this continuous function with this bit encoding of x. So all these indices together are x, right? All these indices are x going into this tensor to give you the value f(x). How do we integrate it? This one's kind of neat. You attach these summation vectors. We saw in the last talk that for a discrete probability distribution you sum by attaching the vectors (1, 1). But here we're doing an integral, so you need a dx, an integration measure. And because we're on a grid with spacing 1/2^n, the measure is 1/2^n; that's what dx is. But you can take the 1/2^n and write it as one half times one half times one half, n times, and those halves go into these vectors.
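In formulas, this is just the factorization of the measure (only restating what was said above):

$$
\int_0^1 f(x)\,dx \;\approx\; \frac{1}{2^n} \sum_{b_1,\dots,b_n=0}^{1} F_{b_1 b_2 \cdots b_n}
\;=\; \sum_{b_1,\dots,b_n=0}^{1} F_{b_1 b_2 \cdots b_n} \prod_{k=1}^{n} \frac{1}{2},
$$

so the sum with its measure is implemented by contracting each site of the MPS with the vector $(\tfrac12, \tfrac12)$.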
So each one is like an integration measure at a different scale: this is the finest scale, then coarser, coarser, and coarser. You can think of the integration measure as a multi-scale sum going from fine to coarse. But all it really means is that you attach these vectors and sum. And to do that diagram is just five lines in the ITensor software. All you do is run a loop: grab the first MPS tensor (that's this purple one here), attach the vector (1/2, 1/2) to it, multiply those together downward, save the result, and multiply it into the current thing we're building up. Then attach the next tensor, multiply (1/2, 1/2) onto its index, multiply that in, and so on: you just zip across left to right. That's what those five lines of code do. So once you have the function, it computes this integral in a split second on an exponentially fine grid. The only hard part is getting the function, and you can do that with TCI too. So here's the code. All you have to write is the function; you just write that yourself, in Julia here it's a one-liner, and that's just the bit of code that runs and evaluates the Cauchy distribution at different points. You throw it into this TCI code. I called it monitor TCI because I modified it to give me some statistics afterward, like how many points were evaluated and so on. And then it does the integral: there's that little integrate function I showed you the code for on the previous slide; it calls integrate and gives us the result afterward. So here's the demo. Learn the function, get the output, and let's try different values of c. If we take c to be 10^-3 and run it, it has an error of four times 10^-3 from pi; you see it's wrong there in the second digit. I is the integral we got from learning it, pi is right there, and above that it shows the TCI reconstruction error. At first it makes big errors: as it sweeps around, it misses the function a lot, but as it fills in more values and does these interpolative decompositions, it learns the function better and better, and very soon it learns it essentially perfectly. So now we can lower c to make the Cauchy distribution narrower, and we get an error that turns out to be proportional to c for some reason, which is kind of neat. Now we have the second and third digits right, with an error of about four times 10^-4. We can make it narrower and it keeps working: 10^-5, 10^-6, narrower and narrower. You saw TCI had to work a little harder that time to learn this pretty narrow peak, but we're getting something like six digits of pi now, and we can just keep going down exponentially that way. Yeah. Yeah, that's a good question. I guess it's partly because the integral will add up all the errors; they'll all participate somehow. I think that's basically why. I don't know the exact reason, but that max error is an infinity-norm error, so it really says, on a single point of the function, where did it deviate the most? Whereas in the integral, all the other errors also accumulate and add. Thanks. Yeah, good question. Great.
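By the way, here's roughly what that five-line ITensor integration loop looks like. This is my own reconstruction, not the demo's exact code, and the function name is mine.

```julia
using ITensors, ITensorMPS   # on older ITensors.jl versions, `using ITensors` alone

# `psi` is a quantics MPS for f on the 2^n grid. Contracting each site
# with the vector (1/2, 1/2) both sums over the grid and supplies the
# measure dx = 1/2^n, one factor of 1/2 per scale.
function integrate(psi::MPS)
  s = siteinds(psi)
  R = ITensor(1.0)
  for j in 1:length(psi)
    half = ITensor([0.5, 0.5], s[j])   # measure factor at scale 2^-j
    R = R * (psi[j] * half)            # zip across, left to right
  end
  return scalar(R)
end
```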
So then, okay, my last demo, I think, is a 2D function, right? What if we now try 2D? All that means is that every tensor is going to have two indices attached, an x bit and a y bit: first bit, second bit, third bit, but for both x and y. So now we can do some pretty serious business: 2D Gaussians at random locations, widths, and heights, and no sharp step this time. And that's the kind of thing we're going to get. So let's try it out. Okay, here's my 2D code, TCI2D. I make these random Gaussians in 2D. This is just a little piece of code that takes in an x and y value, loops over this data, which holds my random centers, widths, and heights, and plugs them into a Gaussian: e to the minus (x minus x-center) squared minus (y minus y-center) squared. So this is just classical code running point-wise. And we take that f, and here's a little lambda function that maps the indices into f. It takes the indices, splits them into two groups (that's the 1 to n here and the n+1 to the end), and throws them into a helper function called b_to_c. b_to_c means binary-to-continuous: it's the thing that takes each bit, divides by powers of two, adds it up, and throws the result into f. Then TCI learns the tensor network from that, and we can do some analysis afterward. (I'll put a little sketch of that wrapper right after this demo.) So let's see what happens. Okay, it's running TCI, learning the 2D function. It has to work harder because it's 2D, so I gave it more sweeps; I said, let's really make sure it's learned. There are the bond dimensions afterward. The biggest one, I think, is 58, and then they kind of go down. It's neat to see why they go down in the tail: it's because the function is smooth. Similar to the exponential example, the bond dimension thins out toward the end, because once you're into the smooth part, the last bits barely matter. So that's what happens. And there's the plot: it learned that function. And you can check it by sampling, because you can do perfect sampling of an MPS. I do a bunch of perfect samples, query the MPS value, plug those points back into the original function, and get an error; sampling around, the average error is 10^-6. So it has this function to 10^-6 in 2D (you can do 3D too), and it learned it on the fly. So you can do more; this will be a different random function. Okay, there's a different random function. Now you may wonder what those crosses are. Those are actually its estimate of the global max and min of this function, which has many local minima, right? And it's actually able to figure that out. So how does it figure it out? That's another claimed benefit of TCI. I don't think it's proven yet, but there's a paper, which I'll reference on an upcoming slide, that says TCI can also give you the global max and min. So here's that slide: there's this interesting proposal by Oseledets and collaborators that they call TTOpt. And the claim is that if you watch TCI while it's running, then with some high probability, when you look at the pivots, these points it discovers in the interpolative decomposition as the important rows and columns, one of them will contain the global min and one will contain the global max.
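Here's the promised sketch of that bits-to-(x, y) wrapper. Hedged: `b_to_c` is the helper named in the talk, but its exact signature and the surrounding names are my guesses.

```julia
# "Binary to continuous": divide each bit by a power of two and add up.
b_to_c(bits) = sum(b / 2.0^k for (k, b) in enumerate(bits))

# Split the index list into two groups (1:n for x, n+1:end for y),
# map each group to a coordinate, and call the point-wise function.
function f_of_bits(f, bits)
  n = length(bits) ÷ 2
  x = b_to_c(bits[1:n])
  y = b_to_c(bits[n+1:end])
  return f(x, y)
end

# Example: one 2D Gaussian evaluated through the bit encoding.
g(x, y) = exp(-((x - 0.3)^2 + (y - 0.7)^2) / 0.01)
f_of_bits(g, [0, 1, 0, 1, 1, 0, 1, 1])   # 4 bits for x, 4 bits for y
```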
Now, I'm not totally convinced that claim about the pivots is always true. In fact, I think I can construct some counterexamples, like a very flat region, that kind of thing, but in practice it works a lot of the time. And they have a second paper. I think the referees were, correctly, a bit hard on them on the first paper (the reports are online, and I've looked at them; they were saying, you didn't prove it, and, I mean, fair enough). So they have another paper. I don't know if they managed to prove it there, but they propose a different strategy, based on the perfect sampling idea. You can do a very neat, non-probabilistic, deterministic sampling. This is in the second paper: you run the perfect sampling algorithm I sketched earlier, but instead of drawing a random sample, you build a tree of all the possible samples and prune away the branches with low probability weight. So you keep a kind of fixed width, and you try to home in on the global max and min that way. I mean, this certainly won't work every time, but for a nice enough function it can work with very high probability. So this idea is called TTOpt: you machine-learn the function and home in on the global max and min, even if the function has a very rough landscape. Because this function here has many local maxima and minima, and I did not use any gradients to find the global ones; I found them through these strategies. (I'll put a little code sketch of this pruning idea below, after this application tour.) And so there was kind of an explosion of papers in 2022 in the applied math community using this idea. I thought this one was particularly fun. It's using this TTOpt idea. This is a clipping from the paper, where these robotics people are actually looking at cross approximations and interpolative decompositions. That's actually how they draw an MPS; somehow in that field they still don't use the Penrose diagrams very often, I don't know why. They like to draw tensors as cubes and rectangles, which I find very difficult to draw. But yeah, I guess it's more explicit, and I'm just picking on them a little bit. But then here's their simulation of the robot arm, and you can see the different trajectories they're studying. I don't know all the details. But then this is an actual photo, a time lapse, of the real robot arm in the lab being controlled by a matrix product state. So at this point, these technologies have left quantum mechanics very far behind. This is not a wave function at all; it's applied math that looks like quantum mechanics, and it's controlling a classical robot arm. It's just to make the point that all this knowledge a lot of you have is starting to have very broad applications. And some of you were asking me, does this stuff really work? And I was saying, well, we can't do ImageNet, but there are other, more niche applications of machine learning. Well, this is the thing I was talking about, right? Even though tensor networks maybe don't do as well on images as neural networks, they can do other things that are quite powerful. So there are some pretty neat things. And then the last application I'll go into in any detail is the one that really grabbed the attention of some people in strongly correlated physics: this TCI algorithm was also used by some people at CCQ and in France to learn exponentially many Feynman diagrams.
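And here's the promised sketch of the TTOpt-style tree pruning. Everything in it is my own illustration: `mass(prefix)` stands for a black box returning the total probability weight of all bitstrings beginning with `prefix` (for an MPS that marginal is cheap to contract), and `width` is how many branches survive at each level.

```julia
# Deterministic "pruned tree" search: expand each kept prefix by one bit,
# rank the children by their total weight, and keep only the heaviest.
function pruned_tree_search(mass, n::Int; width::Int=16)
  beam = [Int[]]                                  # root of the tree: empty prefix
  for _ in 1:n
    children = [vcat(p, b) for p in beam for b in (0, 1)]
    sort!(children; by=mass, rev=true)            # heaviest branches first
    beam = children[1:min(width, end)]            # prune low-weight branches
  end
  return first(beam)   # candidate bitstring for the global maximum
end

# Toy stand-in for `mass`: score a prefix by the function value at the
# point it pins down (a crude proxy for the true branch weight).
proxy(prefix) =
    exp(-50 * (sum(b / 2.0^k for (k, b) in enumerate(prefix); init=0.0) - 0.7)^2)
pruned_tree_search(proxy, 8)   # bits of a point near x = 0.7
```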
And once you've learned those Feynman diagrams, you can sum them very efficiently using the strategies I already mentioned for summing an MPS. So, without going into all the technicalities, the idea is that these are Feynman diagrams for an interacting quantum dot. In the simplest setting there are no spatial variables; think of the single-impurity Anderson model, right? You don't have any space, just a dot, but you do have time variables. The legs of the Feynman diagrams are these times, so diagrams with more legs and loops have more and more t's. And the key innovation in the paper, one of the innovations, was that the bond dimension is only small when you work in time differences. Work in time differences and you get a low bond dimension; use absolute times and it explodes and doesn't work at all. So that was one thing they figured out: time differences. And then the length of the MPS is the order of the Feynman diagram: order four, four tensors; order 30, 30 tensors. And they really were able to go up to something like order 30, which is very high-order perturbation theory, really very far from the perturbative limit, and they get these very high precision results. This example is actually an impurity embedded in a 2D lattice: imagine an infinite 2D lattice of non-interacting electrons with one interacting site. You do the usual trick where you build a Wilson chain from the 2D fermion hybridization function, make a 1D model, and then solve it with this technique, and they're showing how precise the Green's function is. There are the real and imaginary parts in blue and orange. And then this is the same Green's function on the right, but taken out to time 10 and out to 50, showing that there are also these very fine oscillations in the Green's function, and those are captured as well. So they have very high precision results for an interacting dot in this 2D lattice. And just to show off the method some more: this was done by Yuriel Núñez Fernández, Philipp Dumitrescu, Xavier Waintal, and Olivier Parcollet; those are the people involved. The functions they learned were extremely complicated. This shows one of the functions that comes up in the perturbation theory, which you sum to do the final calculations; in this case, I think they were looking at the charge of the dot. They take these intermediate functions and make different cuts. This is an order-seven function, so they make seven different cuts: fixing all the other variables and scanning the first variable, then the second, the third, the fourth. The red points are the exact values, calculated at high cost just for comparison. And the black curves you see threading behind are the TCI reconstructions: at small bond dimension it deviates (like right here, you see a big deviation), and then as you make the bond dimension bigger and bigger, it sits right on top of these very complicated functions, and you can see there are these very tiny oscillations on the right, and they get all of those as well.
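To put that change of variables in a formula (hedged: the exact convention in the paper may differ), the idea is to tensorize the integrand not in the times themselves but in consecutive differences:

$$
(t_1, t_2, \dots, t_n) \;\longrightarrow\; (u_1, u_2, \dots, u_n), \qquad u_k = t_k - t_{k-1}, \quad t_0 \equiv 0,
$$

and it is the integrand expressed in the $u_k$, each quantics-encoded in bits, that turns out to have a small bond dimension.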
Those results are in a regime where, if you tried quantum Monte Carlo, like diagrammatic quantum Monte Carlo, you could have some pretty bad sign-problem issues, but they can just go right through it. So, kind of to wrap up: there are many other applications of TCI happening right now, with papers already coming out. More of these TTOpt papers from applied math; applications like protein docking; optimizing parameterized quantum circuits, which might be something a lot of you would be interested in taking a look at; and this thing called the quadratic unconstrained binary optimization problem, QUBO. In that case I think they were actually criticizing a paper by D-Wave a little bit, which was claiming a kind of quantum advantage, and they said, we can do it with this TCI; that was by Oseledets and collaborators. Also, people are solving PDEs (this is the Navier-Stokes equation) using the format I went through earlier, and they're using TCI sometimes to, say, initialize the PDE, or to learn some of the operators that go into the PDE solver, or to square some functions when you have nonlinearities, things like that. And then another very different use: people are using this algorithm to make orbitals for quantum chemistry. So instead of Gaussians, you can actually use these kind of Slater-type orbitals that have sharp cusps and still do the integrals efficiently. So all kinds of things are happening. So to wrap up: this is tensor cross interpolation, also known as TCI or TT-cross. You get an interpretable form of the MPS, and from this form you get this active machine learning method, where you can learn from kind of a black-box function. Afterward you can do all kinds of things: perfect sampling, integrals, analyzing the function. Try it yourself: here's this code that's already available by Marc Ritter, and he also has an overlay package called QuanticsTCI. Quantics means that trick with the bits, where you do the continuous variables. So you can try it out right now; it's really easy to install Julia and then these codes. You basically install Julia following the instructions on the Julia website, and then in the Julia console you go to package mode and say add TensorCrossInterpolation. It downloads all the dependencies and installs in a few minutes, and you'll have it. You can follow the examples and everything, and I'll share these slides with you if you want them. (The install steps are sketched just below.) And so, just one slide to think about the future. This algorithm explores the function and exploits the information. So could it have a connection to reinforcement learning? That's one question I think could be interesting to follow up on. Another question, which I don't know if I put on this slide, is how this algorithm relates to the previous one I showed. They look quite different in the details: one starts with data, this one starts with a function. Can we mix them together? So one idea I had, something I want to try: what if, when TCI says, I want this value, we say: sorry, you can't have that value, we don't have it; I'm going to give you this other value in this other place instead, and you're just going to have to live with it. Could we alter TCI to do that? I think we could. I think we could just say, nope, use this value instead, sorry, and force it to use columns that happen to be ones where we have the data instead of the ideal best columns.
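To make the "try it yourself" step concrete, here's roughly what it looks like. The package names are the ones given in the talk; the commands are the standard Julia package-manager workflow.

```julia
# In the Julia REPL, press ] to enter package mode, then:
#
#   pkg> add TensorCrossInterpolation
#   pkg> add QuanticsTCI
#
# Back in normal mode, load them:
using TensorCrossInterpolation
using QuanticsTCI
```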
Probably we could do that, force the columns, and make it work. Also, TCI is probably implicitly sketching the data in a certain way; we could think about what that sketch is and relate it to the other algorithm. So I'd say the applications are really just at the beginning right now. All those applications I showed, robot arms, function optimization, et cetera: I think we're going to see a lot more of them. Just the fact that we can optimize general functions has far-reaching potential. So maybe we could use this to help other machine learning methods. Like, if we want to optimize a neural network, could we optimize it using a tensor network? So maybe instead of comparing these two directly head to head, one is going to support the other. And then finally, these ideas are going out into applied math, or sometimes originating in applied math, and they're coming back into physics in interesting ways. We used to think that a tensor network should be a wave function, but now it's representing high-order terms in a perturbative series to model a quantum dot, and you can go to arbitrarily long times. And people are also doing tensor networks in time. So there are all kinds of crazy things happening that we wouldn't have thought of ten years ago. So that's all I have. Thanks for your attention. Thank you. Thanks a lot. We have time for questions. Yeah, right. No, it definitely can fail. The main example, one I've seen a lot and have discussed with people like Marc and others putting together these codes, is a failure for a very narrow Gaussian. It will get the Gaussian, but then do a very odd thing. I don't understand this one fully; I think I sort of know why it's doing it. It gets the Gaussian quite well at a large scale, but when you zoom in, what it will sometimes do is put a little step there, like in the corner. So when you're trying to be very precise, there's just a step there. And I think I kind of know why. The reason comes back to this bit encoding and these things called dyadic grids. When you fix the leading indices, you're resolving the function in a particular, kind of funny pattern. And I think what's happening is that one of those dyadic grid points lands right there, and somehow it messes up the search. It's doing these high-dimensional slices, and it's somehow doing them wrong. It's not that the code is buggy; I think it's just a weakness of the algorithm so far. But I think these are very fixable things. The fix will probably be something like what DMRG does: there's this idea called subspace expansion, where you take your current solution and add a perturbative correction, and that enriches the matrix before you do the decomposition. I think there might be tricks like that. Also, someone yesterday was asking about this kind of Hamming-distance problem, and that's related to this: maybe we could use better grids that aren't so weird, and that would work better too. Mm-hmm, yes. Sort of, let's see. A more basic thing people are trying is functions that really should be integrated in, say, polar coordinates, but they're just doing it in regular grid coordinates. And that actually can go a little bit bad.
It's not a problem with the TCI part, the loading part; it's more a problem with the representation itself. Afterward, the bond dimension can be a bit big. So they're doing tricks like rounding the function down and kind of smoothing it out so that the grid works okay. But it's still a sort of mismatch problem: you have this circular shape or something, and you're putting a very fine grid on it, but the grid is a square grid. So you get all these pixelization issues that make the bond dimension too big. And it's quite likely there's another way to do it that could bring the bond dimension down. Some people are more pessimistic; they say, oh, this isn't going to work. Other people, like me, say, we'll figure out a way. So right now there's a bit of a debate over whether those problems will be fully solved or will stay a pain point that has to be worked on. So, you showed how to learn Feynman diagrams and integrate over loops, and they found that time is not a good coordinate but differences of times are. That made me think, back to representing wave functions: you know that MPS are not good when states are highly entangled. Okay, but maybe they're highly entangled because one uses the position representation, let's say. So I was wondering, what if one applies a matrix product operator which mixes the coordinates in some intelligent way, maybe trained by a neural network on some specific function? Does it help if we want to study, for instance, dynamics? So it learns the good coordinate transformation once, nonlinear, whatever (in this case it's linear), and then applies it from then on. So it's a combination of neural networks and MPS, which is maybe faster when the bond dimensions are lower. Is it? I think these ideas are certainly on the table. The closest thing I can think of to this time-difference idea is that maybe we should be working with space differences for a wave function. But that would mean we would need a way to work in first quantization, and I think there are ways, but it's a bit of an open direction right now, because traditionally we work in second quantization, where the variables are orbital occupations or something, so you can't really meaningfully subtract them. If we could go back to first quantization, the variables are positions again, and it makes sense to add and subtract them. So that would be the more immediate direction coming out of this. The idea with the neural nets we'd have to discuss, but I think quite possibly; I mean, your idea might also relate back to wanting some kind of reparameterization for which the tensor network works better, and then I think it would be all about the details, like how to write an appropriate cost function to train the neural network. But I think we could: maybe something like entanglement. Yeah, or bond dimension. That would be really interesting. I think these kinds of hybrid ideas could be very nice, because in the past we wouldn't have known how to even really do them. We'd say, well, you have a neural network and you have a tensor network; how do you do DMRG with the neural network there? You can't fit it into DMRG, right? But now, this TCI is much more flexible, because you could say, all I need to do is evaluate the function at a point, and that you can do through a neural network.
So now you can actually bring that kind of thing into the mix with tensor networks much more easily, because you're just doing point-wise evaluations. So new ideas like that can start to happen. In the coding samples you were showing us, there was this maximum error; this was on the training set, right? Oh, well, those, let's see. Training set: it's more like this. Those bonds are labeled by these arrays called pivots, which are the values where the interpolative decomposition is exact. The idea is that you only get to see the true function there. All the other parts of the function are implicitly there through contracting the tensors, but you can't really visualize them, because they live in an exponentially big space. So that error is something like: of the elements we have access to, compare the ones reconstructed by the tensor to the calls to the function, and report the max absolute error. And then you were showing an average over perfect sampling, right? And this is a kind of test set, if you want. Oh yes, that's right, yeah. That one's more like, exactly: if you believe the function is already pretty close, then you can trust that the sampling does good importance sampling. If it were too far off (if the function were zero somewhere it shouldn't be), then the sampling would be bad. But say it's already pretty close; then the sampling is good, and we can just point-wise evaluate. And it is importance sampling: it primarily targets larger values of the function, so it automatically focuses the sampling on those. And then I just take an infinity-norm error: where do I get the biggest deviation? So it's kind of like a test set, but generated at the end, if you want. So it's very robust. Somehow you're able to take the worst possible examples and check them. Yeah, it's robust, I would say, in the sense that if you're already on the kind of downward slope of the algorithm working, if you can plot it and see by eye that all the major features of the function are captured, then it's going to work. But it could fail if it's really failing, like if it's missing an entire peak of the function; then you can't trust that kind of error estimate at all. So it's interesting to think through. Yeah, and it sometimes is genuinely challenging to evaluate whether it worked or not. It's very easy, for example, to feed an NP-hard problem into this. I gave that password example: the function could be that the bit strings are just the characters of a password, and f is just, type the password into the computer and tell me if it worked, you know? That's an f you could code in a couple of lines, but TCI can't find it, because the function is just zero everywhere and one somewhere. So, the very last question: the point I wanted to get to is, what if you have to compute, say, an observable on the data points? Like, if the MPS is approximating some physical function and then you want to compute an expectation value on it, is there a way to compute the error on the observable? That's a good question.
I don't think there's a provably correct way to do it, but I think there's an empirical way, something like running TCI a couple of times, letting the bond dimension grow, and seeing how much the answer changes: an empirical, post hoc kind of check, I would say. Also, again, you're getting the data from somewhere, right? The data coming in, you trust, because you got it from some kind of exact source. So you also have that same error estimate, where you say: point-wise, how did I do against the values I know, and how well do I interpolate in between? So you can do a kind of validation-set type of thing. But what you're saying is a very good point; that could be one of the applications. You could imagine learning, say, correlation-function data out of a Monte Carlo algorithm or something, and just by getting some of the correlations, you can reconstruct all the others. So I think those kinds of physics applications could really take off. Okay, so for the same bond dimension, is the MPS derived from TCI equivalent to the MPS derived by TT-SVD? Because TT-SVD is exact, I mean, if you don't truncate. Are they equivalent, given that under the hood there is a certain sampling here? So no, they're not, but theoretically they're close together. There's that idea of oversampling from the interpolative decomposition, which says that if the rank is R, the number of columns you need is K, which is like R plus P. So there's this idea, but you don't know P a priori. And it actually relates to sketching in a way. When you talk to the experts about this, they'll say that what you really need to do, taking a matrix, is this: in the original basis the columns can be funny; there can be kind of exceptional columns. So you mix them all up, and when you mix them together, you blend those exceptional columns with the more average columns, and now all the columns look alike, because each is a mixture of all the others. Then you can get much closer to the true rank: now you only need, say, R plus two columns, where before you might have needed two times R columns or something like that. So by changing basis you can make the bound tighter to the true rank, but it's never guaranteed to be equal. Only the SVD is provably optimal; the ID is a little off, but close. That's the idea. (I'll put a small sketch of this column-mixing trick below.) But okay, it wasn't mentioned: under the hood, does the ID also use an SVD? It doesn't really use the SVD underneath, actually. I wish I had more time and slides to explain, but it uses something different. There are different ways: you can do a kind of sketched rank-revealing QR, and there's also this really elegant thing called the LDU, the lower-diagonal-upper decomposition, which is really neat and can also uncover the ID. But basically, the answer to your question is that in TCI there are going to be two sources of error. There's the ID part, the decomposition itself; but there's also the part where you're doing a kind of search. You have the previous pivots, some of which might be kind of bad, and they steer you toward where the next ones will be; and since they were bad, they can kind of pollute the next ones.
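Here's the promised sketch of that column-mixing idea, my own illustration of the generic randomized trick rather than the internals of any particular TCI code: blend the rows of A randomly so no column looks exceptional, then let a pivoted (rank-revealing) QR on the small sketch choose which columns of A to keep. The oversampling `p` is the "R plus P" slack.

```julia
using LinearAlgebra

function pick_columns(A::AbstractMatrix, r::Int; p::Int=2)
  m, _ = size(A)
  Ω = randn(r + p, m)          # random mixing: every sketch row blends all of A
  S = Ω * A                    # (r + p) x n sketch of the column space
  F = qr(S, ColumnNorm())      # pivoted QR reveals representative columns
  return F.p[1:r]              # indices of r columns of A to use in the ID
end

# Toy usage: a rank-3 matrix; three well-chosen columns span it.
A = randn(100, 3) * randn(3, 50)
cols = pick_columns(A, 3)
```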
So you can have a bit of this pivot-pollution issue; that's a little of what was going on in that answer I gave earlier. If someone could just force you to put a pivot here, you'd be fine, but the algorithm is trying to discover that, and it keeps missing, you know? That can also make the bond dimension too big, because you keep missing important data, so the function looks non-smooth to you even though it actually is smooth; and if you could just see it as smooth, the bond dimension would come down. So you have these two sources of error that can push the bond dimension up a little bit. But here's a neat last point about that: after TCI is done running, for very low cost you can do a second pass and put the result into the SVD gauge, so you can thin the bonds down. And there's now technology to actually go back and forth between the two gauges; that's coming out in a new paper by those people I mentioned, like Xavier Waintal and Marc Ritter. They're putting together a paper explaining how, given an MPS in the SVD gauge, you can go to the TCI gauge and back, and it's really neat what you can do. Is there a question here? Yeah, thank you. I just wonder, regarding the TCI for Feynman diagrams: something like that happens fairly commonly with spectral functions, where you get these singularities, like in graphene, for example. So how well does it fare representing these singularities? That's a good question, let's see. I don't know the answer immediately, but I would say that if you're dealing with the discretized variables in this bit way I've been talking about, that approach is not too challenged by singularities, especially in a kind of 1D function, say. Thinking of a density of states, and this case is effectively 1D, because the dot is zero-dimensional and you're in time: this encoding is hierarchical and multi-scale, like you see here with these exponential scales going down, so it can zoom in on sharp points very well and deals with them just fine. So I wouldn't worry about that too much in the simplest cases. When you have more dimensions, like multiple dots plus time and other things, and that singularity turns into a whole surface or something, that could be more of a challenge, and that gets back into this kind of pixelation issue. So right now these are not absolute obstacles for these methods, but they are pain points, and I think the question is whether they'll stay pain points or whether we'll come up with clever ways to fix them too. Yeah. Good, so if there are no absolutely urgent questions, I would say thanks a lot again for this nice sequence of lectures.