Okay, let's see, yeah, that's worth it. All right, welcome back everybody. Thanks for all the attention you paid yesterday. Got a lot of good questions about the lecture and some also in the breaks. I think this next one is probably gonna be a little bit on the shorter side, so maybe you'll get a longer coffee break today. Some people told me they wouldn't mind that, so we'll see how it goes. All right, so first of all today, let me just say what we're gonna talk about. I'm gonna try something maybe a bit challenging, especially first thing in the morning, which is to really dive into some of the fine details of algorithms. So I'm gonna try to go slowly, and please slow me down. Any question you have, like if you kind of forgot how tensor diagrams work or you just didn't catch a step or something, please ask; it's your school, so it's really for all of you to understand. What I'm gonna talk about today is two very different algorithms. And I kind of think these are pretty radical ideas in some ways, especially the second one, even though the second one is by now something like a 10-year-old algorithm. I think it was a bit missed and underappreciated by the physics community, so I'll be highlighting it in the second lecture today. That one's called tensor cross interpolation, or TT-cross. The first one I don't think many people know, because it's only been about a year and a half since the paper came out. It's called tensor train recursive sketching. So we'll see how these fit into the theme from yesterday. So what did I talk about yesterday? The basic idea was that tensors and tensor networks could be a broad alternative for machine learning, as a different way of representing high-dimensional functions, and they could bring ideas over from physics.
So the idea is that, straightforwardly, for at least a function of many discrete variables, a tensor can capture any high-dimensional function you want. But that's not buying you very much, because working with a large tensor is very costly. Once you have a lot of indices, it's the same cost as doing an exact diagonalization calculation in physics. So even though that's a true statement, it's not very useful on its face. But there's this concept called tensor networks, which is one way to get a controlled approximation to a large, high-order tensor, which, if you look at the first part of the slide, is like a large function of many variables, right? So tensor networks are really just a function approximator: just as neural networks are function approximators, so are tensor networks, and they can approximate very high-dimensional functions in a controlled way. And what happens if you have continuous data, like say you have images with pixel values, or you have a function that takes a continuous variable? There are different ways to encode that data into tensor networks, and each has its own pretty interesting details. One way is to take each continuous variable, map it to some vector, and then take products of these vectors. There are probably tons more ways; these are just two. The other way is something that I'll revisit in the second talk today and unpack in a lot more detail, which is this thing that you could call amplitude encoding. It also goes by the name quantics tensor train, and other names that might be out there in the literature. But this is really nothing more than the idea of thinking of the indices of the tensor as the digits or the bits of the fractional part of a continuous variable.
So you literally think: if this was in base 10, this might be 0.271282 or something, and you would literally just put 2, 7, 1, 2, 8, 2 on the indices across, which seems really strange, but it actually turns out to be a really brilliant idea. It leads to a multi-scale encoding of data in tensors; it's really interesting. I'll unpack that again in the second lecture. So we'll revisit some of these themes a bit later. And then the last recap from yesterday is that when you put all these ideas together, it's pretty natural to end up with models that resemble what are called kernel machines. And all I mean by kernel machines is that instead of having w·x, where x is your input, and x could be your image or an audio signal or a function or whatever, instead of just straightforwardly combining that with the weights, you lift the data up to higher dimensions to make the model more powerful, and then you hit it with the weights. And then you also use a tensor network to make the weights tractable, so that you don't have too many weights and you can actually affordably optimize. So that's the basic, really lightning recap from yesterday. These are provably, or straightforwardly, very powerful models, in the sense that it's very easy to write down super-general functions that could fit anything. And people have optimized them and gotten, in some cases, either close to or actually state-of-the-art results on some data sets, but there hasn't been enough work, in my opinion, on trying out enough types of data and ideas. And one problem is that if you just try gradient descent, there are a lot of parameters and it's a bit slow to train. So you can do it, but if you try even larger data sets, it just gets a bit slow and it's a little hard. So the question in my mind, always from the beginning, was: are better algorithms possible?
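Just to make the digit idea concrete, here's a tiny sketch (my own toy illustration, not from the talk's slides) of how a continuous x in [0, 1) would be mapped to the binary digits that sit on the tensor indices in the quantics encoding:

```python
def fraction_bits(x, n_bits=6):
    """Binary digits of the fractional part: x ~ sum_k bits[k] / 2**(k+1).
    Each digit becomes one index of the (quantics) tensor."""
    bits = []
    for _ in range(n_bits):
        x *= 2
        b = int(x)        # next binary digit of the fraction
        bits.append(b)
        x -= b
    return bits

# 0.640625 = 0.101001 in binary, so the tensor indices would be 1, 0, 1, 0, 0, 1
print(fraction_bits(0.640625))  # [1, 0, 1, 0, 0, 1]
```

The leading bit tells you which half of the interval x is in, the next bit which quarter, and so on, which is exactly where the multi-scale structure comes from.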
That's one reason why I mentioned DMRG yesterday: to say that even from the very beginning of tensor networks, we've already known that there are alternatives to gradient descent. Gradient descent has the power that you can apply it to almost anything, right? But it's a bit slow. So what about all the other algorithms besides gradient descent? DMRG is really, at its core, more of a Krylov-Lanczos algorithm, which is a very fast-converging, linear-algebra-based algorithm. Can we find other algorithms like that? That's what I wanna try to convince you of today. So we would ideally use something like linear algebra. If you think about linear algebra, it's really quite profound, right? I showed you the SVD in detail yesterday. It just takes a matrix, does a bunch of deterministic steps, and then afterward provably returns the optimal low-rank factorization of that matrix. So it's doing, in a sense, some kind of very nonlinear, very general optimization, right? And there are things like the QR factorization, and I'll talk later about the interpolative decomposition. There's this huge toolbox of tools that solve these very tricky problems to extremely high precision. Can we use those tools? So that's my thesis, in a way. So today I'm gonna talk about an algorithm called tensor train recursive sketching, which I'm still honestly kind of wrapping my head around, even though I was on the first paper about it. Basically, it was an idea that I had tried and kicked around, but I had done it in a really naive way. I even wrote a paper with some friends of mine in New York about a much more naive version of this algorithm. And I was chatting with a person named Yuehaw Khoo, who I'll mention in a minute, who ended up leading the effort on the algorithm that I'll talk about today. And I told him what I was doing, and he basically was like, you're doing it wrong.
And I was like, what do you mean? Look, the math is provably right when you have all the data. He's like, sure, when you have all the data, it's right. But you never have all the data. What are you talking about? So he said, you need these things called sketches. And I was like, what are sketches? And he said, it's this thing that the math community has known about for a long time, and you should use them. So I was really interested in what his idea was, and I'll talk about sketches quite a bit today and what that's about. So what is tensor train recursive sketching, in a nutshell? It's the usual framework of generative modeling, where you imagine you have some true distribution which you don't have immediate access to. What you're given is a bunch of samples, with the promise that these samples really came from that distribution, so with the correct frequency from that distribution. So you have all these samples, and your goal is to take them and somehow produce the matrix product state, or tensor train, approximation to the true distribution. So you'll never really see the whole true distribution, and you couldn't anyway; it's some huge tensor in this sense. We'll just think of the setting where we have discrete variables. So we'll be in this setting for this talk, where we just have discrete variables. And so a tensor is just a function taking in discrete variables: once you set all the legs, out comes a number, right? So any number you want, so this could be any function. And that's what we're gonna try to approximate, but through this matrix product state. And I'm mentioning the name tensor train because that's the name on the algorithm; that's the name the math community prefers for those. Any questions about that background or setting, or where we're going? It's a good question. I didn't really look deeply into how the approximation approaches it.
I'll show later that the errors go like one over root N, with N being the number of samples. So that's gonna be the typical convergence rate. But other than that, I don't really recall the details of how close it gets in a bound. Oh, I see, yes. Oh, thank you, that's a good question. Right, so that's a really neat point. So in this cartoon I actually had a precise meaning in mind, which is that these black and white circles were meant to be basis vectors. That was just in my head; I didn't really explain it, but I was thinking of it like the empty circle would be the basis vector (1, 0), and the filled one would be (0, 1). I didn't mean the drawing to be so precise, but in this case we got lucky and it does have a precise meaning that we could give it, and it would be just that. So when you take a product of these, it's like a delta-function distribution in that sense, right? If you took a product like this, it would be like saying this is the distribution where, yeah, exactly: delta of S1 equals zero, delta of S2 equals one, delta of S3 equals zero, delta of S4 equals one, exactly. So it's like a discrete, multi-dimensional delta function. As a distribution, it would just have a delta somewhere and the rest is zero. And it is a distribution, in the sense that you could sample from it, but you would just always get that one sample. But then what's neat is we could take that one and dot-product it with the true distribution, and that'll pick out the actual P of that sample, if we dot-product it with the full distribution later. Great, thanks. Okay, so this is tensor train recursive sketching. And you can see already it's a little bit technical, you know, from this little cartoon of it, but we'll walk through how this goes.
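Here's a minimal numerical check of that point, using a toy random tensor I'm making up purely for illustration: the product of one-hot basis vectors is a discrete delta distribution, and contracting it with the full tensor picks out exactly that sample's probability.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((2, 2, 2, 2))   # toy "true distribution" over 4 binary variables
P /= P.sum()                   # classical normalization: entries sum to 1

def one_hot(s):
    """Empty circle -> (1, 0), filled circle -> (0, 1)."""
    v = np.zeros(2)
    v[s] = 1.0
    return v

# The sample (0, 1, 0, 1) as a product of basis vectors is a discrete
# delta function; dotting it with P picks out the single entry P[0, 1, 0, 1]
sample = (0, 1, 0, 1)
picked = np.einsum('abcd,a,b,c,d->', P, *(one_hot(s) for s in sample))
assert np.isclose(picked, P[0, 1, 0, 1])
```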
And this is described in these two papers: the original one, and then the follow-up, which is a hierarchical version of it. And I believe there have been some other follow-ups too, but these are the basic two. This was done with some interesting folks, mostly from the University of Chicago: Yuehaw Khoo, Jeremy Hoskins, and YoonHaeng Hur, and then Michael Lindsey, who's now at Berkeley. And these are some really interesting people, because they're really more in the applied math community, but they're very conversant in tensor networks and ideas from physics, classical and quantum physics too. So it's really fun to work with people like that. There are more and more of them out there these days, and I think our field needs their input, because they know a lot of stuff that we don't have in our background as physicists. Okay, so what's the motivation of the algorithm? The motivation, as I said, is that we imagine the ground-truth distribution that the data came from as a large tensor. And like I said, there's no loss of generality there; that's not really saying very much, because you can represent any distribution whatsoever that way. Now, for this first half of today, just to connect with how this paper was written, I'm gonna do something maybe a little bit unfamiliar to some of you, which is to take this tensor and treat it as a classical probability, what I'm calling a one-norm formalism of probability. What I'm contrasting here is the Born rule, the two-norm rule of probability that we know from quantum mechanics. So if you're trying to think of this as a wave function, that's not the formalism that I meant. I'm not gonna square this and get one; I'm just gonna get one by summing up all the numbers. Just usual probability. So what I'm demanding is that all the entries of this tensor are non-negative and that they total to one.
So unlike the usual norm we would have if this was a wave function, where we would square it, the tensor only appears once. So how do you actually normalize it? What you do is you put these summing vectors onto the legs; I'll use these a lot in this talk and the next one. You put these summing vectors on all the legs, and then once you contract that network, you get the sum of all the values inside, and that should be one, right? Any questions about that part? Okay, great. You can see already that there's something kind of nicer about the Born rule of probability, which is that things enter more symmetrically. This rule of normalization is very asymmetric-looking, right? You have your network on the bottom, and then on top you have this funny product network of ones. So it sort of speaks to why the squaring is nicer in some ways. But this is the traditional way, right? Okay. Now say that somebody out in the world has the true distribution, and they're behind some screen and able to generate samples from it. So they generate all our samples, and we're allowed to see the same sample more than once; that's okay. We're not gonna be told the P's up front. We're just gonna see the samples, and if some are more likely, we'll see those more often, and the rare ones we'll see just once or not at all, right? So now we can define something called the empirical distribution, and the empirical distribution would just be a sum over all the samples we have, okay? So if we take more and more samples and just keep summing, eventually the ones that repeat will add together, and, just to imagine a histogram, we'll get a bar chart where some of the bars pile up higher, some we see less often so the bar will be shorter, some we'll see just once, some not at all.
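As a sanity check on that normalization rule, here's the same contraction written out numerically (again just a toy random tensor of my own): putting an all-ones "summing vector" on every leg sums all the entries of the tensor.

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((2, 2, 2, 2))
P /= P.sum()          # enforce nonnegative entries totaling one

ones = np.ones(2)     # the summing vector drawn on each leg
total = np.einsum('abcd,a,b,c,d->', P, ones, ones, ones, ones)
assert abs(total - 1.0) < 1e-12   # contracting ones onto every leg gives 1
```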
So it'll be this very peaky kind of histogram thing that we'll get, and eventually, if we summed millions and millions of these and just kept sampling, we'd get a very boxy approximation to the true distribution. But that's not gonna be a very good way to do it. I'm just drawing this in two different ways: here's where we actually see the indices that we sampled, and here's a more diagrammatic way of drawing the same thing using that notation. Okay, great. So, say we folded the true distribution down to low dimensions; really we're gonna learn something very high-dimensional, but let's just say we could see the distribution in low dimensions. Then the true distribution might be something smooth and nice, like this green curve back here, but the empirical distribution, at least for a low or medium number of samples, will be pretty terrible-looking; it'll just be this very spiky thing, right? Now eventually, like I said, if we just kept getting millions and millions of samples and normalizing back to one each time, we'd get this very boxy approximation, right? So eventually we'll get this peaky, boxy thing, maybe with some missing stuff, and it'll eventually start to match, but it'll be pretty terrible; it'll have all these steps and things, and it won't be such a good way to approximate. It'll eventually reconstruct P, but it won't be very efficient with data. So the sum would work a lot better if we actually did something smart. Say we have some knowledge about the distribution, right? You can see it's smooth in this case. I mean, this is just an example, right? But let's say we knew, or believed, it's smooth. One thing we could do is apply some kind of low-pass filter over the data, or some broadening, so that we broaden all these delta functions into small Gaussians.
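A minimal sketch of that broadening step, in the spirit of kernel density estimation (the function name, grid, and width here are my own choices for illustration):

```python
import numpy as np

def broadened_density(samples, grid, width):
    """Replace each delta-function sample with a small Gaussian of the
    given width, sum them up, and renormalize on the grid."""
    bumps = np.exp(-0.5 * ((grid[None, :] - samples[:, None]) / width) ** 2)
    dens = bumps.sum(axis=0)
    dx = grid[1] - grid[0]
    return dens / (dens.sum() * dx)   # total probability back to one

rng = np.random.default_rng(2)
samples = rng.normal(0.0, 1.0, size=200)    # stand-in for data drawn from P
grid = np.linspace(-4.0, 4.0, 400)
smooth = broadened_density(samples, grid, width=0.3)
```

With 200 samples the spiky empirical histogram is still terrible, but the broadened `smooth` curve is already a reasonable, smooth stand-in for the true density.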
Now, if we add these Gaussians together, this is gonna fit the approximate distribution back a lot faster, right? Because all these tails will mix together, and in regions where we didn't get any data, the tails will reach into those regions and still give a smooth shape there. So even though we'll have some places where it misses (red is just the picture in my head of what you would get if you added up all these Gaussians that I drew in black), we'll still have some errors, but it's gonna converge a lot better, and you'll have much more regular behavior in the approximate distribution you get from adding up all these broadened peaks. So we can choose this broadening to reflect our belief about how much data we have and about what the true distribution is actually like. Okay, so any questions about that, or any objections or anything? Yeah, mm-hmm, we'll get both of you, yeah, sure. Yeah, okay, just what was going on with that picture, you mean? You mean why is that the right thing to do? Okay, yeah, it's a good thing to unpack for sure. So all that's going on there is, let's start with your distribution being a single variable, right? So we could have P of S1. That could just be some arbitrary P0 and P1. So then you say, how do I add them up? Because this is just classical, regular probability. So we just say that the total, Z, is P0 plus P1. And then we can see that in this case, it's just the same as if we dotted (P0, P1) with (1, 1). It just works, right? That will just be P0 plus P1. So then if we have two variables, there's S1, there's S2, and this will be a matrix, okay? And then you can see, basically by inspection, that if we dot with these vectors, where these are both (1, 1), basically if we just run the sum, so if we just say S1, S2, if we put those indices back here, or just write them explicitly, then for every setting of S1 and S2, we'll reach in and grab one of those P's.
And then all they will do is just kind of hang out and say, okay, that term is allowed. If we put a zero here, it would actually say that the term where this index equals two is zeroed out. So that's what's going on: these in a way are just giving permission for the sum to run over all the terms. But then what's nice is that we can take these away again, and then that's the tensor diagram, because the lines connecting means that sum already. So that's a good morning brain exercise of, how do tensor diagrams work? Just the fact that they're touching means there's a sum already in that right-hand figure. Okay, thanks. Yeah, great. And you had a question too. Mm-hmm, mm-hmm. Yeah, it's a good question. This might be a little past my own expertise, whether people try that kind of thing. What I was mostly thinking about on this slide was this popular, well, popular at least in the old school, technique called radial basis functions from kernel learning. Have you heard of that one? It's nothing more than attaching a Gaussian to your points, basically; using a kind of Gaussian distance between points when you do this kernel, so to speak, which I haven't really explained, but it's basically the idea that I'm showing on the slide. And as far as I know, I think you basically just do something like, and this may be wrong, maybe someone watching this video who's a real kernel expert might cringe, but I think it's basically nothing more than: you try different widths for the amount of data you have, making them narrower and narrower, and you have a validation set of data that you hold out, you know. And then you see, if I sampled, would I generate the validation set, or would the probabilities of those points be reconstructed well? And once you make it too narrow, it's gonna start to fail dramatically.
So you can just retrain the model a few times, keep making it narrower, see where it finally fails, and use that; I think that would be a good way to estimate. And then maybe the same way also to see if it's too broad. So the idea, and I think this is basically the idea I'm trying to get across, is that in a sense this is the right thing to do, but it's too idealistic, you know. It would work if you had all the data, but you don't have all the data. So you have to add in something that's wrong, but wrong in exactly the right way. It's wrong because it's filling in for the fact that you don't have all the data, and you kind of never will. So that was, in a way, the thing that Yuehaw was telling me: he said the thing I was trying would work if you had all the data, but you don't. So you have to do something else, right? Intuitively, it's like squinting at the data. If you step back and squint at it and blur it, it'll look like all the data, even though you don't have all the data. I'm gonna repeat that idea a lot during the talk. And this brings in a nice, conventional idea from the field of machine learning, which is what's called the bias-variance trade-off. What you're trying to do is balance two things. So you're saying that there's the true distribution P, okay? And you can write down the error that you make with your approximation, A of P hat. What does that mean? We start with the empirical distribution; remember, P hat stood for the sum of those delta functions, the very spiky thing. It's the actual raw data that you have. A of P hat means the thing you do with the data: you process the data down, you try to fit a model to it, you maybe broaden it, all that stuff, okay? And then you wanna make this small, of course. You wanna get back to the true distribution, right?
So you can write this as two terms, because all we're doing is adding and subtracting A of P: so here's P minus A of P hat, and here's A of P appearing twice with two different signs. And we can imagine two things. One is the bias, and bias is just defined as the difference P minus A of P. And you can see what this difference is saying: if we took the full distribution, the whole thing, not the data, and fed it into our approximation, how much error would our approximation induce? So this would be like taking the original tensor and squeezing it into an MPS, putting those bonds in, or working with some finite bond dimension; trying to fit it into our architecture. Or, for a neural network, this is like the approximation you're making by choosing a certain number of layers. You say, I'm gonna have a three-layer neural network, but how do you know three layers is enough, right? Maybe the true distribution needed four layers, something like that. So this is your bias, your architecture error, this kind of thing. This is the error you would have if you had all the data and tried to feed it into your machine, your training process. But then, very importantly, there's this other term, the variance, and this is the part about never having all the data, right? It's called variance because you go and get a training set and you train, but someone else might do the same thing with a different training set, right? They might go out into the world and collect a different set of samples. Or you might sub-sample: you might take your training set and break it into subsets and train on those separately, to get some idea of how different the results are across different parts of the training set. These fluctuations are called the variance. So this is the idea, again, of: how different is your actual data (this P hat is your actual data), how different an approximation will you get, versus having all the data?
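Written out, the splitting just described is the simple add-and-subtract identity (using A for the fitting procedure, P for the true distribution, and P hat for the empirical one):

```latex
P - A(\hat P)
  \;=\;
  \underbrace{\bigl(P - A(P)\bigr)}_{\text{bias (architecture error)}}
  \;+\;
  \underbrace{\bigl(A(P) - A(\hat P)\bigr)}_{\text{variance (finite-data error)}}
```

so by the triangle inequality the total error is bounded by the bias plus the variance, and broadening trades a little extra bias for a big reduction in variance.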
And so this broadening that I mentioned, taking those peaks and broadening them: what you're trying to do is push down this variance, but you're paying a price in the bias, right? You're doing something kind of ad hoc that, if you had all the data, would actually mess up your result a little bit. But it's good to do, because even though it messes up the bias part a little, it greatly reduces the variance part, so overall your total error will be better. So this was the thing Yuehaw was trying to tell me: I was focusing on the limit of having all the data, basically the bias part, but the variance part was enormous, and it was gonna kill me way before the bias was ever gonna be a concern. And so what you really wanna optimize is not the distance to your data. What I was gonna try to do, without broadening, was gonna memorize the data, right? You don't wanna memorize the data. You wanna reduce the distance of your approximation to the actual true distribution. You're not trying to get to your data; you're trying to get to the actual ground truth. And that's what splitting these two parts in your mind helps with. So the idea is that if we get this balance right, we can learn with fewer samples. And the bias is okay, because what we're doing is putting in some prior that we have about the distribution, some kind of belief that comes with the data. And so we're gonna put in this bias by what's called sketching the data. And we're gonna sketch it with tensors, just to stay within this tensor network framework, so it'll be more compatible with the kind of model we're trying to learn. Okay, so we're gonna have two things: a left sketch and a right sketch. And we're gonna have a small window where we don't sketch the data.
And I'll go through what this is all about. We're gonna take a very hard look at the data in this window, but a very blurry look out in these wings, away from the window. And we're gonna slide the window across, then process these tensors down, and we'll end up with the matrix product state at the end. That's the rough idea of how this works. So I'll go through the steps. We'll have high resolution near a certain variable and lower resolution further away. So we'll define these tensors called phi, which are these ones. Phi four is the one that has high resolution near variable number four, site four; phi three would be centered on S3, this kind of thing. And we'll come back to what the sketches could actually be; for now, we'll assume we just got them from somewhere. Someone handed us these blue square tensors, right? To me, they even look a little bit like a quantum circuit, so that could be some inspiration for where they could come from, although they're missing the legs on the top, or you could think of those as being post-selected to something. I'll give a concrete example of a sketch later, for something called a Markov model, and I'll go through what a Markov model is and all of that later. So for now, you could even think of these as literally random unitaries where you then fix one of the legs. That could actually be a sketch to try, in the spirit of randomized SVD, if you know those ideas from randomized linear algebra. So you could even just think of these blue tensors as being random features, say. And then this leg sticking out would be saying, maybe I make 10 of these random features, and then this leg would have size 10 up here, and it's saying which of these 10 am I applying to the data, something like that, right?
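To make the "random features" reading of a sketch concrete, here's a little toy of my own (all the names and sizes are invented for illustration): compress the configuration of the sites to the left of the window down to a handful of random features, instead of keeping every possible configuration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: N samples of 6 binary variables
N, d = 1000, 2
samples = rng.integers(0, d, size=(N, 6))

# A "left sketch" for site 4: instead of tracking all 2**3 = 8 possible
# configurations of sites 1-3, project them onto 10 random features
n_features = 10
W = rng.normal(size=(n_features, d ** 3))   # one (random) choice of sketch

def left_index(s):
    """Flatten the left-of-window bits (s1, s2, s3) into one index 0..7."""
    return s[0] * 4 + s[1] * 2 + s[2]

# Sketched (blurry) representation of each sample's left environment
sketched = W[:, [left_index(s) for s in samples]]   # shape (n_features, N)
assert sketched.shape == (10, 1000)
```

The point is just that the sketch throws away detail far from the window while keeping enough statistics to reconstruct the core tensors there.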
So for now, let's assume we know them, and I'll come back to a concrete example of one later. All right, let's go through how we would use these to actually recover, from some data, a matrix product state approximation of the actual distribution. Okay, so first of all, let's say that the person who generated the data and handed it to us told us, or promised us, that they got the data from a matrix product state. So let's just say we know this thing is gonna work: someone whipped up a matrix product state and sampled a bunch of data from it. And there actually is a very nice algorithm to do the sampling; I wish I had time to explain it. Well, maybe I can explain that one right now, because we already have this stuff on the board. So let me give you a quick idea of how that works. Say someone has this matrix product state and they wanna collect samples. What you can do is compute what's called a marginal. You say, I'm just gonna sum out, let me keep that up there, the marginal is: I'm gonna sum over most of the variables except one. So these dots are the same summing vectors as before; that's that dot. And you can see that that's just the sum over S2, S3, S4 of P of all of them. The thing with all the variables is called the joint distribution, or you could just call it the full distribution. So that's P of S1, and you can see that it's a vector depending on S1. It has to be, because it has one variable, but you can also see from the diagram that all these just contract down and you get a vector here. And then what you can do is take these two numbers, P of zero and P of one, and just flip a weighted coin.
What you do is draw a random number between zero and one on the computer, from a random number generator, and you partition the interval: here's P of zero, and then P of zero plus P of one, which of course has to equal one if everything is set up properly. So you draw your random number; maybe it lands here, and since it landed to the right of P of zero, you say, we'll sample S1 equals one, okay? And then you take that and put it back, and you say P of S2 conditioned on S1 equals one is gonna be this network. So you put a one here, then you open up S2, and you keep S3 and S4 summed out; you keep them marginalized. In quantum mechanics we would trace these, but this is standard probability, so I'm putting these summing vectors instead. This one just means the vector (0, 1), okay? So you just clamp that variable to one. Now this is a function only of S2, conditioned on one. There's also a normalization you have to put in, too: you have to divide by P of S1 equals one to normalize this thing properly, okay? I hopefully got that normalization correct. And then this, as you see, is another vector depending on S2; it's just this conditional P of S2, I'll put a two on it or something. And then you do the same thing here. Maybe here the cut is down here, at this P two of zero, with P two of zero plus P two of one at the top, okay? And then you call your random number generator; maybe it generates a number there, so you say we're gonna get, I'm kind of switching, here I'm just working with zero and one for some reason. Okay, so you pick S2 equals zero, this kind of thing. All right, and then you just keep going from left to right across. And that actually is a perfect sampling method. So you do that left to right: you keep repeating this process where you marginalize out to the right, you condition on the left, and you scan from left to right, and you get a draw from the distribution.
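The left-to-right procedure just described might be sketched like this, in the classical one-norm setting. The MPS here is a random nonnegative toy, and the core shapes and variable names are my own conventions, not the talk's notation:

```python
import numpy as np

rng = np.random.default_rng(4)

# A small nonnegative MPS defining a classical distribution over 4 binary
# variables. Core k has shape (chi_left, 2, chi_right); entries >= 0.
chis = [1, 3, 3, 3, 1]
cores = [rng.random((chis[k], 2, chis[k + 1])) for k in range(4)]

def sample_once():
    """Draw one exact sample by scanning left to right: at each site,
    condition on the choices made so far and marginalize (sum out) the rest."""
    # Right environments: R[k] = summing vectors contracted over sites k..end
    R = [np.ones(1)]
    for core in reversed(cores):
        R.insert(0, np.einsum('asb,s,b->a', core, np.ones(2), R[0]))
    left = np.ones(1)             # left environment, trivial to start
    sample = []
    for k, core in enumerate(cores):
        # Unnormalized conditional P(s_k | earlier choices): left env, core,
        # and the summed-out right environment
        p = np.einsum('a,asb,b->s', left, core, R[k + 1])
        p = p / p.sum()           # normalize the conditional
        s = rng.choice(2, p=p)    # flip the weighted coin
        sample.append(int(s))
        left = np.einsum('a,ab->b', left, core[:, s, :])  # clamp s_k, move right
    return sample

print(sample_once())
```

Each call starts fresh, so successive samples are independent, with no Markov-chain autocorrelation.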
And what's neat about it is you get this draw and when you're done, you just throw away all your scratch, like all the tensors that you had. And when someone asks you for the next sample, you just start the process over fresh from the beginning. So you just start back, again, marginalizing back to this first vector sample back left to right. So it has no auto correlation effects, no Markov chain, nothing. You're just drawing, drawing, drawing from the distribution. Okay, yeah. Let's see, yeah, why don't we? Well, if you flip the four coins at the same time, the problem is that, let's see, what's wrong with that? So, well, because we want to flip weighted coins in a sense, right? So in a way, this part was how do we kind of weight the coin? But there we would need some kind of multi-dimensional weighted coin in a sense. But I guess are you saying just flip four, just call four random numbers, basically? Yeah, I mean, I see what you mean. It's a little hard for me to answer on the fly because it seems intuitively correct. I think it's probably because you're imagining the distribution, the space being low-dimensional. This is like a one-dimensional space so you can do it. But when you think about those four variables, it's not sort of, it's not like kind of an additive thing. It's not like the complexity goes additively. It really goes like, it's multiplicative. So you'd have this super high-dimensional space that you'd have to draw in. So I think you would just have to draw too high-dimensional in the space is maybe the answer. I don't have maybe the best answer why that doesn't work, but this. Exactly. Yeah, you'd somehow have to have like an exponentially big coin sort of space that you're drawing from. And so this is a way of avoiding that. Well, sort of. So I have, let's see, my internet, yes. So I have a web page where because people don't really write papers about this stuff in our field because their referees knock it down. 
So we can't get our stuff into journals about this. So I thought, why don't we have a web page where we can just collect little bits of math and things. And so this is called the Tensor Network. I just put, from time to time, bits and pieces of math about tensor networks on here and kind of collect results from the literature, things that are published but maybe people don't know about, this kind of thing. And it's actually editable by anyone through GitHub. So you just make a PR and submit an edit and I'll pull it in if it's good. And I have a section here called elementary MPS/TT algorithms, and one of them is sampling. But this one is done in the squaring formalism, in the Born rule. And there's a little bit of an issue I've got to fix with these bars, but let's see. So the idea here is that it actually walks you through the squaring, the two-norm version of the sampling algorithm, where what you do is, instead of putting the summing vectors, you actually trace on the square, and then you get the density matrix, which is the analog of these marginals. And then here's the conditioning part, the conditional part, and fixing the norm, and so on and so on. So it kind of goes through those details. So you can check that out. Yeah. Okay, great. So this was first published in a paper actually about finite temperature methods for DMRG. But then I thought it'd be nice to pull it out and put it somewhere by itself. Okay, great. So let's say someone promised us that the true distribution is a matrix product state. The other thing is, let's say that they promised us that it's in this thing called left orthogonal form. And what that means is that, without loss of generality, you can take these tensors to have the following property. This may not be so important for what I'm gonna say, but for people who are paying super close attention to the details, this is just a nice detail.
So if you don't follow this part don't worry too much about it, it's more technical. But the idea is, when I draw one of these yellow boxes and I use the letter U, anything I call U has this property. So it'll be U1 dagger times U1: if I sum over the visible external index I get an identity. So this will just be like the matrix ((1, 0), (0, 1)), that line without any decoration on it. And then for all the other ones, for U2 and U3, it'll have the same property, but there's this extra leg you have to take into account. Again, equal to the identity. And what's going on here? Another way to draw this picture would be like this. So U is an isometric map, or a one-sided unitary. It's saying that U dagger U is the identity, right? And the neat thing about matrix product states is you can always impose that form without loss of generality. If someone gives you a matrix product state and they fill up all the tensors with random numbers, there's actually a simple process you can do to bring it into this form no matter what. So it's kind of a neat thing you can always do with an MPS. So let's just assume that someone's already done that for us and brought it into that form. It'll just make the math a little cleaner. Okay, so now, someone promised us that this is how the true distribution looks. We don't have the true distribution. What we have are all these samples, right? So someone gave us all these samples. We just have them in a big bag, okay? And then the first step will be to take all this data and sketch on the right. So we have these sketching tensors, and I'll go through an example later of what they could be. For now, just think of them as some random tensors. They could literally be random, that's one idea. And then what we're gonna do is we're gonna sum over all the data in the training set, which we can think of as product states or rank-one tensors, meaning products of vectors.
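The "simple process" that brings a randomly filled MPS into left-orthogonal form is just a left-to-right sweep of QR factorizations. Here is a hedged NumPy sketch of that, with invented shapes; it checks both that the represented tensor is unchanged and that every swept site becomes an isometry.

```python
import numpy as np

rng = np.random.default_rng(1)

# Random MPS over 4 binary sites; shapes are (left bond, s, right bond).
N, chi = 4, 3
dims = [1] + [chi] * (N - 1) + [1]
A = [rng.standard_normal((dims[i], 2, dims[i + 1])) for i in range(N)]

def full_tensor(mps):
    """Contract the whole MPS into one small array (for checking only)."""
    T = mps[0]
    for M in mps[1:]:
        T = np.tensordot(T, M, axes=([-1], [0]))
    return T.squeeze(axis=(0, -1))

before = full_tensor(A)

# Left-to-right sweep: QR each site (grouping left bond and physical index
# as rows) and absorb the R factor into the next tensor.
for i in range(N - 1):
    l, d, r = A[i].shape
    Q, Rfac = np.linalg.qr(A[i].reshape(l * d, r))
    k = Q.shape[1]
    A[i] = Q.reshape(l, d, k)
    A[i + 1] = np.tensordot(Rfac, A[i + 1], axes=([1], [0]))

after = full_tensor(A)  # the represented tensor is unchanged

# Every swept site now satisfies sum_s U_s^dagger U_s = identity.
iso_ok = True
for i in range(N - 1):
    l, d, r = A[i].shape
    M = A[i].reshape(l * d, r)
    iso_ok = iso_ok and np.allclose(M.T @ M, np.eye(r))
```

The R factor pushed to the right carries all the non-isometric content, which is why the sweep loses no information.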
So the sum over d here just means: where I've put these gray boxes, put in one of the data elements, contract that network, and then add it in. I think I have an animation of that, yeah. So I take one of my samples, I contract this network down, you see all of this part contracts, contract, contract, contract, down to a vector depending on R2. And then there's this vector for S1. Multiply those vectors together and just add it into this tensor, which starts out as zero, and now I get the first contribution to this phi one. Now I put a different piece of data on the bottom, contract down, I add that into phi one, okay? And I keep doing this, I keep running through all the data, and as I do, I just keep adding them into phi one. So phi one is this matrix that I'm building up, and you can normalize this matrix or whatever, but I add them all up and I get phi one. So you can see this is linear in my data set, which is already a good property. So I just run through all the data once, generate this matrix phi one, okay, great. So now let's imagine under the hood what's happening. So the idea is that on the right I put this gray box to say you don't actually do the part on the right. The part on the right is in your mind. You're actually only doing the thing on the left, right? But this is motivating why the next steps we're gonna do work. So imagine instead of having what's down here, which is the empirical distribution, the sum of these delta functions which we're blurring with these blue boxes, what if we actually put the true distribution there? Then what would our process be doing? What it would be doing is rolling up the entire right of the true distribution, kind of blurring it out, and keeping some information about it captured by this index R2. And we can visualize that as this black tensor here, which is a remainder part that we don't know much about. And that's just that whole right-hand side contracted.
But then the point is we would still in a way have access to U1, the first MPS tensor through this variable S1 that's exposed here, right? So underneath in our mind, if we had all the data or if we have very good data we'd imagine this matrix phi one if we could somehow know how to split it will give us U1 and this other part that we can ignore and kind of throw away. But the point is we could recover U1 out of it by doing some kind of factorization. Okay, does that make sense? So that's just to say that what we'll actually have is phi one, but we could imagine what information is contained in it. So what we'll do is exactly that. We'll just take phi one, we'll treat it as a matrix between S1 and R2, throw it into an SVD routine and we'll just take the U out of the SVD and we'll keep that to the side and we'll save that. And then we'll say we don't need S1 and V1, they're kind of like polluted by this sketch that we just kind of made up. So we'll just throw that away and just keep U1, all right? So in a way we've just recovered the first MPS tensor. There's something already kind of really neat about that. Okay, so we're already getting a glimpse of our true distribution just from some data. So now we're just gonna slide the sketch over by one. So we shift our window, so that was the window centered on variable one. Now we're just gonna shift over and you see this sketch was designed to have this layering property that we could always peel back the layers. So at first this right sketch was this like staircase of tensors. We can always just remove that one, just peel it away and we could expose this next variable which just means contract again but just don't go as far, just stop here, right? And similarly the left sketch is like a staircase. So we could just put the first tensor of the left sketch on the left, right? And then run the same kind of sum that we did before. 
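Here's a toy NumPy illustration of why SVD-ing the sketched matrix phi1 recovers U1. I'm cheating by working directly with a low-rank matricization of the "true distribution" instead of real samples, and all the dimensions are invented, but it shows the key fact: a random right sketch preserves the column space, so the U from the SVD spans the same space as the true U1.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: the s1-matricization of the true distribution is
# low rank, P_mat = U1 @ W, with U1 a 4 x 2 isometry (so the "physical"
# index s1 has dimension 4 and the MPS bond dimension is 2).
d, chi, rest = 4, 2, 8
U1, _ = np.linalg.qr(rng.standard_normal((d, chi)))
W = rng.standard_normal((chi, rest))
P_mat = U1 @ W

# Sketch on the right: compress the big "rest" index with a random matrix.
# A random Gaussian sketch almost surely preserves the rank-2 column space.
S = rng.standard_normal((rest, 3))
phi1 = P_mat @ S          # shape (4, 3): much smaller, same column space

# Recover U1 (up to a rotation on the bond index) from the SVD of phi1.
U, sv, Vt = np.linalg.svd(phi1)
U_rec = U[:, :chi]        # keep the two significant directions

# Column spaces agree iff the orthogonal projectors agree.
proj_true = U1 @ U1.T
proj_rec = U_rec @ U_rec.T
```

`U_rec` is not literally equal to `U1` (SVD fixes a basis only up to the bond rotation that the discarded S and V absorb), which is why comparing projectors is the right check.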
I don't have the animation this time but imagine this gray box is at the bottom just running through all the data and for each data we just add them up and we just make this array again, okay? And then this time the array has three indices. It'll have this left sketch index, it'll have the site and it'll have the right sketch index. And now let's imagine again what's happening if we had the true data instead of the empirical distribution underneath. So the left thing is what we actually do. The right side is just in our imagination. So we know what we actually do over here but then in our imagination if we had the true distribution underneath the right sketch comes down as before and kind of rolls up all this part which we're not ready to deal with yet and puts it into some kind of black box. And then the left sketch comes down and now something in a way kind of bad happens which is that U2 is kind of nicely sitting there but we have this left part which is something in a way that we don't want in a sense, right? Or that we have to somehow get rid of or factorize off in order to expose U2. So we have to kind of do this like double factorization. We have to get rid of A1, what I'm calling A1 here and this thing on the right. Okay, so now more stuff has to happen. So now you have to do two steps and this will be the whole algorithm. We've now seen the whole algorithm. Everything after this is just gonna be a loop. The whole algorithm is that you actually make A1. So you can make A1, right? Cause that's U1, we have that from the previous step. So we can actually make A1. We can say bring this left sketch down onto U1 which we had saved and reserved from before. So we can make A1. We can put A1, we have phi two over here and we can make an equation. So in this equation, this purple box is the unknown. We have A1, we have phi two. We can actually back solve for this thing called omega two, right? This is an AX equals B type of equation. We can solve this, right? 
So it doesn't look like a normal Ax equals b equation, but it actually is, right? So let's see for a second how that could go. I told you today is gonna be a bit more technical, so thanks for bearing with me. So the idea is that we can, let me see, maybe I actually have a slide on this. Let me check. No, okay. So it doesn't look like a standard Ax equals b equation, but here's how you can think of it as being one. What you can do is, first of all, just draw this in a different way. You can do what's called matricizing the equation, and this step is really not doing anything, it's just a different way of drawing the diagram. Though on a computer, if you want, you can think of fusing these indices together into a larger index, and then it really is a product of matrices. So then this really is just saying matrix A times matrix omega equals some matrix phi. And then what you can do is think of taking the columns of phi. So this could be some A with indices i, j, omega with indices j, k, and then phi with indices i, k. You can just fix k, and when you do, this turns into A omega equals little phi, where little phi is the corresponding column of capital Phi, and these are just vectors now. So this vector little phi will just be the first column of the matrix capital Phi, and then omega is just some vector we have to find. This is just a standard Ax equals b kind of problem. You can solve it with QR or LU or whatever. So you can solve this and then just tick through all the indices k and solve these one at a time and fill in this omega, right? In the end, this is just a code you can write once and just call. But just to convince you, you can roll this all back and you can actually solve an equation like this. So you have this known, known. And if the numbers in these are good and the problem is well defined and well conditioned, you can actually solve for this, okay?
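Here is what that back-solve looks like in NumPy, with made-up small dimensions. Fusing the extra indices of omega into one column index turns the tensor equation into an ordinary matrix equation, and `np.linalg.solve` handles all the columns k at once.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical shapes: A maps a bond index of size 3 to a sketch index of
# size 3; omega has a bond index, a physical index of size 2, and a right
# index of size 4.  phi2 = A . omega, contracted over the shared bond.
A = rng.standard_normal((3, 3))                       # (sketch_left, bond)
omega_true = rng.standard_normal((3, 2, 4))           # (bond, s2, sketch_right)
phi2 = np.tensordot(A, omega_true, axes=([1], [0]))   # (sketch_left, s2, sketch_right)

# Matricize: fuse (s2, sketch_right) into one fat column index.  Now this
# is an ordinary A @ X = Phi system, one Ax = b problem per column.
Phi = phi2.reshape(3, 2 * 4)
X = np.linalg.solve(A, Phi)      # requires A invertible / well conditioned
omega = X.reshape(3, 2, 4)       # un-fuse to recover the tensor
```

If A were ill conditioned you would swap `solve` for a least-squares or regularized solve, which is exactly the caveat discussed next.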
So that's a neat thing that you can do. It definitely won't always be unique or well defined. So if A is ill conditioned, then you could have problems. So it has to be set up well. Basically, A has to be invertible for each one of these mini equations. That's all it is, yeah. So if A is invertible. Well, so here U has kind of gone out of the problem at this part. So this is like a subroutine that we're thinking of running. So this was this A, and this is this omega that we don't know, and this is phi, which we got from all the data. And here I'm just folding them as matrices to motivate this writing. And then if you think of fixing these k's, you see now these only have one index. So then this is just a standard Ax equals b, right? And then the solution is just A inverse times phi, right? But of course you don't really want to do it that way; you use these other ways where you QR-factorize A first, in this kind of business. But the idea is that as long as A is invertible, then you're in business. And I didn't go through why this process would guarantee A to be invertible, but in practice that is seen in this algorithm. Thanks, yeah. Okay, yes. That's a really good question. I think so, yeah. You mean, could you get more adaptivity? Well, in a way we'll see later something very close to that, actually, when I show you one of the sketches. So let's go on, but that's a really good question. What we'll see, though, is that you actually do already get the adaptivity that you want. So the second line here shows, if that's what you're after, doing two-site DMRG, where you get to do an SVD and adjust the bond dimension. You actually do get that here as well, already in the one-site version. So you back solve for this omega two.
And in a way, all this is doing is getting rid of A1, because A1 has this sketch information that we want to get rid of. The sketch is just there to make the machine learning go better, but ultimately we want to get rid of the sketches because we made them up, right? They're kind of made up. So we want to get to the true distribution. So we invert off A1, in a sense, through this process, to get this omega two. And now omega two is like this part. It's just this yellow thing and this black thing now, because we got A1 out of there, right? So now we can actually pull out U2 through an SVD. So now we do an SVD on omega two. We get U2 and some stuff, and then we throw that stuff away too, because this is the stuff that's polluted by the right sketch. So we don't want that either. So we're getting rid of the left sketch, getting rid of the right sketch. We've got our gold now, which is U2, and we save it. And then, coming back to the earlier question, at least if the way I took the question was, can we discover this bond dimension adaptively? This SVD will do that. So we cut here, and we treat S2 and this left index as the row, R3 as the column. We SVD, and then we can actually learn this bond dimension from looking at the singular values. And if singular values are small, we can threshold and throw them away, and we can adjust the bond dimension of the MPS on the fly. So that's another feature of this algorithm: it's adaptively figuring out how many parameters of the learned model we need, based on the data and the sketching. Okay, so that's the whole algorithm. Everything after that just loops. So the third step is you just move the window over again. You sketch from the right, you sketch from the left. You have this A again, but this A now is your left sketch hitting the first two tensors that you've saved in reserve of your MPS.
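The adaptive bond-dimension step just described can be seen in a few lines. Below is a hedged NumPy sketch: an omega2 is built with a known rank of 3 (all dimensions invented for illustration), and thresholding the singular values discovers that bond dimension automatically.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical omega2 with indices (bond_left=2, s2=2, bond_right=6), built
# to have true rank 3 so there is something for the SVD to discover.
Lf = rng.standard_normal((4, 3))
Rf = rng.standard_normal((3, 6))
omega2 = (Lf @ Rf).reshape(2, 2, 6)

# Fuse (bond_left, s2) into the row index, keep bond_right as the column.
U, sv, Vt = np.linalg.svd(omega2.reshape(4, 6), full_matrices=False)

# Threshold the singular values: the number kept is the learned bond dimension.
cutoff = 1e-10 * sv[0]
new_chi = int(np.sum(sv > cutoff))
U2 = U[:, :new_chi].reshape(2, 2, new_chi)   # the recovered MPS tensor
```

The discarded S and V carry the right-sketch pollution, so only U2 is saved; and because U comes from an SVD, U2 automatically has the left-orthogonal isometry property.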
You do the same steps as before. This slide is almost identical. You get rid of the left sketch by back solving for omega. You get rid of the right sketch by SVDing. So you're kind of throwing away the left, throwing away the right. You're exposing again a piece of gold, which is a part of your true distribution again: U3, and that's it. And then the steps for U4 are very similar. You just shift the window, left sketch, right sketch, make A, back solve, SVD. So it's technical, but that's not that much code in the end, right? And I'll show you the code. So that's how it's looking. You're just moving the sketches, doing this loop, and each time, if you look in the upper right part of the slide, you're just recovering another tensor. And then for the last one, you don't need an SVD. You just run to the end of the system. You back solve for omega and then you just put it in. It's the last tensor, and, well, you could do an SVD, it's just that the S and the V would be trivial. They would be one-by-one matrices. So you just don't need that part at the end. Okay, so you see we've recovered a whole MPS. Again, we kind of pollute the data, so to speak, but then we do this linear algebra to remove that ad hoc thing that we did, and we recover the MPS. So when we're done, we actually have an MPS estimate of the whole underlying distribution from our data. And we could evaluate its quality by checking with some validation set. Or maybe we were doing some kind of benchmarking where we sneakily had the true distribution all along, and now we can bring it out and compare. Or we could generate new samples and look at the samples. There's a lot we could do with this in our hands. Okay, so let me just flash the code for you, too, to show you that this in the end is a pretty short code. So you see, here's those steps.
So let me just kind of walk you through the code I wrote to do this. So there's some basic parameters that we set up and I'll give you an example in a minute of a concrete example, but I say, how many variables is in my distribution? What's the size of my training set and things like that? Then I am just generating, in this case, I'm generating an actual underlying distribution to draw samples from. So here's where I draw the samples. I have something where I make a certain set of sketches. I'm not gonna show you that right now. But those sketches are just basically arrays of tensors. But just to show you how short the code is, this is the whole algorithm right here. So I apply sketch, so I take the data, I make it into an MPS temporarily, so that's what this part is doing. So that's just this little product of tensors. Apply sketch is a very short function, so here's apply sketch. That's the whole thing right there. So all it does, this is just a little bit of setup, but all it does is it just does a loop from the left to see i going from one to some j minus one, and it's just contracting tensors, L-I-D-I, L-I-D-I. It's just contract, contract, contract, contract. Right sketch is the same thing, R-I-D-I, R-I-D-I. This is just the same picture I had in the slides, just contracting these right blue tensors with these product of data tensors from the right, and then just contract those all together. So this is using the iTensor library, so you just say star, and it knows how to contract tensors based on the indices already, so you don't have to code the details yourself. Then I just do this back solving, so I say A, I'm making on the fly at the bottom here at the end of the loop, so I already have A from the previous loop, I say linear solve, I back solve for omega, I grab some indices to say which ones go on U and which ones go on V, and I called the iTensor SVD routine, whoops, okay. 
I call the SVD routine, and then all I do is save the U, and this comma underscore is just Julia notation that says I don't care about S and V; I'm just assigning them to a throwaway variable. And then I update the left projection from the previous A, the next U, and the next piece of the left sketch, and then I just loop. So that's the whole algorithm, very short. So even though it's technical, it's a very short bit of code, okay. All right, great. So let's go into now a bit more of an example of, like, what are the sketches all about? What's a concrete example of these? So we're gonna talk about a very simple, nice kind of 1D distribution. Yeah, still. That's right, it's a good question. So it scales linearly with the size of the data set, because each loop you just run over the data one time. But, in the way I coded it, you do have to do that at each window, so it's also gonna scale with the length of the system. The point being that at each step, the work you do is just linear in the data set size, so that's a good scaling to have, right? Yeah. Thanks. Okay, so let's focus now on a particular illustrative type of model that we're gonna learn, and this is called a Markov model. And the reason we're gonna pick these is because they're easy to understand, and for these we can actually develop an optimal sketching process. I find that's one of the main takeaways from this talk, which is that this is where something like interpretability, I don't know if I'm using the word correctly, comes in: we can actually have a theory about what we're doing. We can say, for this kind of data, there's this algorithm with this procedure, which is provably the right one if the data has this property. So we'll actually see that we can get that here.
So a Markov model is a distribution where, when you ask for the conditional probabilities, say the probability of S4, S5, up to SN conditioned on S1, S2, S3, this is just an example, it only ends up depending on the previous variable. So the conditional doesn't actually depend on S1 and S2, only on S3, only on the variable just before the set that we're keeping. And I'll illustrate this a few ways. So don't worry, I'm gonna explain this some more. So it just has this kind of nearest-neighbor conditional dependence property. Intuitively, if we think of these variables as somehow like a classical stat mech system, this means we have nearest-neighbor interactions in this classical stat mech picture. And I'll make a concrete example of that too. So let's go back a bit. What's the definition of conditional probability? The definition is that you take the joint probability, the probability of all the variables, and divide by the marginal. So you say, what if we sum out over all these other variables here, and just look at the marginal probability of the ones we're conditioning on, and you put that in the denominator. In a way, that's just a normalization, just to get the thing right. Marginal probability means that the variables you don't write have been summed out. Okay, summed over. This is just another way of writing the same thing. So you say the conditional probability of S4 to SN, conditioned on these, is all of them divided by just exposing the ones on the right underneath, okay? And then these black tensors are just these summing tensors like before, okay? This is just standard probability theory. This isn't anything I'm adding. This is just textbook probability, okay?
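Since this really is just textbook probability, it's easy to check numerically. A small NumPy example (variable count and sizes invented): divide the joint by the marginal of the conditioned variables, and every conditional slice sums to one.

```python
import numpy as np

rng = np.random.default_rng(5)

# A random joint distribution over three binary variables.
P = rng.random((2, 2, 2))
P /= P.sum()

# Conditional P(s2, s3 | s1) = P(s1, s2, s3) / P(s1), where the marginal
# P(s1) sums out every variable not conditioned on.
marg_s1 = P.sum(axis=(1, 2))            # shape (2,): the marginal P(s1)
cond = P / marg_s1[:, None, None]       # shape (2, 2, 2): P(s2, s3 | s1)
```

The marginal in the denominator is exactly the normalization that makes each slice `cond[s1]` a proper distribution over (s2, s3).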
And then my assertion is that any Markov model, meaning a model where you have this property, where the conditional only depends on that one previous variable, can be written as the following kind of special type of MPS. And I think this is a neat picture. What it is, is a special MPS where the site tensors have this delta-tensor property, we'll unpack this a bit, and then on the bonds you have these bond or weight matrices. And I'm also gonna say they have positive entries, to make it a proper probability distribution. But the Markov thing is the fact that all the complexity of this MPS is in the bonds. The sites are something extremely simple, right? So what does it mean to have this special type of site tensor? It means things like this. If I take these site tensors and clamp one of the variables, it's gonna cut the system. So let's say this is S4, and let's say I set S4 to one. Then this line is gonna be equal to one, and it's gonna come in here and cut, and it's gonna make these both equal to one. So I'm gonna have a one here and a one here, right? Because this is a delta tensor. Let me say a bit more. A delta tensor means all the legs have to be equal. So it's just like a high-dimensional Kronecker delta. If they're all equal to one, you get the number one. If they're all equal to two, you get the number one. But if even one of them is different, you get zero. So it's like a rule that clamps these lines together. Another name for this is a hyperedge. It means a line, but like a multi-line, right? So here I was saying this is what would happen if I set that to one. It splits, right? So it actually cuts the chain, okay? That's something very special.
So it's some kind of conditional dependence property. These Markov models, if you clamp one of the variables, it cuts all the dependence between the left and the right, okay? So let's check this Markov property, to see if my ansatz has it. So we'll compute the conditional probability of S4 to SN conditioned on S1, S2, S3. All I'm doing is taking this standard definition, just written as diagrams, and then I'm gonna feed my ansatz for this Markov model into it. So you see four through six are summed over here, see those dots, right? And on top is the full probability, and down here we expose one, two, and three. And let's see if this Markov property holds. And I think we get a neat graphical proof of this, all right? So first of all, we use this cutting property that I just showed. So think of these S's as fixed yet arbitrary, right? So they're fixed values, in the sense on the right. So then they come down and they cut. And then I still have these weights, but the weights are determined now by the values on their left and right. So here I just get some number, some particular number, that's the S2, S3 entry of this matrix. And this number is the S1, S2 entry of that matrix, right? But down here I have the same number, okay? And down here I have the same number. So I can cancel those, right? These are just numbers, because I'm thinking of the S's as fixed but arbitrary. So they always cancel on the left, and then I can throw them away. And then I can say that clearly this now only depends on S3; S1 and S2 are just gone. That ratio made it so those fell away, right? So I see that it's really conditionally dependent only on S3, and not on S1 and S2. So I like that, it's a nice graphical proof that this does give you a Markov model, okay? Any question about that? Great.
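The graphical proof can be double-checked numerically. Below is a small NumPy sketch with a chain of four binary variables: delta tensors on the sites mean the joint is just a product of positive bond matrices, and the conditional on the first two variables indeed loses its dependence on s1.

```python
import numpy as np

rng = np.random.default_rng(6)

# Markov-model "MPS": delta tensors on the sites, positive weight matrices
# on the bonds, so the joint is P(s1, s2, s3, s4) ~ W1[s1,s2] W2[s2,s3] W3[s3,s4].
W = [rng.random((2, 2)) + 0.1 for _ in range(3)]
P = np.einsum('ab,bc,cd->abcd', W[0], W[1], W[2])
P /= P.sum()

# Conditional P(s3, s4 | s1, s2): divide by the marginal over s3, s4.
marg = P.sum(axis=(2, 3))
cond = P / marg[:, :, None, None]
```

Because the W1 factor appears in both numerator and denominator it cancels, exactly as in the diagram argument, so `cond[0]` and `cond[1]` (the two values of s1) are identical arrays.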
And so as I mentioned in words a minute ago, a good example of a Markov model would be a classical nearest neighbor Ising model. You can actually write its probability distribution in this way where if you just pick the, here I'm just taking uniform J's. So you think of the energy as just being sum over SJ, SJ plus one, just nearest neighbor, there's the energy. And then you just make the Boltzmann weight by saying e to the minus beta where beta is one over the temperature of that energy. And then you can, there's this classic thing that's been known for a long time, which is that you can write it as a chain of what's called transfer matrices, right? The transfer matrices are just these matrices. And then what we're doing is we're putting these delta tensors so that we can kind of like look inside the partition function. So each of these legs going in is like saying, I want to control the values of these S's. I don't want to sum over them all. I actually want to like pick what they are, right? And we would get the sum if we just actually summed over all the S's. But here we're writing the joint probability down for this classical system. Any questions about that? Okay, so just to emphasize that last point, if we took that construction, if we wanted to get Z the partition function, we would just sum over all the S's, okay? Of P, S1, S2, SN. And that would be like doing this. So you can play these diagram games, okay? And these are my same summing tensors that I have been using a lot, right? And so then we can use these properties of delta tensors to just absorb those summing tensors so we can just write it this way. Actually here we just get a straight line, okay? And then we just have a contraction left to right that just zips through and it just contracts these transfer matrices together. And this is a well-known form for the Ising model partition function. It gives you the partition function of the Ising model. 
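That transfer-matrix contraction is easy to verify in code. Here is a short NumPy check, with arbitrary example values (beta = 0.7, J = 1, six spins, open boundaries), that contracting the chain of transfer matrices reproduces the brute-force sum over all configurations.

```python
import numpy as np
from itertools import product

# 1D classical Ising chain: E(s) = -J * sum_j s_j s_{j+1}, spins s_j = +-1,
# so the Boltzmann weight per bond is exp(+beta * J * s_j * s_{j+1}).
N, J, beta = 6, 1.0, 0.7

# Transfer matrix T[s, s'] = exp(beta * J * s * s').
spins = np.array([1.0, -1.0])
T = np.exp(beta * J * np.outer(spins, spins))

# Z = contraction left to right: boundary vectors of ones (the summing
# tensors) around a chain of N-1 transfer matrices.
v = np.ones(2)
Z_transfer = v @ np.linalg.matrix_power(T, N - 1) @ v

# Brute force over all 2^N spin configurations, to check.
Z_brute = sum(
    np.exp(beta * J * sum(s[j] * s[j + 1] for j in range(N - 1)))
    for s in product([1, -1], repeat=N)
)
```

The brute-force sum costs 2^N while the contraction costs N matrix products, which is the whole point of the transfer-matrix solution in 1D.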
So this is actually one way you can just solve the Ising model in 1D, okay? So I just think it's nice that you can use these diagrams to see things that might be a bit obscure otherwise. So then the neat part, the reason I was doing all this work to set up Markov models and talk about them is that one neat thing about this paper with Yoo-Ha and Jeremy and Michael and Yun-Hang is that they actually devised an ideal or optimal sketching strategy for this class of models, once and for all. So it doesn't matter what beta is. It doesn't matter whether it's the Ising model or some other Markov model. You can actually devise and prove an optimal sketching strategy. So I think that kind of thing is very neat. That's what inspires me about trying to bring this area of tensor networks to machine learning is that instead of having to kind of guess, maybe we use this kernel or that kernel is the right one or maybe we use this amount of weight layers or that amount of weight layers and how do we know it's the right amount? Here you can actually like prove things and really set things up properly in some cases. I think that's very neat. And so before I was very obscure about what were the sketches, what's the details going on in here? I said, let's just let these be random tensors. But for these Markov models, you can show that the actual ideal sketch is the following. Is that you actually just sum over all the variables to the right except for the one that's neighboring. So if you're trying to expose the second index in this example, you just expose the third one also through the right sketch and all the rest you just sum out. You just marginalize them away. That turns out to be the correct sketch for Markov models. And that's intuitive because you have this nearest neighbor only conditional dependence that I went through great pains just to show a minute ago. And this is a little very technical thing I threw together last night. 
So if it doesn't go through, don't worry about it. But it was just a very quick proof sketch of why this is the optimal sketch. It was just to say that if we fold the distribution into a matrix, splitting between the first two indices and the rest, you can actually explicitly show that this is low rank, factorizing into these kind of left and right vectors. But the point is they're labeled by S3. This is just a visual proof that if you cut there, the rank is bounded by the size of that variable. And so that means the column space of this matricization would be unaffected by marginalizing variables four through six, because the label S3 that labels our low-rank decomposition at the cut is all that feeds through. The point is that these vectors are uniquely determined by S3. That's just a property of these Markov models. So once I set S3, I can think of this as uniquely giving me a vector on the right. So every unique independent vector in the column space is in one-to-one correspondence with a setting of S3. So the point is that S3 is all the information. That's why the sketch preserving it is sufficient to reconstruct the distribution. Anyway, that part's very technical. I just kind of threw it out there to show there is some thinking behind it and you can prove things. And then just one other thing to say before I give a demo is that you can also generalize this idea. So you can also talk about higher order Markov models. These are things like second order Markov models where you have not just nearest neighbor interactions but second neighbor interactions. But then you still have this kind of blockwise conditional independence, and you can still come up with an ideal sketch for them, which is just that you expose two of the variables instead of one and the rest you sum over. And so then you can take these sketches and you can feed them to that algorithm I showed earlier.
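To make that proof sketch concrete, here's a small numerical illustration of my own (not from the paper): build the joint distribution of a short first-order Markov chain, matricize between the second and third variables, and check both that the rank is bounded by the number of states of S3 and that marginalizing the far-right variables preserves the column space.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 6

# Random first-order Markov chain: initial distribution and transition matrices
p0 = rng.random(d)
p0 /= p0.sum()
A = rng.random((n - 1, d, d))
A /= A.sum(axis=2, keepdims=True)   # A[i, s, s'] = P(s_{i+1}=s' | s_i=s)

# Build the full joint P(s1, ..., s6) as a dense tensor
P = np.zeros((d,) * n)
for idx in np.ndindex(*P.shape):
    p = p0[idx[0]]
    for i in range(n - 1):
        p *= A[i, idx[i], idx[i + 1]]
    P[idx] = p

# Matricize: rows = (s1, s2), columns = (s3, ..., s6)
M = P.reshape(d * d, d ** (n - 2))
print(np.linalg.matrix_rank(M))     # bounded by d: only s3 feeds through the cut

# The sketched matrix: marginalize s4..s6, keeping s3 exposed
M_sk = P.sum(axis=(3, 4, 5)).reshape(d * d, d)
# The sketch preserves the column space: stacking adds no new directions
print(np.linalg.matrix_rank(np.hstack([M, M_sk])))
```

The rank never exceeds d, the number of settings of S3, and the sketched matrix spans the same column space as the full matricization, which is exactly the "S3 is all the information" statement.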
And then the point is that these are like universal sketches. So no matter what beta you did for the Ising model, no matter whether the Markov model is an Ising model or something else, these sketches are known. You can use them. And then they'll just have nice convergence properties as you get more and more data. You'll reconstruct the true distribution. Okay. So then just to kind of wrap up, let me just show you a demo. So we can use this TTRS algorithm and these kinds of Markov sketches to reconstruct the probability distribution of a disordered 1D classical Ising chain. So let's just take the coupling coefficients J to be drawn once from a Gaussian distribution. So we run that once and we save those coefficients. We'll take beta to be two, so that's the inverse temperature, and we'll just take 20 spins just to make the code go faster. And then what I'm doing is just generating the data, different amounts of data, always with the same J's, and then running this TTRS algorithm with these special Markov sketches where you just marginalize to the right and left and expose the nearest variable. So let me show you what that looks like to run the code. So it's the same code I showed. Everything was very general. And then here's the part I didn't show you earlier where I actually make the sketches. So these are these Markov sketches. And all I'm doing is I am making, in some cases, these pass-through tensors, and this is just, see, the identity matrix right there. So that's just one of these lines that's going through. And then I also have these tensors which are the summing tensors. So see, this one's like a vector of (1, 1) going in. So what this code is doing is it's taking a bunch of those and it's multiplying these summing tensors times these pass-through lines and putting that all together. So it's just making a repeating pattern like the following.
It's saying that the first sketch would just be pass-through, and then the next sketch will say sum on that pass-through and then put another pass-through, right? So that's the next box. And then the next one will say sum on that previous one and put another line passing through, and so on, right? So those blue boxes that I showed earlier, when you look inside them, they actually have a structure like this. It's saying look at the previous variable, but on the next step throw it away and bring the next variable into the game, right? And then throw it away on the next sketch, and so on. So that's actually what's inside those, coming from the right, and from the left it's the same thing in reverse. Okay, so that's what this code that I'm showing is doing, just making those boxes, right? And then the rest of the code is the same thing I showed earlier where you apply the sketches, do the back-solving, do the SVD, form the next tensor, zipping from left to right once over all the data. And then if you, say, take a training set size of 1,000, then you can run the code. So it's showing me that the J's that were generated are the fixed J's, because I set a random seed, and that was the whole run of the code, and I get a 99% fidelity to the true distribution and a small L2 distance. I can run it again for a 4,000 training set. It runs a little slower, but it's just a linear scaling. But the point is it just runs once from left to right and it learns the entire Ising model distribution just from these 4,000 samples, right? And you see the fidelity is a lot better. It's 99.8% now, right? So it went up quite a bit. We almost got a third nine going from 1,000 samples to 4,000 samples. Great, okay. And then here's showing some more results that I collected. So if you actually go from 100 samples up to 20,000 samples and you plot this distance from the true distribution.
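That repeating identity-plus-summing-vector pattern can be written down in a few lines. This is my own hypothetical reconstruction of the idea, not the lecture's actual code: the right sketch matrix is a Kronecker product of an identity (the pass-through line) with all-ones vectors (the summing tensors), and applying it to the matricized tensor is the same as marginalizing the far variables.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 2, 6
P = rng.random((d,) * n)   # stand-in joint tensor over n binary variables

def right_sketch(k):
    # Right sketch at cut k: pass s_{k+1} through (identity) and sum out
    # s_{k+2}, ..., s_n (all-ones vectors) -- the repeating pattern above
    ones_part = np.ones(d ** (n - k - 2))
    return np.kron(np.eye(d), ones_part)     # shape (d, d^{n-k-1})

k = 1  # cut between s2 and s3 (0-indexed)
M = P.reshape(d ** (k + 1), d ** (n - k - 1))   # matricize at the cut
sketched = M @ right_sketch(k).T

# Applying the sketch equals marginalizing the far-right variables
marginal = P.sum(axis=tuple(range(k + 2, n))).reshape(d ** (k + 1), d)
print(np.allclose(sketched, marginal))
```

The left sketches would be built the same way in reverse, mirroring the "same thing in reverse" description above.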
So p star is the true distribution that I got the data from, held in reserve, and I subtract the distribution that I learned and then just divide by the norm to make things properly normalized. You can see this error is going down and down and down on a log scale. So it's coming down very nicely as I plot it against the log of the number of samples, and in the asymptotic regime you get this behavior like it's going like one over square root of the number of samples, which is kind of the ideal behavior theoretically that you should have in this regime. So it's a nicely behaving algorithm in this case. All right, so then as promised we'll wrap up a little bit early, which I think you won't mind. So you've earned it, thanks for hanging with me in a very technical talk. So we discussed this machine learning method for tensor networks based on sketching the data. And again, sketching was this idea of somehow blurring out the data on the left and the right, but then kind of scanning through in this 1D space in a way that we still get glimpses of the data in some principled way, and we use linear algebra to kind of throw away the errors in the sketches. So I find something very interesting about all this. It's like we take the right data that we actually got and we mess it up on purpose. We actually somehow replace it with like wrong data, but then we undo that, and somehow, when all this process boils down, we actually get closer to the true distribution. So you have to kind of sit with it for a while to think about this. Why is it the right thing to do to distort or mess up your data? Aren't you making the problem worse? Like you're actually messing up the data in some way that you made up, but somehow that's pushing you closer to the true distribution. And I think the answer is very simple. It's the wrong philosophy to wanna protect this nice data that you have, because the data has all these holes in it.
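Just to illustrate that one-over-square-root behavior with the simplest possible stand-in (a toy example of mine, not the actual Ising experiment): estimate a fixed small distribution from samples and watch the normalized L2 error shrink as the sample count grows.

```python
import numpy as np

rng = np.random.default_rng(2)
p_star = rng.random(16)
p_star /= p_star.sum()          # fixed "true" distribution held in reserve

def l2_error(n_samples, trials=200):
    # Average normalized L2 distance between empirical and true distributions
    errs = []
    for _ in range(trials):
        samples = rng.choice(len(p_star), size=n_samples, p=p_star)
        p_hat = np.bincount(samples, minlength=len(p_star)) / n_samples
        errs.append(np.linalg.norm(p_hat - p_star) / np.linalg.norm(p_star))
    return float(np.mean(errs))

for n in [100, 400, 1600, 6400]:
    print(n, l2_error(n))       # shrinks roughly like 1 / sqrt(n)
```

Each fourfold increase in samples roughly halves the error, the same asymptotic regime as the plot described above, though of course the real algorithm is learning a tensor train rather than a histogram.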
There's also all this other data that you didn't collect. And so the sketching is trying to model or fill in all those holes. It's like all the other millions and millions of data points that you could have collected but didn't have the time or the resources to. It's trying to kind of patch through all of that and fill in all those holes. So even though it's ad hoc, it has the effect of filling in all the missing data, but then you also undo the effect of it. All that back-solving linear algebra and SVD was getting all the benefits of that ad hoc blurring of the data, but also not letting too much of that bias seep into your final result. So that's what all that linear algebra was doing, removing it at the end. And in the end you get a method that scales linearly with the size of your training set. No gradient descent anywhere. That's another reason I like this approach. You'll hear a lot of people talking about things like dynamics of gradient descent and saddle points, and maybe that's the key to learning, and maybe our brains are doing gradient descent. And you hear all these things. And when I hear that, I'm always thinking it's a little premature, because here's a method that doesn't use gradient descent, it uses linear algebra. So are our brains doing linear algebra? I don't know. So all I'm trying to say is, I think we should be a little cautious about saying what intelligence is or how our brains work based on one kind of math that works, because there's other math that also works for the same problem. So I'm just trying to say we should be very scientific about that kind of thing. So machine learning is not gradient descent. They're not the same thing. You don't have to use gradient descent, although it's an excellent tool for some things. And we do a single pass over the variables from left to right.
Also something that I didn't explore, but it could have been a nice plot that I could have shown, is that we could imagine for other data that we could be less sure about whether we have the right sketch. So for these Markov models, there's this provable optimal sketch, everything is very nice. What if we don't have something as clean as that? Then you would have to make up a sketch, maybe with some kind of randomized linear algebra, something like that. And what we could do is we could say, we don't know if we made it up right. Maybe we made the sketch too blurry or too tight. We don't know. You could just do the algorithm a few times and compare to some validation set, or compare to your previous result, and just see, as we make the sketch more and more blurry or more and more fine, how much does the result change? So that's a nice kind of control knob that the method gives you. And it's nice that in some cases you can actually prove and discover optimal sketches. Something that we just need to think about now, this is a pretty new idea, is what other sketching approaches can be designed. And the intuition is that we just wanna somehow think of creating a small basis of functions that somehow captures the correct nearby correlations or local properties of whatever data we're learning. So whether it's continuous data, or whether it could be audio or text or language, if you're interested in this algorithm, there's more work that needs to be done to figure out what's the right sketching in each case. And maybe in a lot of cases this could even be provably figured out. So I think it's quite possible that for, say, smooth functions with some notion of, it's differentiable to this order or it's fitable by a polynomial of this degree, an ideal sketch could probably be proved and designed for functions like that. So I think there could be some neat work that could be done.
Another thing that I've been kind of bugging Yuha about, and he'd probably be bothered by me putting this on the slide, is do we really need the left sketch? I mean, we already got those previous U's from before. What if we just self-sketch the data? So what if we just make the left sketch be U inverse or U dagger on the left? Then we don't even need that whole left part of the algorithm, we don't need all that back-solving. But I think that probably works worse, and I think the reason is that the U's depend on our data set, and so they have this variance property, they depend on the size of our data set. So we might be feeding that variance back into the algorithm. So maybe that's not a good idea. It's just something I wanna try and think more about. And then, how far can we push this algorithm? Can we get it to do stuff besides these kind of physics problems? Can we do images and audio data and text and things like that? I think possibly yes, but we need to work through the details of how. So, okay, that's all I have for the first lecture today. Thanks for your patience with a really technical talk. So, thanks. Thank you. Yeah, thanks a lot, Miles, for this very insightful lecture. Are there still questions? It's a good question. I think I'm just gonna have to repeat the question. The question, I think, was just, is there a way to know something like a priori how many samples? Oh, right, right. So this is where working on a math project was interesting, because I was more interested in things like, can we put some diagrams in the paper? So this was my contribution to the paper, like, let's put some diagrams. And so I kept trying to say, we need more diagrams, because I wanted the paper to just look like my slides that I just showed you today, right?
So I actually just, this is the first time I think I ever drew these diagrams because I was trying to understand kind of like my own paper that I was involved in. So I was like, I'm gonna get my head around this once and for all the way I like to think about it. But you can see the paper is actually much more like this, right? So this is what it's like working with these applied math folks. So I'm just picking on them a little bit. But what's also great about them is that they will go and prove things that I wouldn't know how to prove or I wouldn't take the time to prove. So somewhere in here they actually do, they do, they grapple with exactly what you're asking, which is like the convergence properties of the algorithm as a function of dimension, but you can see I'm kind of struggling to find where it is, so. But what I'm trying to say basically is my answer is the answer exists and it's in there and we can discuss it offline. And it's a really good question, but I believe it's a very good dependence. I think it's like, you know, very weakly dependent on the problem size. Question over there, bring the mic. Yeah, I think that's a great question. Quantum tomography might be kind of an ideal application. I think it's kind of a natural fit. So I think it's really so new. I don't know a lot of what's the range of what it could do, but I think that could be an ideal fit. And what I could really envision here is that the sketching fits very well, I think with the idea of POVMs. Because if you think about it, POVMs is like this overcomplete basis kind of thing. That's a lot like what the sketches are doing. It's kind of like saying, let's look at the data from all these different ways, make sure that it's not too much information, but just the right amount of information that sort of captures everything. You know, and also I like it because it's a way of putting in priors that we might have about the state. 
So if we think it's a gapped state with an area law in 1D, we know the correlations decay exponentially. So we know that this form is sort of right, where this index here only gives you access to a certain distance, and beyond that it kind of suppresses the information. So I think it's sort of well set up for that type of thing. Oh, and I like your question for a different reason as well, which is that I ran through the algorithm as it's shown in that paper. So in a way, this talk was me kind of talking to myself a little bit, because I wanted to understand the math in the paper better, you know, which I understood, but I wanted to really understand it in my bones, in the way I like to understand it with these diagrams. And what that helped me to do also is that I wanna go back through all the math and devise the quantum version of the algorithm, right? And the quantum version of the algorithm will be slightly different in many respects. So instead of, you know, marginalizing by putting these vectors here in the sketch, the analogous thing in the quantum case would actually be like a kind of double-sided sketch, where you would actually have your data twice. And then if it had something like a quantum analog of that Markov property, you wouldn't have these vectors. What you would have are these traces. So then the sketch has to be something like a kind of superoperator that goes around here. So the details have to be slightly different. And so I think it leads to some interesting questions, but I think they're questions we could easily solve, it'll just look really different in the classical versus quantum case, I think. So I think it's very interesting. Thanks. I'll run with the mic. We have a few people online, for them it might be better. Hi, thanks for the great talk.
So I have a question about, like, in the long term, if this is to be something that can be applied to several machine learning problems, we also need to scale it up. And one of the advantages of, let's say, neural networks with stochastic gradient algorithms is that they kind of scale almost ideally. So what can you say about how to scale those tensor networks? Because from my understanding, in at least computational quantum physics, it's very hard to scale them up to exploit multiple CPUs, multiple nodes, et cetera, et cetera. So yeah, something to say about that? Yeah, that's a good question on a lot of different fronts. So let's see. So one thing is that this algorithm is very parallelizable in certain parts, right? So that part I showed where we're applying the sketch to each data point, that's very data parallel. So that part we could do with a pretty ideal parallelization. Also, generally here we're using dense tensors, and we've gotten those parallelized on GPU now quite well. So that could all be sped up, I think, with multiple levels of parallelism quite a bit. Also probably using some sparsity and things there as well. Now I think the biggest challenge of all this would be more like scaling up in dimensionality, right? Like you mentioned, I think it's totally correct that neural networks, say for physics, have this nice advantage that you could just go straight ahead into 2D or straight ahead into the continuum, like do chemistry, do 2D models straight ahead, and there's not really a lot of extra technical work that you have to do. I think here, for now, I would say this is probably still gonna spend some time being more of a 1D-oriented method, maybe doing 2D by some kind of zigzagging or snaking kind of thing. But then certainly one thing we could already do is work on trees, and I'll revisit this idea of continuous variables in the next talk.
But a very nice thing that could be done, and this is pretty sketchy, pun intended I guess, but something that I have kind of floating around in my head, is to say that 1D sounds like a big limitation, but maybe it's not such a big limitation. Which is that we could imagine architectures like this, where we have a tree. By tree I just mean a graph with no loops, right? Where we have lots of indices like this, and then each of these collections of indices is resolving a continuous variable like X1, X2, X3. Or actually we can go even further. We could have X1, Y1, Z1, X2, Y2, Z2, and so on. And then this could be a function, to be suggestive about it, this could be Psi of X1, Y1, Z1, X2, Y2, Z2, and so on. And this could actually now in a sense be in 3D space, at least formally in 3D space, in the sense that there's an X, a Y, and a Z, and you could try to capture this kind of function. So we could already be in 3D in a certain sense, but in more of a first quantized sense, right? Now there's still a 1D aspect to it, which is that I made it a linear network, right? So it's still something like 3D balls connected by springs, but they are in 3D, okay? But they still have a 1D kind of correlational structure to them. So you can imagine something like a 3D tube, and you have particles in there, and they're confined to this tube, but they still have some amount of freedom to move in 3D. I think stuff like that is on the table, but I don't think arbitrary 2D and 3D is totally on the table yet, because we're still a bit stuck with working with trees and chains and things like that. But that's where also I think newer ideas, new to quantum physics but maybe very old to stat mech, are coming in now, like belief propagation and things like this, that could also help us start to crack 2D and 3D a bit more as well. So I think that's some of the future directions I see possibly opening up, but I think you're right that right now things like neural networks are a lot more flexible.
Like if we need a result today, later this year or next year, I think they're much more flexible right now about dimensionality and this kind of thing. So I think these are the trade-offs we're sort of wrestling with. Yeah, question right here in the front. Mm-hmm. So regarding hardware, we're moving sort of in the direction of making algorithms that run in memory, so neuromorphic computing, or whatever you would like to call it. Do you see a future for this algorithm in that regard? Because the energy concerns for AI are real. Yeah, I really like this question. That's actually one of my biggest motivations, the thing about energy usage, definitely, but also in a way just philosophically. Like, I guess I'm just kind of an algorithms person, and maybe I'm a little bit too much on Twitter or something, but I'll see all these kind of takes, and sometimes they're from pretty serious people, like it'll be Sam Altman, the head of OpenAI or something, and they'll say with a very straight face online, the future is that whoever has the most compute will control the future. And he's going around to these venture capitalists and these sovereign wealth funds of countries and saying, I'm gonna raise $7 trillion, like it's a real number that you may have heard.
To get all these GPUs and all this kind of thing. And I find that on the one hand interesting, but ultimately I find that vision of computing rather depressing, because it's kind of like saying we have nothing to offer as human beings, our brains have nothing more to offer, all we have to do is just go collect a lot of metals from the earth and just heat up the air and melt the ocean. And I'm like, that's really it? That's all we're gonna do? I think we can do something smarter than that, that's kind of my hope. And so I know you're not saying that, but I'm just saying, when I hear Sam Altman say that, I'm like, is that really what the greatest minds of our generation, is that their plan? So to me, if "that's all you need," the phrase that people throw around a lot, is really all there is, that's kind of depressing, frankly. And so what we could try to offer back, I hope, would be to say, what if instead of just throwing more compute at something and just getting more data, what if we could just use the data better? And maybe we can't, and maybe there's no way to do that, but that's what I liked about this idea of sketching. It was to say, maybe there's a way of processing the data to kind of get it ready for learning, so that you don't need as much of it, or that whatever you have, you use it to the fullest extent, this kind of thing. And that's where I think your question was really good, and I wish I had a better answer about the exact kind of scaling and data efficiency of this type of thing. That's where I need to sit more and spend more time understanding how this compares to other machine learning methods, because I only kind of dip in and out of machine learning. I'm sort of moonlighting as a machine learning person and then going back to quantum physics, so I'm not always as versed as I should be in what the exact theorems are for this style. But I think partly also
in some of the other methods, it's not always known exactly what the provable scaling with data set size or convergence really is, these kinds of things. So all I'm trying to say is that I hope this is a little bit of inspiration, for me and for you, that we could try to use our data better. And you see, that's why I like this Markov example, because you actually have some sense there that, at least for this algorithm, there's some provably best way to treat the data, so that you need the least of it and you make the most out of what you have, that kind of thing. So I hope that kind of addresses your question. Thanks. Good, I would suggest that we take further questions to the break. Okay, great. So let's thank Miles once more.