Good to see you all again. So today I'll be following up on the rest of the material I started yesterday introducing machine learning. We'll go quickly through that and then we'll go into applications of tensor networks to machine learning, which is kind of a new area that I'm trying to launch. So it's definitely kind of experimental, if I can use that word, although it's theory. But I'm hoping that some of you find it interesting and useful. First of all, yesterday some people requested that I post the slides ahead of time, so I wanted to put a link up. There was also an email that went around with this link, so if you have the email, you'll have the link. Otherwise I'll wait a second if you want to copy it down, or take a picture, that's probably the fastest way. It's on the conference website as well. Okay, anybody need more time to write that link down? Okay, good, all right, so we'll get started. So yesterday I introduced tensor networks, mostly talking about matrix product states, but I did touch on PEPS and MERA as well just to let you know about those. Those are still kind of future areas for tensor networks, I would say; I wouldn't start there if I were you, but you can talk to some people. I was talking yesterday to someone who's working on PEPS, and it's still a very tricky topic, but I think it's coming along rapidly, so we'll see big improvements in that area. MERA is also very interesting and I think we'll see more uses of it as the field builds. Matrix product states are still the workhorse of this field, so I talked about them and about computations with matrix product states, how you can do things efficiently with them that would be intractable otherwise, and we started talking about machine learning. So let's continue with that today. This is just a review from yesterday.
I was summarizing what are like the three kinds of machine learning tasks and you know, they don't all have to fit into these categories but this is just helpful to have some categories and they are supervised learning where you have labeled data, unsupervised learning which are like all the things you can do with unlabeled data and then reinforcement learning which I just wanted to mention for completeness because it's this kind of third big area that's kind of a hot area in machine learning and remember this has to do with how much information you have about your data up front. So supervised learning, right? That's the idea of you're given data with labels and then you're supposed to produce a function that can assign the correct label to further data that you didn't train on. Unsupervised learning is more like things like you have some kind of notion of probability of seeing the different examples in the data and you're trying to kind of fit that probability distribution just from the examples. You don't have it but you imagine it's there and then you try to make a function that approximates it. You can also do things like clustering and making reduced representations. If we have time at the end, I'm gonna talk about one way to do this, making reduced representations of data with tensor networks. So I think that's a pretty interesting area for them. And so then now to the slides we didn't get to yesterday. So kind of to summarize the approaches in machine learning, the idea is that instead of the solution to a problem being like a formula that you write down or some kind of logical code that you program like this kind of if then and you produce this sort of pristine piece of code, in machine learning the solution to your problem is viewed as just being some function. So you just say okay, in the space of all functions there exists a function that would solve my problem. I just have to go find it or something near it. 
So instead of looking in the space of all functions, which would be way too big, you can't even really feasibly work with that space, what you do is parameterize a class of very flexible functions. Just about any functions will do as long as they're extremely flexible and have lots of adjustable parameters, and hopefully other good properties as well; we could get into that. But that's still something I think machine learning is trying to figure out: is it really about those functions being limited in the right way, or is it more about their flexibility, or is it something more subtle? There's this topic in machine learning called inductive bias, which is in some sense just the question of what functions are the right functions, but what does that mean? So they're trying to figure that out. One example could be that we know in physics that wave functions have a kind of hierarchical structure imparted on them by these RG ideas. So we know in physics that some kind of hierarchical structure is the right inductive bias for many-body wave functions. People are still looking for that analog in machine learning, I would say. There's also this point about picking very flexible functions, just about any flexible functions. This is where I'd say it's a bit different from fitting, because some people will say machine learning is just fitting, right? But to me fitting implies something more like that you know precisely why you're justified in using, I mean this may be a bit of a reach, but you're justified in knowing why you pick a certain function. You say, well, this is some curve where something is going to zero, so there's a small parameter, so I'm justified in expanding in a Taylor series, so then I just need to fit the first few terms of a Taylor series. You have some kind of theory like that. Here you're not really trying to justify your choice of functions to that level.
You're preferring just convenient functions, things that you can actually optimize over, whether or not they're exactly the right thing. And then how do you really pick the function? You say, okay, pick a bunch of functions that I can actually work with. Then there might be a bunch that all do equally well at representing the true function. So of all the ones that do basically equally well, you prefer the simplest one. And that's the idea that leads to generalization: the idea that out of all the functions that could account for the data I've seen so far, perhaps if I go with the simplest one, that'll continue to work on data I haven't seen. Yeah. Sure. Yeah, I mean, I wouldn't want to stress that point too much. It's just that I would say maybe there's more to machine learning than just fitting the data that you have. So maybe it's related to your point that, to me, when you're fitting, it's somewhat more in the realm of really wanting to go through the points that you have, mostly, or that you have some kind of knowledge of the correct function class. Here it's a little bit different. You're thinking more statistically, and you're not as worried about whether you have the correct function class, because you're going to fix that anyway by also taking into account that you have limited data. And so there's also some notion of regularization, and that's what this point is getting toward. So I wouldn't want to say I could really define what fitting is; it was more motivational. Oh no, there certainly is such a thing as overfitting. Yeah, that's where you take your data too seriously, or you take your function class too seriously, and you say the function must go through all the data that I have. But that's not correct unless you have all the data and the correct function class. So unless both of those things are true, then it's overfitting. Yeah, good question.
Okay, so you could talk for days about the philosophy of machine learning, but that's just a little flavor. Now to some more detailed things that are really important to know. In terms of what the field of machine learning is really doing, like what do people actually do? This is very helpful for understanding where different ideas that you hear about sit in the history of the thinking of people in that field. So this has to do with the kinds of model classes that people study. The most important one for learning about this field is the linear model, because it's something that you should always use as a tool, it's good to know about and think about, and it's very simple. Then two other very important ones are these classes of models that I would call kernel learning models; support vector machines is another name you'll hear. Basically those are a special case of these where you have a particular choice of cost function as well as model type, and then more generally you can just talk about kernel learning. And finally neural networks, and I'll say a bit about all of these. So the linear model is just what you'd expect. It's just fitting a line, basically, some kind of hyperplane. All you do is say my data comes in the form of some kind of vector, a vector that you're given or a vector that you prepare. So let's say it's images; then x is just a vector whose entries are all the pixels. Just numbers; if they're grayscale pixels then they're numbers between zero and one, or between zero and 255, something like that. Then all you do is say, okay, my model is that I take the input as it is and dot it into some unknown vector of weights w, and I could have a constant as well. And then I just need to figure out the best w that will do whatever task I'm trying to do. And this can actually work extremely well.
Here's an example from physics where this works very, very well: the so-called linear prediction method. It actually ties back a little bit into DMRG and tensor networks here, but it could be used with other methods as well. The idea is that you have some method that can generate very accurate data for some kind of time-dependent process. So this is some kind of quench, and then you measure, say, the expectation value of Sz as a function of time, and it's decaying. But then you reach some time that you can't afford to go much past anymore. So what you do is divide your data into some training data back here and some testing or validation data here, and then you just do linear fitting. You say, let me suppose there's a model where I can feed in some sequence of my points and have it predict the next one, and then I can run that forward by taking the next predicted point, using it as a new input, and walking it forward. And that's what these authors did. You can see the solid circles are the results of their linear prediction model, and they sit right on top of the exact results. They were picking a model here where they could actually get the exact results at all times, just as a check. So that's pretty interesting, and they actually have a theory in the paper about why it works so well. I think one of the points is that they use complex weights, so the model can capture oscillatory and decaying terms and superpositions of those. Anyway, the point is that this can actually be a very powerful model. So how would you actually train this linear classifier or linear model? I wanted to put this slide in because, if you're going to discuss how training works in machine learning, picking the easiest model makes it the easiest to explain. So remember, let's take the case of supervised learning.
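To make the idea concrete, here is a minimal sketch of linear prediction, not the authors' actual code: fit coefficients so each point is a linear combination of the previous p points, then roll the model forward by feeding predictions back in. The data and function names here are made up for illustration, using a pure exponential decay that a one-step model reproduces exactly.

```python
import numpy as np

def linear_predict(series, p, n_future):
    # Build a least-squares system from overlapping windows of length p:
    # series[i+p] ~ a . series[i:i+p]
    X = np.array([series[i:i + p] for i in range(len(series) - p)])
    y = series[p:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    out = list(series)
    for _ in range(n_future):
        out.append(np.dot(a, out[-p:]))   # predict, then feed back in
    return np.array(out)

# Toy data: a pure exponential decay, extended 10 steps past the "training" window.
t = np.arange(20)
data = np.exp(-0.1 * t)
extended = linear_predict(data, p=1, n_future=10)
```

For a single decaying exponential one previous point suffices; real quench data with oscillations would need a larger p (and, as the speaker notes, complex weights can help capture oscillatory terms).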
You can use the linear model for other things too. So let's say you have a training set with inputs xj, and then you have ideal outputs yj; that's the thing you'd want the model to give. Then you minimize a cost function where you penalize the function for having different outputs from the ideal ones. Let's take the outputs to be either plus one if xj is an A or minus one if xj is a B. And then we want to tweak f until we minimize this as much as possible. We may not be able to get it to go to zero; probably we don't even want to. So how do we do that for the linear model? Well, you just plug the linear classifier in here. I dropped the constant term, but it's not so important; you can absorb the constant term into x anyway, by having one of the components of x just be the number one and making x one dimension bigger. So you put the linear model into this cost function. I put in a little factor of one half for convenience; it doesn't matter. Then you take the gradient with respect to the weights. So if we take the derivative with respect to wn, that pulls the nth component of x out front, the two comes down, and then we have this term inside. And then you just update all the weights with the negative gradient times some empirical small step size. So you get the gradient, and you can see how this is efficient to compute: you input all your training data, you get this number, you multiply it by each coefficient of that vector, and you add these up, and that's just one number you get for each wn. Then you take that number, multiply it by some alpha that's like 0.001 or something, you just make it up, and you do this update a bunch of times. That's it.
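The update just described can be sketched in a few lines. This is a toy version with made-up two-class data, not the demo code shown later: the cost is the squared error (1/2) sum_j (w . xj - yj)^2, the gradient with respect to w is sum_j (w . xj - yj) xj, and alpha is the small empirical step size.

```python
import numpy as np

# Made-up toy data: class A clustered near (+1, +1) with label +1,
# class B near (-1, -1) with label -1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.0, 0.3, (50, 2)),
               rng.normal(-1.0, 0.3, (50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])

w = np.zeros(2)
alpha = 0.005                    # small step size, chosen empirically
for _ in range(200):
    grad = (X @ w - y) @ X       # sum_j (w . x_j - y_j) * x_j
    w -= alpha * grad            # step along the negative gradient

accuracy = np.mean(np.sign(X @ w) == y)
```

The classifier is then just the sign of w dotted into a new input; for well-separated clusters like these, a few hundred plain gradient steps are enough.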
And then if you want to get a little fancier, you can do things like decay the alpha, like have your program follow a schedule where alpha gets smaller and smaller, things like that. But it usually already works for this kind of simple model just to take alpha to be something small. You can mess around with different sizes of alpha and see what gives you good results. So let me show you a quick demo of that. This is actually using what's called the MNIST data set; I'll jump ahead and show you what that looks like. MNIST is actually a bunch of handwriting samples from the post office, and you can train a model to do pretty well at recognizing them. So these are actual MNIST digits. The NIST in MNIST is the National Institute of Standards and Technology; they sponsored this data set, which collects a bunch of scans of actual handwritten digits from the post office. So this is what you train on, and the idea is you want to get good at recognizing different handwritten digits when you go over to the test set, which consists of similar digits. So this is a little linear classifier code; I'll show you the code, actually it's not that long. This is just the thing that holds the data; I'm not going to go into much detail. This is a little conjugate gradient routine that I wrote. It's a little fancier than regular gradient descent, and it's only that long. Here's something that just reads in the data, wires up a bunch of vectors, and then throws them into the conjugate gradient; that's this line here. And then it just evaluates the performance, that's it. So it's not that much code. There's something at the end that turns it into a matrix product state. So let me show you what it looks like when it runs. There it is reading the data. Then you see the cost being printed out; it just goes down, down, down.
And then it can recognize 98% of the digits correctly, which is only getting about 1,200 wrong out of 60,000. So the linear model can do pretty well, although this is considered a rather easy data set. And then you try it over on the testing set, and it does almost as well. So that's remarkable: some very simple model can already get you 98% on a handwriting recognition task. That's already enough to use for a business or something like that. You could have a human come along and check all the cases it wasn't sure about, but you would already get 98% of them right most of the time. So I think that's pretty interesting, and that's some code you could write yourself. Just write it in Python or Julia and try that out. Yeah, that's a good point. So, I'm not sure; it depends on how you define that. I don't think it does so easily. Well, no, actually, I take that back, sorry. You could always at least do this: you could look at the outputs for different labels. I believe here there are actually different weights, one for each label, like there's one recognizing this label, that label, and so on. And you could see whether they're almost tied or whether there's clearly one output that's much bigger than the others. So you could at least do that. But how much bigger should it be? You'd probably have to do some theory and some testing on validation sets to figure out the correct thresholds and everything like that. But I believe, yes, you could do something reasonable already with a simple model. Okay, so that's just to show you that you can do a lot with a linear model. It's a great way to teach yourself the basics, and I just wanted to show you what it actually looks like to run a code that does these kinds of things.
Okay, so that's the linear model. So basically I would say always use the linear model. If you're gonna do any machine learning research, always use the linear model somewhere. Like just start with the linear model, see how it does. You may be surprised how well it does. And you may learn something about your data set. You may find out that you didn't prepare your data properly. Or it can be very helpful. But then when you need better performance, you have to move to a more complicated model class. So the first one I'll talk about, and this ties into the tensor networks the most, is kernel learning. And it's actually rather simple. So kernel learning is like saying that linear classifier doesn't cut it. So let's kind of beef it up in some way, right? So let's see how we can beef it up and make it more capable. So let's say we wanna separate data into two classes. And let's say the data looks like this and the blue points are these and the red points are these. But then let's say we are naive about it and we wanna use a linear classifier. It may be insufficient. So this is this classic XOR pattern where you can't separate reds from blues with a single line. It's just not possible for this data set, right? Because it's sort of like high dimensional information embedded in a low dimensional space. That's one way to look at it. So you try to resolve this problem by saying let's not just apply a linear classifier. Let's first kind of augment the data with extra features. Another way to say it is let's just apply a nonlinear, some kind of arbitrary nonlinear map to the data. And let's not have the model consume X. Let's have it consume this thing, phi of X. And we'll say a bit more about what that is. So you basically empirically map the data up into a higher dimensional space. So you just prepare your data in a way that you start with larger vectors X. And then now you have this higher dimensional space to work in. Now you try a linear classifier. 
And you are more likely to succeed because there are more dimensions to move in, so there are more ways you can draw a plane that cuts the data into the right classes. So the idea of kernel learning is really simple. You do something complicated to the data, but you only have to do that once, because you just have the data once at the beginning; it's like a constant. Then the thing you're actually going to spend all your time messing with, the weights, just enters linearly. So I would say this class of models is defined by the fact that the weights enter linearly. That's the important thing about it: the weights enter linearly, and everything else is nonlinear and complicated. And I haven't even said what phi of x is; the point is that you get to pick what phi of x is, so I haven't had to say that yet. Okay. So it's a linear classifier in a higher dimensional fictitious space you could call feature space. And how you make feature space is up to you. So some of the research is about how you should construct feature space out of the data so that this thing can succeed and have good performance. So let me give you an example of what's called a feature map, which is this phi map that takes you from the native space of the data up into feature space. A simple example of such a feature map could be, let's say our native data is just three dimensional, so these could be coordinates of things in real space or something. But let's say these points are really entangled with each other in real space, like in some kind of complicated swirl, and we still want to disentangle them from each other. We can give our model a better chance to do that by making a larger vector out of these components. So this larger vector phi of x will contain the number 1, so that we get a free constant in our model.
It'll contain the original vector x, but it'll also contain the pairs x1 x2, x1 x3, x2 x3. So now if we apply just a linear classifier to this, we can have extra terms in our model. If we have a coefficient of w that picks out one of these terms, we can have nonlinear combinations of the components of x; I think you all see why that would help. So basically our model can now be sensitive to correlations in the data, not just looking at each component separately. That's the important thing. And you can make all kinds of other complicated feature maps; I don't know if I have examples of other ones here. But one way to think about feature maps is that a feature map imposes a new distance measure on the data that's different from the original one, and you can pick different distance measures. I'll say one thing about that on the next slide. A classic one is to pick a weighted Gaussian distance measure, something like that. So, a few more technical notes about kernel learning. It's also called support vector machines, but that's really only technically right when you're using a certain kind of cost function; I think it's called the hinge loss, just a detail. So if you were wondering what that term means, it basically means you work with the model on the previous slide, but with a specific choice of cost function, training, and regularization. The name kernel learning comes from how, at least in most of the classic literature on this from the 90s and all the introductory review articles you read, what they'll tell you is that typically you can't really afford to work with your feature map directly. The feature map is just something in your mind. What you actually work with is the dot product of the feature map with different things plugged into it, and that's called the kernel matrix. The idea is that you say maybe the distances are easy to write down.
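Here is a minimal sketch of that feature map, with a made-up XOR-style data set: phi(x) = (1, x1, x2, x3, x1 x2, x1 x3, x2 x3). No single plane in the original space separates XOR-patterned classes, but a weight vector that picks out the x1 x2 product term does it immediately.

```python
import numpy as np

def phi(x):
    # Feature map from the slide: constant, original components, pairwise products.
    x1, x2, x3 = x
    return np.array([1.0, x1, x2, x3, x1 * x2, x1 * x3, x2 * x3])

# XOR-like toy data in the first two coordinates (third coordinate unused):
# one class at (+,+) and (-,-), the other at (+,-) and (-,+).
points = [np.array([+1.0, +1.0, 0.0]), np.array([-1.0, -1.0, 0.0]),
          np.array([+1.0, -1.0, 0.0]), np.array([-1.0, +1.0, 0.0])]
labels = [+1, +1, -1, -1]

# A linear classifier in feature space whose only nonzero weight
# multiplies the x1*x2 entry of phi(x).
w = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0])
preds = [int(np.sign(np.dot(w, phi(x)))) for x in points]
```

In practice you would learn w by the same gradient descent as before, just applied to phi(x) instead of x; the point here is only that the product feature makes the classes linearly separable.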
The distances are these numbers Kij in feature space, and those can be easy to compute even if the actual vectors defining feature space are not efficient to write down or work with. The key example would be a kernel K(xi, xj) that you just define to be e to the minus alpha times xi minus xj squared. That's a very common and popular kernel; it's called the radial basis function kernel. I'm not sure exactly what the name refers to, maybe something about the way Gaussians look. But that's a very common and very successful kernel that people pick. And the idea is that this implicitly defines phi: you can work out what phi is basically just by doing a Taylor expansion of this formula. People in kernel learning like to point out that for this kernel, phi is actually infinite dimensional. Of course, many of its coefficients are weighted by one over various factorials, but still, in some formal sense, it's an infinite dimensional feature space. Really, all that is saying is that you change your measure of distance, and then you train a model with this measure of distance instead of the original one. But when we get to the tensor network stuff, the idea there is that we will work with the feature map directly, and we'll pick one that lets you use tensor networks. That's the idea of that. Okay, and then one other thing that we might use later in the talk, if there's time, is this interesting fact called the representer theorem, which says that in this type of machine learning you actually know something about what the ideal weights look like. It's just that they're a linear combination of the feature maps of the training data. So the idea is that you have your training data, you blow it up into this higher dimensional space, and then the ideal weights live in the span of these training data feature vectors.
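A sketch of building that kernel matrix, K(xi, xj) = exp(-alpha |xi - xj|^2), on a small made-up data set (the function name and data are just for illustration):

```python
import numpy as np

def rbf_kernel_matrix(X, alpha=1.0):
    # |xi - xj|^2 = |xi|^2 + |xj|^2 - 2 xi.xj, computed for all pairs at once
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-alpha * d2)

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 2.0]])
K = rbf_kernel_matrix(X, alpha=0.5)
```

Note that the diagonal is always one (each point is at zero distance from itself) and the matrix is symmetric; the feature map phi never appears explicitly, only these pairwise numbers Kij.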
So even though it's a super high dimensional space, there's a much, much lower dimensional subspace where all the data lives, and the weights live there too. So you just have to find these alphas, and there are many fewer alphas than there are weights; that's the idea. One way that people typically work in this field is they immediately use this fact and then just work with the alphas for the rest of training. That's called the dual representation of the model, for those of you who want to know. But again, with the tensor network approach that I'll be talking about later, the idea is that the dual approach doesn't really scale, so we'll actually work directly with the weights. The weights will be in the form of a tensor network, so we can actually do that, whereas if you tried working with the weights directly it would normally be way too big. It'd be like working directly with a many-body wave function. So this is analogous to working with a many-body wave function by, say, summing Slater determinants, whereas I'll be talking about a different approach. Okay, and this really is a very popular field in terms of practitioners. A lot of people in academic settings still use kernel learning because there are a lot of advantages. Basically you can just immediately compute the exact solution of your model in many cases, as long as your training set is not way too big. So if your training set is a few thousand in size or less, you can just compute the ideal weights by diagonalizing some matrices, and that's it. So it works really well, but it doesn't scale that well. If you do it naively, it scales as the cube of the training set size. You can get this down to n squared or even n, but you start having to use fancy tricks.
So it's not as popular anymore with engineers, by which I mean people in Silicon Valley, because they're all about big data, and big data means n is like millions or even effectively infinite, and you just can't afford n cubed scaling. That's one big reason why people have switched to neural networks. Okay, so now let's switch to neural networks for just a few slides, to give you some background on those. Neural networks are the current darling of machine learning; everybody in Silicon Valley is using these, as you know. They're very often notated diagrammatically, but I just wanted to emphasize this isn't one of the tensor diagrams that I used yesterday and will use later. What these diagrams mean is a bit more empirical; I'll explain that on the next slide. So what is a neural network, basically? All it's really saying is that you make up some complicated class of functions that you define by how you compute them. What you do is take the input x, and here I'm making it three dimensional just so it fits well on the slide, but really it could be a hundred dimensional or something like that. So you take your input x, then you multiply it by some rectangular matrix, and that's called the first weight layer. That rectangular matrix is usually notated by a bunch of lines; the idea is that the lines are the coefficients of the matrix. You see this is going toward two numbers here, there are going to be two outputs, so this is a three by two matrix, W1. And the idea is that this line is the component you get when the row index is set to one and the column index is set to two; that line is the one-two element, that one is the one-one element. Each line is just one of the elements of that matrix. So the idea is that there's a little number riding on each of those lines.
So that's called the weight layer: you just multiply by a rectangular matrix and get, in this case, two resulting numbers. Then you take each of those two numbers and plug them into a nonlinear function. Which nonlinear function you pick is up to you, and different ones are popular at different times in the literature. Right now it's these ones called ReLUs that are popular, but before it was these things called sigmoids, which look kind of like, I don't know, is that like a Fermi-Dirac distribution? They're all some kind of rounded step function, basically. And you just put the numbers pointwise into those. Then you take the resulting numbers that come out; those are the neurons, by the way. Then you do another weight layer: you multiply by another rectangular matrix. Eventually the rectangular matrices get smaller and smaller, and you get a small number of outputs, either one or two or maybe 10 outputs. And that's how you evaluate the model. Usually you use a different nonlinearity at the very end; a very common one is that you put the outputs into exponentials at the end and normalize, and that's called a softmax output. So that's basically all a neural network is. The nonlinearities are called neurons, and sometimes they can have adjustable parameters too; sometimes they do, sometimes they don't. Other neurons that people use include tanh functions, or this thing called a ReLU, which is nothing but zero and then a line; that's all a ReLU is. And the terminology deep just refers to any neural net that has more than one weight matrix or weight layer, so this is already a deep neural network in that sense. And the number of neurons in each layer is somewhat ad hoc and arbitrary.
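The forward pass just described can be sketched in a few lines. The layer sizes and random weights here are made up for illustration; in a real network the weights would be trained by gradient descent on a cost function, just like the linear model.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)        # a ReLU is nothing but zero, then a line

def softmax(z):
    e = np.exp(z - np.max(z))        # subtract the max for numerical stability
    return e / np.sum(e)

def forward(x, weights):
    # Alternate weight layers (rectangular matrices) with pointwise neurons,
    # then a softmax at the final output layer.
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return softmax(weights[-1] @ h)

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)),   # 3 inputs -> 4 hidden neurons
           rng.normal(size=(2, 4))]   # 4 hidden -> 2 outputs
p = forward(np.array([0.5, -0.2, 1.0]), weights)
```

The softmax output p is a vector of non-negative numbers summing to one, which is why it's typically read as class probabilities.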
But there's a theorem that says if you have enough of them, you can represent any function whatsoever. That's one very powerful thing about them: you can always add more neurons and represent anything. And empirically they just work very, very well, especially for image data. One reason they work well for image data is that you do additional tricks, and one of the key tricks is convolutional layers. These are basically where you make the second or third layer of the network only sensitive to small patches of the previous layer. What you do is scan these patches through the data coming from the previous layer, and that defines the numbers coming out of the next layer. So it looks a little bit like a renormalization group process, where you do some kind of blocking, but they're overlapping blocks that you scan through the data. That's worked empirically very well for image data. And yesterday I mentioned this ImageNet paper, the one from 2012 where they just blew the doors off the performance of this really challenging image classification task. I believe Hinton was involved in that one, that's right. And these are his colleagues, Yann LeCun and Yoshua Bengio; they're the people you most hear associated with deep learning, so I thought I'd just mention them. LeCun was recently the director of Facebook AI Research, Bengio has a big group in Montreal, and Hinton goes between the Vector Institute and Google. Okay, so briefly let me mention that there are other model types in machine learning that are worth knowing about, but I won't say much about them. One of them is called graphical models, and this one is kind of similar to tensor networks from my point of view, but it came up in the machine learning and statistics fields.
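The patch-scanning idea behind a convolutional layer can be sketched in one dimension (real image networks use 2D patches and many filters, but the principle is the same): one small filter slides over overlapping patches of the input, and each output number is the dot product of the filter with one patch. The signal and filter here are made up for illustration.

```python
import numpy as np

def conv1d(signal, filt):
    # Slide the filter over every overlapping patch of the signal;
    # each output is the dot product of the filter with one patch.
    k = len(filt)
    return np.array([np.dot(signal[i:i + k], filt)
                     for i in range(len(signal) - k + 1)])

signal = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
filt = np.array([0.5, 0.5])          # a 2-point averaging filter
out = conv1d(signal, filt)
```

The same small filter is reused at every position, which is why convolutional layers have so few parameters compared to a full weight matrix, and why the overlapping-blocks picture resembles a blocking step in a renormalization group transformation.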
And it's different from tensor networks in that the entries, or the output of the model, are always interpreted as a probability, not an amplitude, like an actual probability. And they're also restricted to have non-negative parameters everywhere through the model. So you can't use things like the SVD when you work with them the way you do with tensor networks. Another class is Boltzmann machines. Many of you probably know about those from applications in physics. This is basically nothing but a random bond classical Ising model, but they're very powerful and have a lot of interesting theory behind them. And then decision trees: basically this is where you just make a bunch of forking paths. You say, okay, if this criterion is met, I'll go this way, otherwise I'll go that way. These actually do very, very well empirically on competitions and different kinds of challenging tasks. Another thing I just wanted to mention is, I've gone to some of these big machine learning conferences and met people in the field, and it's kind of interesting to see how the field is constructed. So there's different sub-communities. One of them is very academic and the papers often involve lots of theorems. Sometimes I think they even prove things that don't really require proof, but they just like to prove things. So there's a lot of theorems. Some of them are very deep though too. There's another community that's very engineering oriented. And, you know, these two communities kind of drive each other crazy. The other community, the engineering one, gets incredible results, but the developments are often a bit faddish. Like the papers will start by saying, we will switch to using this kind of neuron because it's popular. Or, you know, this famous person uses it, so we'll use it. And then, you know, the people in the first community will criticize them, but then they can't really argue with the results.
So it's interesting because there's a lot of good things happening, but from kind of different perspectives. One thing that's very important to know if you interact with this field is that conferences, conference talks and posters, are valued much more highly than journal articles. So that's kind of the opposite of physics. So if you actually get into a conference proceedings, that's like the very best thing. That's like getting a Nature paper. If you put your paper into the Journal of Machine Learning Research, that's like a PRB or something. So that's how that field works. So that's important to know. Otherwise you might get that backwards if you try to interact with them. And it's also just interesting to see how strong their industry ties are. So for example, in a lot of machine learning groups, professors have a hard time keeping students all the way through a PhD because they'll just get poached with some kind of six-figure salary halfway through their PhD or something. So it's actually a problem for them. And eventually the professors give up and they just leave too and get an even bigger salary. So that happens. Anyway, it's a very interesting field, and one thing I would say is that the field has a lot of really deep people in it who know a ton of really interesting math and statistics. So I think it's very worth talking to people in this field. So I just wanted to encourage you to interact with them. And I'll post this, I already did post the slides. So I encourage you to take a closer look at some of these resources. Some of these are extremely good. So this one in particular, these lectures by this professor at Caltech, Yaser Abu-Mostafa, or maybe some of you are at Caltech and can go take a course from him. It's all on YouTube and he's extremely entertaining. He always puts jokes into his slides and stuff. It's really good. So check that out. There's also a nice article, I should have updated the slide. It's called... I forgot the title.
It has this great title though; it's Pankaj Mehta and David Schwab's big review article aimed at physicists, all about machine learning. I think it's called "A high-bias, low-variance introduction to machine learning for physicists" or something like that. And then there's all kinds of examples and excellent blogs online that you can mess with yourself to teach yourself the techniques. Okay, great. So now let's switch gears and talk about using tensor networks in machine learning. So this will tie back into the introductory material from yesterday and today. Okay, so this is work with David Schwab, who is now a professor at CUNY in New York, who does stats and neuroscience but is very interested in machine learning very generally, so I just wanted to credit him for helping me get into this field. So the motivation of, like, why would I want to put tensor networks in machine learning? Why would anybody want to do that? The motivation is that tensor network methods are very, very powerful. They have really excellent algorithms for optimization. I mentioned yesterday DMRG, for example. And there's many other algorithms also. And these give extremely excellent results. So this is showing a recent paper just from this year where this group from Taiwan was studying predictions of this kind of famous Kane-Fisher scaling of transport through 1D systems with a weak link in them. And the idea is that they were looking at very, very long distance behavior and correlation functions and checking the predictions of this theory that's based on RG, and getting really excellent agreement. And this really takes very precise calculations to do. So I just wanted to point out that tensor networks are very powerful. They're also highly interpretable, because everything's linear in a tensor network. So it's really easy to kind of analyze the internal structure of what's going on. Not really easy, but you can do it, versus other things where it's just much harder.
So this is an example of some ongoing work by the group of Frank Verstraete and different coworkers. And what they've been doing is they've been kind of recasting a lot of what we know about 2D topologically ordered systems in terms of PEPS tensor networks, these 2D tensor networks. And so they've come to these understandings of familiar concepts, but using unusual tools, in terms of thinking of Wilson loops as living inside of the PEPS. And in that form they're much cleaner and they take the form of MPOs, which are matrix product operators. They're basically like matrix product states, but they have two lines coming out of every tensor instead of just one. And it gives a really easy understanding of what the topological spin of an anyon is and how you could think about calculating it. And these different conditions that sort of define different types of topological order. So just to say that that kind of understanding comes from the linear structure of a tensor network. And finally, tensor networks are widely applicable. So this is an example of a method that's actually been known for quite a while now, since 2007, called TRG. And it actually has nothing to do with quantum mechanics per se. It's just a method for classical systems. So the idea is that you write your partition function of a classical model as a tensor network, but one that's difficult to contract. So if you look at this tensor network, every one of these blue dots is some tensor, and it's basically like the transfer matrix trick for a 1D system. But these are like 2D transfer tensors, and you can think of the spins as being the indices. So if the index is one it means the spin is up; if it's two it means it's down. And then all the tensors in between do is they kind of mediate between the spins and apply the correct Boltzmann weight.
Then what you do is you basically explode this network out into all these little shards by doing singular value decompositions. Then you regroup the shards into larger tensors, and you do this over and over and over again. And you can see this loops: you get a new network of these larger blue tensors, you put that in here, and you just do this again and again and again. And you get very, very precise results. You can compare to the predictions of conformal field theory and you can get critical exponents to many digits very accurately. You can see the RG fixed point emerge just by looking at the elements of the tensor. You can do a lot, and it's all just for classical systems. So knowing all this, you would say, okay, basically quantum wave functions are just an example of a large tensor, and so are transfer matrices of a classical system. They're all just large tensors that just happen to occur in physics. But really tensor networks are something kind of bigger than just physics. They're really just a general math technique that just happened to be invented by a bunch of physicists. So are there really any more applications for them than just wave functions or some classical systems? Could those be the only two applications of tensor networks forever, or are there more applications? So of course there are more. And some other background, and this is switching to a slightly different motivation, is to say that machine learning has actually been benefiting from physics ideas for a long time, kind of forever. So Boltzmann machines are nothing but Ising models just applied to data. And there's ideas like the renormalization group that come up all the time in machine learning. So things like deep networks or autoencoders, and ideas about what's called PCA, which is nothing but the SVD, but done hierarchically, are very similar to the renormalization group. But this is 1920s and 1970s physics. Have we done nothing new since then?
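The basic TRG move just described, exploding a four-index tensor into shards with an SVD, can be sketched like this (a minimal illustration; the Ising local-tensor construction shown is one common choice, and all function names here are hypothetical):

```python
import numpy as np

def split_tensor(T, pair, max_dim):
    # "explode" a four-index tensor T[u, l, d, r] into two three-index
    # shards with a truncated SVD, grouping either (u,l)|(d,r) or (l,d)|(r,u)
    u, l, d, r = T.shape
    if pair == "ul":
        M = T.reshape(u * l, d * r)
    else:
        M = np.transpose(T, (1, 2, 3, 0)).reshape(l * d, r * u)
    U, S, Vh = np.linalg.svd(M, full_matrices=False)
    k = min(max_dim, int(np.sum(S > 1e-12)))   # keep the large singular values
    A = U[:, :k] * np.sqrt(S[:k])              # left shard, new bond size k
    B = np.sqrt(S[:k])[:, None] * Vh[:k]       # right shard
    return A, B

# one common way to build the local tensor of the 2D classical Ising model:
# factor the Boltzmann weight Wb = M M^T and attach an M to each of 4 legs
beta = 0.4
Wb = np.exp(beta * np.outer([1.0, -1.0], [1.0, -1.0]))
evals, evecs = np.linalg.eigh(Wb)
M = evecs * np.sqrt(evals)
T = np.einsum("is,js,ks,ls->ijkl", M, M, M, M)

A, B = split_tensor(T, "ul", max_dim=4)
```

In a full TRG step you would split alternating tensors along the two different groupings and then contract four shards back into one coarse-grained tensor, repeating until convergence.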
There's actually been a lot since then, and we could also try to see if there's some analogy in machine learning for some of the stuff since the 70s. And one of those things is tensor networks. So even more specifically, I was looking at my friend Juan Carrasquilla's notes when he was working on this stuff he was doing using convolutional neural networks to recognize phases of matter. And I kept asking him, I said, is that a MERA tensor network? And he goes, no, it's not. It's a convolutional neural network. And I said, yeah, but look, those kind of mixing layers, they remind me of disentanglers. And these pooling layers, they remind me of the isometries, the RG part of a MERA. And he's like, no, no, no, it's different. But I wanted to use a MERA somewhere in machine learning. So I thought, okay, I've got to use a tensor network in machine learning. I'm just gonna skip this slide, too much introduction. So then it made me wanna know, are tensor networks useful for machine learning? And the answer is yes. And if you use tensor networks in machine learning, you realize a lot of benefits already. And then there's some that I'm still working on, but I'm confident will come. So one benefit is that you take models that, to train them, would typically scale like the cube of the training set size or the square of the training set size, and you can do it with just linear scaling in the training set, and maybe even better. Another thing is that you can have adaptive weights. So you can train the model and, as you're training it, the number of free parameters in the model can automatically grow and shrink as needed to adjust to the data. So I think that's very interesting. You can also learn data features. I'll have a lot more to say about this. What I mean here is that you can take the data and pre-process it using tensor networks in a way that only extracts the information useful for learning and discards other information that's just not helpful.
And then future benefits, some of these are already being realized and some of these are more in the works, some more speculative, include better theory and interpretability. So really trying to understand: what has a model learned? What is a model sensitive to? Like, what inputs does it actually care about? What inputs could fool it? What inputs does it not even see at all? These are things that I think can be tackled very handily with these tensor network tools. Better algorithms, we already have some examples of that, with more on the way. And then quantum computing, which I'll say a bit about. It's basically that tensor networks just are quantum circuits already. So if you can get a tensor network to do machine learning, you already have a quantum circuit in hand that can do machine learning. So in a way, it's almost the enemy of quantum machine learning, because the more that I can do classically with tensor networks, that's one fewer thing we need a quantum computer to actually do. But then actually in the end, I think these things are friends, because you can push a tensor network up to a certain point and you're getting certain performance, but then you might run out of resources. So then you can switch over to a quantum computer, and in that point of view, a quantum computer is kind of like a special GPU or something, whose job is to contract big tensor networks, basically, in that point of view. Yeah. Does it have any non-linearity? So no, tensor networks don't have any non-linearity. They're all linear. That's right. So I was waiting for someone to ask this question. So it's a good question. So this may come up again later in more detail. But let me briefly say: people say that's where the power of neural networks comes from, the non-linearity, but the question is, compared to what? So the question is compared to a neural network that doesn't have any non-linear things in it. That would just be a matrix.
So if you had a neural network and you took out the neurons, it would just be a matrix of the size of your original data. So if your data is 100-dimensional, it'd be like a 100 by 10 matrix. So it's not very powerful. However, when we apply a tensor network, it'll also kind of be like a big rectangular matrix, but it'll be acting in an exponentially larger space. So that's the difference. So the idea is that first you map the data up into an exponentially bigger space than the one the neural network acts in, but then you can actually work in this space by using tensor networks, and then you get a lot of power that way. So you load all the power up front and then you deal with it that way. Whereas in a neural network, you're kind of getting the high dimensionality by putting non-linearities all throughout your model. In the tensor network, you do it all at the beginning. That's the difference. Yeah, it's a good question. Okay, so let me just say, first of all, that I'm not the first person to think about using tensor networks somewhere in machine learning. So there's a bunch of interesting work that's already been done, going back to around 2014, by some different applied math groups and also some people who used to work in physics as well. So some of these are students of Guifre Vidal's, actually. So some of this work includes things like compressing weights of neural nets, which I'll say a bit more about on the next slide, but also things like large-scale PCA. PCA again is basically just taking your data, putting it in a big matrix, and trying to diagonalize that matrix. But what if your data matrix is extremely big? You can use tensor networks to help you with that. So there's a nice paper about doing that. Also putting tensor networks into other kinds of machine learning models, like what are called Gaussian processes, and using them for some kinds of feature extraction tasks. So just mentioning there's this other work.
So that'll be in the slides, which you can download. Let me highlight one of these works just to show you something pretty different from what I'll be talking about, but just to give you a different flavor of the kinds of things you can do with tensor networks in machine learning. So this is just a couple of slides on this idea. So the idea is, let's say you do want to do neural networks, but you wanna do really big and expressive neural networks. So you wanna do a really wide model, meaning that the middle layer has a ton of neurons. Let's say you just wanna do thousands of neurons for some reason. You couldn't afford to do that normally, because you would have a really giant weight layer. So this group of Novikov et al. had this idea of, why don't we approximate this weight layer by a kind of tensor network? So you can kind of think of this as a matrix product state, but where every tensor has a leg going up and down, instead of just one going down or up. So this is called a matrix product operator. It is very similar to a matrix product state. And the idea is you just say, what if the weight layer could be approximated by that? Then you say, well, let's just assume that, put it in, and then train the weight layer in this form rather than in the original form. And with this technology, this group was able to train a neural network that had something like close to 300,000 neurons in the middle. And they were able to compress this with some kind of 80 times compression and only get within 1% of the very best results for that same data set. So they lost a little bit of performance but got a huge compression. So anyway, that's interesting. But I looked at this work and I was thinking, I would prefer a framework where the tensor network is the entire model. So rather than just being a piece of the model, or some trick to compress a piece of some other model, I think the tensor network could do the job.
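The compression idea just described, replacing a big weight layer by a chain of small tensors obtained from repeated SVDs, can be sketched like this (a simplified tensor-train factorization on a flattened index rather than a full matrix product operator with paired up/down legs; all names and sizes here are made up):

```python
import numpy as np

def tt_decompose(A, dims, max_rank):
    # factor the array A, viewed as a tensor with index sizes `dims`, into a
    # chain of small 3-index cores by sweeping left to right with SVDs
    T = np.asarray(A).reshape(dims)
    cores, r = [], 1
    for d in dims[:-1]:
        M = T.reshape(r * d, -1)
        U, S, Vh = np.linalg.svd(M, full_matrices=False)
        k = min(max_rank, len(S))          # truncating here gives compression
        cores.append(U[:, :k].reshape(r, d, k))
        T = S[:k, None] * Vh[:k]
        r = k
    cores.append(T.reshape(r, dims[-1], 1))
    return cores

def tt_reconstruct(cores):
    # contract the chain back together over the internal bonds
    out = cores[0]
    for c in cores[1:]:
        out = np.tensordot(out, c, axes=([out.ndim - 1], [0]))
    return out.reshape([c.shape[1] for c in cores])

rng = np.random.default_rng(1)
W = rng.standard_normal((16, 16))                    # a "weight layer"
cores = tt_decompose(W, (4, 4, 4, 4), max_rank=16)   # rank large enough: exact
W_back = tt_reconstruct(cores).reshape(16, 16)
n_params = sum(c.size for c in cores)
```

With a smaller `max_rank` the reconstruction becomes approximate but `n_params` can be far smaller than the original weight count, which is the trade-off Novikov et al. exploit; in their scheme the cores are then trained directly in this factored form.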
I think it could just be the entire model. Here's some kind of maybe sillier or more serious motivations about why. So one motivation is: could a natural image really even possibly be more complex than a quantum many-body wave function? I mean, a silly thing to say is, well, images are actually coming from quantum mechanics after all. So how could they be more complicated? I mean, that's more of a light motivation, I would say. But basically, these are some of the most complicated things we know about, and tensor networks can handle them. More seriously, we can import really interesting ideas from physics and maybe get some mileage in a different field by bringing in these interesting ideas. And best of all, we could export those ideas, and maybe a few years from now those ideas will come back to us in a different form, and we can learn from the machine learning community, who are very good at applied math and optimization, how to use these tools better in physics. And that could be very interesting. Okay, so now to be a bit more concrete about how I wanna do this. Like, how do you actually make a framework that does machine learning but where a tensor network is at center stage? So let's be concrete about it. Let's say we have some kind of inputs X, which are these vectors of numbers. And these could be some numbers that are normalized to go from zero to one. And they could represent the pixels of some kind of, say, grayscale images, like this handwriting data set. Just an easy thing to think about. Okay. And then let's propose the following model. So I will credit here another paper by Novikov that had the same idea the same week, actually, as us. But I liked how they explained this idea better, because the model is a little cleaner the way they wrote it down. So I like this way of writing it down. So they said, okay, formally this is gonna be one of these kernel learning models, but let's put that on hold for a second. It's easier to see it in this form.
So all the model is, and it's rather different from a neural network, it's actually just a high-dimensional polynomial. So the model is a polynomial where we're gonna label all the terms by binary strings. So we have all these numbers S1, S2, S3 up to SN. If they're all zero, then we don't get any Xs at all. We just get a constant. But now if S1 equals one, we get X1 and none of the other Xs. Or we can have S2 equals one, or S3 equals one, and we just get each of the Xs separately. Or we can get pairs of Xs. We could have S1 being one and S2 being one, or S1 being one and S3 being one. So we get pairs, triples, and so on. So it's all the different products of the Xs that we can make, but each X appears at most once. That's the idea. And then the whole idea of doing this is that the coefficients have to have the same labels as the terms. I mean, that has to be the case. But then they accidentally end up looking like a many-body wave function. They're not a many-body wave function, but they have the same structure. So they're just a large tensor with a huge number of indices that each go over a small range of values. So those kind of look like spin-one-half spins, and there's N of them. That's the idea. Okay. So it should be pretty clear already that this is a fairly expressive model. It has very high order terms in this polynomial, so it could fit a lot of things. And it's not too hard to extend it to have things like the squares or the cubes of the individual Xs as well. So this was an idea that was touched on by a few different people. We came up with something very similar. And here's the entire model written out for the case of a three-dimensional input. So now I'm being a little more formal in saying, okay, this model is of the form W dot phi of X. The phi is basically defined by this part, and then W is just the weights. So if you write this out, it's interesting to see what you get. So the first line is the linear classifier.
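To make the binary-string labeling concrete, here is the three-dimensional case evaluated two ways: once as W dot phi(X), with phi the tensor product of the vectors [1, x_i], and once term by term over all 2^3 binary strings (a toy check with made-up random coefficients):

```python
import numpy as np
from itertools import product
from functools import reduce

def phi(x):
    # tensor (Kronecker) product of the local vectors [1, x_i]:
    # component s = (s1, ..., sn) of the result is the product of x_i**s_i
    return reduce(np.kron, [np.array([1.0, xi]) for xi in x])

rng = np.random.default_rng(1)
x = rng.random(3)
W = rng.standard_normal(2 ** 3)     # one coefficient per binary string

# model evaluated as a dot product in the 2^n-dimensional space
f_fast = W @ phi(x)

# the same polynomial written out term by term over binary strings s
f_terms = sum(W[int("".join(map(str, s)), 2)] * np.prod(x ** np.array(s))
              for s in product([0, 1], repeat=3))
```

The two evaluations agree, and the coefficient array `W` is exactly the object with one two-valued index per input component, the thing that looks like a many-body wave function.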
So you get the linear classifier for free, and then the rest of it is just stuff that can do better. So you're guaranteed not to do any worse than the linear classifier. As we saw in this example of these handwritten digits, that already gives you up to like 98% performance or something. Then you just keep going by adding these other terms, and these other terms just help you to get all the way up to 99.9% or something. And there'd be a lot more of them for the case of higher-dimensional input. It goes exponentially with the dimension of the input. So we're gonna need some kind of tools to work with such a huge number of weights. Okay, and we can generalize this in a lot of ways. And one way that was closer to the paper I wrote with David Schwab was to say, we don't actually want powers of the Xs per se. We just want a set of non-linear functions applied to the Xs individually. So we take the components of X and we point-wise evaluate these non-linearities on them. This is kind of like neurons. It's a bit different, though, too. But then we have exponentially many different combinations of these functions we can take, all the different products. And then we add them all up with some unknown coefficients, and that's our model. And the goal is to find and train these coefficients, and we could represent a lot of really complicated functions this way. Because I haven't even said on this slide what the phis are. So they could be very complicated functions. The particular phis that we picked are kind of motivated by this picturesque idea of thinking of pixels as spins. It was just a motivation. We were just thinking, like, okay, maybe if you wanted to apply ideas from physics, it's helpful to use this mental crutch of thinking of a white pixel as an up spin and a black pixel as a down spin. Here I'm writing it as a Y vector and an X vector. And so maybe a gray pixel is like something in between.
It's like a superposition of up and down, if you want. Or you could just say, formally, it's just mapping a one-dimensional number to a two-dimensional normalized vector. That's all it is, right? But this actually leads to a very powerful model. And then the idea is to say that the total feature map, the thing that maps the data from the original N-dimensional space into this feature space, which will actually be exponentially bigger, two to the N dimensional, is a tensor product of all these functions. And all I mean is just a product, basically. But it's a tensor product if you think of this as defining a tensor with N indices. So this is nothing but a product state wave function. If you think of it as a wave function, it's just saying: take the data and represent it by product states. And it's formally a vector in a two to the N dimensional space, but it's a very easy one to represent. Of course you can easily store and represent this on a computer, even though you can't represent most things in that space efficiently in a naive way. So let me write this a few different ways, but we all know what product states are; basically it's just saying, okay, take the data and formally write it as this tensor product of two-component vectors. That's how that looks in tensor diagram notation. It's just the outer product of a bunch of two-component vectors. Again, what was the point of doing that? The point is that, let's say we had the simplest case where the function outputs a number. That'd be the case where you do two-class, binary classification, A or B. In that case we want the output to be a scalar, so we'd better have the weights have the same number of indices as the representation of the data. So there are the weights, dotting in with the data to make a number; that's our function. And so then the upshot is that the weights again look like a many-body wave function, formally speaking. They're just a big tensor with lots of indices.
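This product-state feature map can be sketched like this (assuming a cos/sin local map, which is one normalized choice consistent with the up-spin/down-spin picture; the function names are hypothetical):

```python
import numpy as np
from functools import reduce

def local_feature(p):
    # map a grayscale value p in [0, 1] to a normalized two-component vector:
    # white (p = 0) is "spin up", black (p = 1) is "spin down",
    # gray is a superposition in between
    return np.array([np.cos(np.pi * p / 2), np.sin(np.pi * p / 2)])

def product_state(pixels):
    # total feature map: the tensor (outer) product of all the local vectors,
    # a vector in the exponentially big 2^N-dimensional space
    return reduce(np.kron, [local_feature(p) for p in pixels])

pixels = np.array([0.0, 0.5, 1.0])    # three made-up pixel values
Phi = product_state(pixels)           # lives in a 2^3 = 8 dimensional space
```

Storing the local vectors costs only 2N numbers even though `Phi` formally lives in a 2^N-dimensional space, which is exactly why product states are the easy corner of that space.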
Okay, so then we can use tensor network tools. So it's something that's set up so we can use tensor networks well on it. Now there's a lot, you know, riding on whether this will actually work well or not. Some of you who are aware of these ideas of the area law, which I didn't get to touch on that much, but I think that'll be talked about a lot more in Uli's lectures later, and I think there was some earlier in the school, might wonder, you know, does this tensor have anything to do with a wave function? Maybe this thing is highly entangled, if you think of it as a wave function; maybe it's not at all like a ground state. So maybe it's not at all efficient to represent it by a tensor network. And that might be the case. But instead of trying to prove or disprove that, the idea we had was: let's just try it and see if it works. And it did, and now I have some arguments I can show you, too, about why we think it worked. But it may not always work. It may depend a lot on the task and on the data. But there's been some encouraging results so far, and I'll show you some more of those from other groups confirming this. The idea is that the main approximation, or the main further assumption we'll make about the model, is that it's reasonable to compress this tensor by some kind of tensor network. Most often a matrix product state, but it could also in the future be a PEPS tensor network or a tree tensor network or a MERA tensor network or some of these other kinds. Okay, and what do you gain from this? So right away you actually gain some interesting things. So one thing you gain is that you can import ideas from physics into this area. So you can take an algorithm very similar to DMRG and optimize this model that way. And what do you get from doing that? You get two things.
You get linear scaling in everything: linear scaling in the dimension of your input and linear scaling in your training set size, which is already much better than you typically get for models of this class, this kernel learning class where the weights enter linearly. And the idea is that here you can directly work with the weights, rather than all these tricks involving kernels and so on, which I didn't explain in great detail, but those tricks don't scale very well. So the idea is that you get linear scaling because what you do is you just attach each element of your training set to your weights and you calculate a gradient, and I'll show you how the gradient works on the next slide. Let me just show you how the algorithm looks at a very high level. So what you do is you say, okay, I wanna optimize this matrix product state of the weights, and the way I'll do that is I'll merge two of the tensors together. So this is called two-site DMRG, if you merge two of the tensors together. So you merge two of them together, and the idea is that this will let you optimize in a slightly enlarged space and hopefully find your way to a better solution, and you get to adapt the bond dimension, and I'll show you that in a second. So now what you do is you freeze these other tensors, so you freeze those, and then you just calculate updates on this one for a while. So you just calculate a gradient and apply it, or some kind of other optimization, and you improve this tensor for a while. Then when you're happy with it, you say, okay, that tensor's done. Then you use a singular value decomposition to break it into two pieces, and when you do, you expose this bond again, you discover that bond again, and that bond can now be bigger than it was before, or smaller. So it'll adapt automatically to solve the problem better and to give you more weights or to pull weights out as needed. Yeah. It is like a hyperparameter, that's right.
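The merge-and-split step can be sketched like this (a minimal illustration with hypothetical helper names; the singular value cutoff is what lets the bond dimension grow or shrink automatically):

```python
import numpy as np

def merge(A, B):
    # contract two neighboring MPS tensors over their shared bond
    # A: (left, phys, bond), B: (bond, phys, right)
    return np.tensordot(A, B, axes=([2], [0]))   # -> (left, phys, phys, right)

def split(theta, cutoff=1e-8):
    # break the merged tensor apart with an SVD; the number of singular
    # values kept above the cutoff becomes the new bond dimension
    l, p1, p2, r = theta.shape
    U, S, Vh = np.linalg.svd(theta.reshape(l * p1, p2 * r), full_matrices=False)
    k = max(1, int(np.sum(S > cutoff * S[0])))
    A = U[:, :k].reshape(l, p1, k)
    B = (S[:k, None] * Vh[:k]).reshape(k, p2, r)
    return A, B

rng = np.random.default_rng(2)
A = rng.standard_normal((1, 2, 3))   # bond dimension 3 going in
B = rng.standard_normal((3, 2, 1))
theta = merge(A, B)                  # the two-site tensor you would optimize
A2, B2 = split(theta)                # bond re-exposed, now sized by the data
```

In this tiny example the bond actually shrinks (from 3 to at most 2), because the merged tensor only supports rank 2 across that cut; in training, the same mechanism lets the bond grow when the data demands more weights.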
The difference here is it gets picked by the algorithm. So instead of you, the human, picking it, the algorithm picks it. Now you control it through another hyperparameter. For those of you who don't know that term, by the way, machine learning people use this term hyperparameter to just mean, like, an external knob on the algorithm, basically. So you instead pick some kind of cutoff; you would say, what's the threshold of the size of singular values that I consider large versus small, and that will automatically determine the bond dimension. But does that answer your question? Taking the bond dimension to be big, you said? It actually does. I mean, if you take the bond dimension sufficiently big you can represent any tensor. So then you can represent any model in this class. Now, that doesn't mean you have the right class of model, but that's a bit of a separate story, that part about whether you pick the correct feature map. However, I think tensor networks later down the line have something to say about that as well. So I'll show you later, if there's time, there may not be enough time, but I'll show you later that you can actually do an interesting thing where you can start with kind of an overly expressive feature map, and you can actually use tensor networks to automatically prune it down. So in the end you can actually kind of overpower your model and then have the adaptivity of the tensor network on its own figure out how to prune that back. So in the end you've essentially included every possible model and let the data steer you toward the subspace of models that is demanded by the data to solve the problem. I mean, this is sort of ongoing work, but I think these things can be tackled.
But the thing I can definitely say is that within a fixed model class, a fixed choice of how you map the data to product states, there always exists some large enough bond dimension such that this MPS can represent any weight tensor whatsoever. That's for sure true. But then whether that's enough to do the job you actually want to do is a separate but definitely important question that you're raising; you have to consider that as a separate thing. It's sort of up to the model choice, yes. Yeah, that's a good question. So one thing you can do is you can always apply the kind of classic things that people already do in machine learning, which are penalty terms in your cost function. So that's like the main thing people do, besides also just model selection, like how you pick this map. So you can put in a penalty term. I didn't even say what cost function I'm using. So what I'll use most of the time is just this kind of squared error cost function plus an L2 penalty, like just a norm penalty that penalizes W for having too big of a norm; just do that. That works, but there's interesting other things too. So one thing that ties back into kernel learning is: what feature map do you pick? That's a kind of regularization. If you pick this to be overly expressive, you might overfit. If it's not that expressive, you're protected from overfitting, but you also might not do that well. And then one other thing that's yet to be explored that much, but that I think is very interesting, is the role of the tensor network itself in regularization. That's not something that has been considered before, because these models haven't been looked at before. But the fact that there's a bond dimension here, the fact that the tensor is factorized, certainly has implications for regularization, but what those are is still to be worked out.
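The squared-error cost with an L2 norm penalty mentioned here looks like this (a minimal sketch; the tiny data set and the penalty weight are made up):

```python
import numpy as np

def cost(W, Phi, y, lam):
    # squared-error cost over the training set, plus an L2 (norm) penalty
    # that penalizes W for having too big of a norm
    preds = Phi @ W
    return np.sum((preds - y) ** 2) + lam * np.dot(W, W)

# tiny made-up example: two samples, two features
Phi = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
c_zero = cost(np.zeros(2), Phi, y, lam=0.1)   # all-zero weights: pure error
c_fit = cost(y.copy(), Phi, y, lam=0.1)       # exact fit: only the penalty left
```

The penalty weight `lam` is another of those hyperparameters: larger values push the training toward smaller-norm, simpler weight tensors.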
I mean, we know it's got to regularize in the sense that if this bond dimension is small there are definitely fewer parameters. So that's already a kind of regularization, but what the regularization is doing and what statistical properties it has is something that needs to be thought about. Did I answer your question? Okay, so there are all these ideas of regularization. I didn't talk about this too much, but the idea is that thing I said briefly on one slide: of all the functions that could fit your data, pick the simplest one. The analogous thing here is that we control the norm of W as we train it. I'm not saying that on the slide, but it's implicit. And then the more explicit thing is that we don't let this bond dimension grow too much. We keep it as small as we can to solve the problem at hand. If we let it grow and grow, we could just fit anything, and we might overfit. Okay, so the algorithm continues: you merge the next two tensors, optimize them together, freezing the other ones, break them apart, and so on, and you go down the line. So that's equivalent to DMRG except for what you do to that pair of tensors when you optimize it. In DMRG you solve an eigenvalue problem; here you solve whatever problem is appropriate to what you're doing. So if it's supervised learning, you do some kind of gradient or conjugate gradient update instead of an eigenvalue problem. Otherwise it's basically DMRG. Okay, so to show you a bit more about that, what about this subproblem? How do you actually do the gradient? Just one quick slide on that, to give you the flavor. The idea is that we want to update the tensor where we merge two of them together. We've contracted over their shared bond and we're going to freeze the other ones, so the other ones are just bystanders for this part of the algorithm. So you affix one of the training data product states. 
Well, you do this for every single one, but this is just an example for a particular one. And then you take the derivative with respect to B, and because everything here is linear, the derivative is very easy: you just remove B. So that's the derivative of that function, with x plugged in, with respect to B. And then you can contract these wings, this part, with the frozen parts of the MPS. You just do these contractions, contract over that line, that line, and that line, and you get a tensor with one index. And you do the same on the right and you get a tensor with one index. And these two are just the raw data tensors of those two sites, or those two pixels or whatever they are. And then that's the gradient. And then you just add that in with a small empirical learning rate alpha. And you do that for every different element of your training set. So you just do that at every bond, and if you do conjugate gradient there are some more steps, but basically this is the idea. And then you just keep doing that to improve this tensor B (that should say B) and get a new tensor B prime that has better performance. Then when you're done on that bond, you break it apart and go on to the next bond, and just keep doing that back and forth. So that's the idea. Questions about that? I know it's kind of fast, but just to give you an idea of how these steps work. So now, why should this idea work at all? One thing you could say is, okay, neural networks work really well, but they have this layered structure and they're very complicated. Tensor networks are good at representing ground states, but maybe the tensors that occur in these machine learning models that are good for classifying data are nothing like ground states, and maybe they should be very highly entangled. So why should this work? 
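Before getting to that, the single-bond update just described can be sketched in code. This is a deliberately flattened toy, not the real two-site MPS algorithm: I collapse the "wings plus local data" contraction for each sample into an effective feature vector `phi_eff[n]` (a hypothetical name of mine), so the merged bond tensor B is just a vector and the model is linear in it, which is exactly why the gradient is so simple.

```python
import numpy as np

def gradient_step(B, phi_eff, y, alpha=0.01, lam=0.0):
    """One gradient-descent update of the merged bond tensor, flattened
    into a vector B.  Because the model f(x) = W . Phi(x) is linear in B,
    the derivative df/dB for sample n is just that sample's effective
    feature vector phi_eff[n] (the data at the two active sites
    contracted with the frozen 'wing' tensors of the MPS)."""
    f = phi_eff @ B                        # model output for every sample
    grad = phi_eff.T @ (f - y) + lam * B   # d/dB of squared error (+ L2 term)
    return B - alpha * grad                # B' = B - alpha * gradient

# Toy check: recover a known tensor from data it generated
rng = np.random.default_rng(2)
phi_eff = rng.normal(size=(100, 6))
B_true = rng.normal(size=6)
y = phi_eff @ B_true
B = np.zeros(6)
for _ in range(500):
    B = gradient_step(B, phi_eff, y, alpha=0.005)
print(np.allclose(B, B_true, atol=1e-2))
```

In the actual algorithm you would then SVD the improved B back into two site tensors and move to the next bond.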
So this is one slide just to argue that at least these things have a shot at working, and the reason is very simple: they can always represent the linear classifier exactly with bond dimension two. So any MPS with bond dimension two, meaning the matrix dimension is two, can always represent the linear classifier model, which, as I showed you in that quick demo, got 98% on this handwriting example. So you're always guaranteed a kind of floor of performance at the linear classifier level, and then by growing the bond dimension bigger than two you can go much beyond that. That's the idea. So here I'm just showing an exact construction that tells you how you can actually encode a linear classifier into an MPS. I'm not going to unpack this construction that much, other than to say that if you remember from the end of the first lecture yesterday, this looks a lot like a single-particle wave function, where I had phi one on site one and the vacuum elsewhere, then phi two on site two and the vacuum elsewhere, and so on. It's actually the exact same construction. So the idea is that you can imagine a little particle moving through the system: it goes over every pixel, and it doesn't look, it doesn't look, it doesn't look, and finally it gets to the pixel it wants to look at, and it looks at that pixel and applies a weight, like v seven on pixel number seven. So you can think of the wave function as actually living in a single-particle space, and basically a single-particle wave function is a linear classifier. So then if you have more particles, it's like having a superposition of multiple linear classifiers that can run around and take products of the things that they see. That's the idea. And that's just a motivational slide to say that you're always guaranteed to do at least as well as a linear classifier, and more. So yeah. Yeah, we've looked at that a little bit. 
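As a quick numerical check of the bond-dimension-two claim above, here is one concrete way to write the construction down. The specific conventions (local feature map phi(x) = (1, x), the nilpotent "hop" matrix, the boundary vectors) are my own choices for the sketch; the slide's construction may differ in details, but the mechanism is the same: a nilpotent off-diagonal block accumulates the sum over sites.

```python
import numpy as np

def linear_mps(v):
    """Encode the linear classifier f(x) = sum_j v[j]*x[j] as an MPS
    with bond dimension two.  Local feature map: phi(x) = (1, x).
    A[s] is the 2x2 matrix for feature index s at each site."""
    E = np.array([[0.0, 1.0], [0.0, 0.0]])     # nilpotent 'hop': E @ E = 0
    I = np.eye(2)
    return [np.stack([I, vj * E]) for vj in v]  # A[0] = I, A[1] = vj * E

def evaluate(mps, x):
    """Contract the MPS against the product state of phi(x_j) = (1, x_j)."""
    left = np.array([1.0, 0.0])                 # boundary vectors pick out
    right = np.array([0.0, 1.0])                # the upper-right matrix entry
    for A, xj in zip(mps, x):
        left = left @ (A[0] + xj * A[1])        # phi^0 * I + phi^1 * vj * E
    return left @ right

rng = np.random.default_rng(3)
v, x = rng.normal(size=7), rng.normal(size=7)
print(np.isclose(evaluate(linear_mps(v), x), np.dot(v, x)))  # True
```

Because E squares to zero, the product of the site matrices is I plus (sum of v_j x_j) times E, which is exactly the "particle that looks at one pixel" picture.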
It's interesting to think about. So yes, the feature maps I showed were all kind of spin-one-half, in the sense that you just get to see two pieces of information about every pixel, say if it's an image, whatever that x component is. So you could certainly expand this. Going back to that form of the model where it was x to the power of zero or x to the power of one: instead of having one and x and that's it on every site, you could have one, x, x squared, x cubed, and just keep going. Or you could have cosine, sine, some other function, some other function. You could have cosines and sines with different k's inside and have more complicated Fourier components, or something like that. So that's something worth looking into. You get into the danger of overfitting, though, so then you have to work harder to regularize the model, by penalizing the norm more heavily or other strategies. That's the trade-off. Okay. One other thing you can do is extend this idea to multiple outputs. Neural networks have multiple outputs: at the end you can output multiple labels. Same thing here. You can basically stick an external index on the tensor network. It's very natural in the case of tensor networks that have a tree structure, like MERA, because you have a top, and the top can have an index sticking out of it. MPS is a little more awkward looking, but basically you can put an index somewhere in the network. And the idea is that the data comes in the bottom, the MPS evaluates it, and the result is a vector over this index L, and you look at the outputs, and the biggest one, for example, could mean that's the correct label, depending on how you train things. 
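The external label index just described can be sketched directly. This is an illustrative toy of my own, not the paper's code: one tensor in the chain carries an extra leading index of size L, and contracting everything else against the product-state data leaves a length-L vector of scores.

```python
import numpy as np

def evaluate_labeled(mps, phis):
    """Contract an MPS against product-state data.  One tensor in `mps`
    carries an extra leading label index (ndim == 4); the result is a
    vector over labels, and e.g. argmax gives the predicted class.
    Shapes: ordinary site (d, Dl, Dr); labeled site (L, d, Dl, Dr);
    boundary bond dimensions are 1.  phis[j] = phi(x_j) for site j."""
    env = np.array([1.0])                         # left boundary, bond dim 1
    out = None
    for A, phi in zip(mps, phis):
        if A.ndim == 4:                           # the labeled tensor
            M = np.einsum('s,lsab->lab', phi, A)  # keep label index l open
            out = np.einsum('a,lab->lb', env, M)
        elif out is None:                         # sites left of the label
            env = env @ np.einsum('s,sab->ab', phi, A)
        else:                                     # sites right of the label
            out = out @ np.einsum('s,sab->ab', phi, A)
    return out[:, 0]                              # right boundary, bond dim 1

# Tiny example: 3 sites, local dim 2, bond dim 2, 4 labels, random tensors
rng = np.random.default_rng(4)
mps = [rng.normal(size=(2, 1, 2)),
       rng.normal(size=(4, 2, 2, 2)),            # label index L=4 in front
       rng.normal(size=(2, 2, 1))]
phis = [np.array([1.0, 0.3]), np.array([1.0, -0.7]), np.array([1.0, 0.5])]
scores = evaluate_labeled(mps, phis)
print(scores.shape)  # (4,)
```

Note how the tensors to the left and right of the labeled site are shared across all labels; that is the feature-sharing point made next.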
What's interesting is that you have this idea here of feature sharing, which is that all the different models for all the different labels, if this were a giant tensor, just a bunch of weights, would be totally different weights for every different label you want to recognize. But once you factorize the weights, you can take these tensors and these tensors to be the same for all the different labels you want to evaluate, and only this tensor differs for the different labels. So the idea is that you're basically processing most of the data into some kind of common reduced features that can be fed into this middle tensor, and then those reduced features are what lead to the classification. Okay, so does it actually work? Yeah, so here's an example showing an experiment we did back in 2016 on this handwriting classification task. So we took these 60,000 training images and unrolled them into 1D just to use a matrix product state. And some of you, your hackles may go up, and you may say, why would you take this 2D image data and use a matrix product state? The idea was that we wanted to do the simplest thing, because if the simplest thing works, then how much better might the more complicated ideas work? So we wanted to start simple. So we just said, let's rasterize it, kind of snake through it, unroll it into a 1D chain, and then make one of these matrix product state models. So we represent all the pixels as a product state, attach it to this MPS, train the MPS, and we get 99.9% accuracy on the training data, and that drops only to about 99% on the test data. So that's only 97 wrong out of 10,000. We were very happy with that. And we got into this conference, the so-called NIPS conference, which is like the big machine learning conference, so that was a lot of fun. Right, so what I'm excited about in this topic is not really the performance. 
It's more the ideas that I'll mention in a minute. Especially since this data set, this handwriting data set, is pretty saturated. A lot of things can now get 99%. And it's getting to where it's not even meaningful to go past 99% in some sense, or 99.5%, because at some point it starts turning into what's called snooping, where even if you're not snooping by actually training your model on the test data, you're kind of publication snooping, in the sense that you'll try a bunch of ideas until one of them works, and only publish the one that works. So it's like gradient descent on papers after a while. So after a while, it stops being a meaningful test. It doesn't tell you much about which models are really good; it's more that you have survivorship bias if you're only publishing the ones that work, so it starts to get into that game. For us, it was meaningful because we just wanted to know, can this work at all? Could we do something as crazy as take a 1D tensor network, scan over a 2D thing, and see reasonable results? If we had not been able to perform similarly to all the other state-of-the-art things, it wouldn't have been as clear that this might be worth pursuing. But the more interesting thing about it isn't to beat anything else. It's more to roughly tie other approaches, but then have a lot more to say about algorithms and interpretability and this kind of thing. That's the more interesting side, I think. So since then, I've been encouraged to see that there's been some other interest in this idea: some people who were already thinking about it before us, some people who got interested in it afterward. And so there have been some nice papers coming out of purely machine learning groups. These three papers are from this one group at Hebrew University that's just thinking about machine learning; they're just in the CS department. 
And so they have this one paper that's kind of interesting, Deep Learning and Quantum Entanglement: Fundamental Connections. The idea they're trying to convey is that maybe ideas physicists have been talking about can be broadened and used to understand other things. So physicists have been talking about entanglement, and maybe this can be generalized into these ideas of tensor rank. You can think about higher-dimensional notions of rank: instead of just thinking of matrix rank, you can think about all these bonds of a tensor network, and we've learned a lot in physics from thinking that way. So they're trying to say, can we export these ideas to machine learning in some useful way? Also the ideas have been extended to generative modeling, which I'll touch on a bit. This means not just trying to recognize data, but generating new data that's similar to the training data. So this was done very successfully by some groups, along with further work on supervised learning, which is assigning labels, with different kinds of architectures. So let me talk about some of those works in a few slides. First, let me also mention that there are even some companies that are now starting to use this technology. So one company is actually more interested in drug discovery, but their co-founder, Vid Stojevic, came out of a physics group at University College London, and he was actually in a tensor network group before that. So he's using tensor networks within their company, and you see the name of it has tensor in it. I'm not sure exactly if they're totally wedded to using it, but that's that company. And then another company that's very interesting, right up the street from us in New York City, is called Tunnel, run by this ambitious mathematician named John Terilla, who is actually on leave from CUNY right now for this startup, where they're training matrix product states to recognize language. 
So they're trying to train matrix product models to do natural language processing, and they're having some successes with that. So it's interesting to talk to them. Okay, so let me highlight some of those works I mentioned, just to show you some of the different directions you can take these ideas. These might inspire some of you to do some interesting things. So one of these is, I don't have an updated reference, but this paper is now in Physical Review X actually. This is a paper using wave-function-like ideas to do generative modeling, and the wave function parameterization they picked is tensor networks. You could do the same ideas with other ansätze or wave functions too, but the question they asked was, could we do state-of-the-art generative modeling using what are called Born machines? These are basically models where you have some kind of model, but you square it to get the probability. And that has a lot of benefits, actually. It sounds like a simple thing, but it's not something that would be obvious to do for any purely mathematical reason; it seems like a weird kind of quantum physics motivation. But once you actually do it, you realize all these benefits. Some of them are very simple. One is that you don't have to worry about whether your model has minus signs or complex numbers in it: the square cleans that up for you. And that sounds like a silly problem, but it's actually dogging a lot of fields of statistics and machine learning, where they have to tread very carefully with their optimization algorithms so as never to introduce minus signs. And there are all these steps that would be the optimal step to take next, but they have to go around that step because it would introduce minus signs, so they take a suboptimal step instead. But this group is saying, why don't we just square the model? Then we can do all these optimal algorithms directly, no problem. 
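The squaring trick is simple enough to show in two lines. This is a generic sketch of the Born-machine idea (amplitudes squared and normalized), not the paper's actual MPS implementation.

```python
import numpy as np

def born_probabilities(amplitudes):
    """Turn an arbitrary real (or complex) amplitude vector into a valid
    probability distribution by squaring: p_i = |psi_i|^2 / sum |psi|^2.
    Minus signs in the model are harmless: the square removes them."""
    w = np.abs(amplitudes) ** 2
    return w / w.sum()

psi = np.array([0.5, -1.2, 0.3, -0.8])   # negative entries are fine
p = born_probabilities(psi)
print(np.all(p >= 0), np.isclose(p.sum(), 1.0))  # True True
```

For an MPS the amplitudes are never written out as a full vector like this; the normalization and marginals are computed by contractions instead, which is what makes the approach scale.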
And one of the things they also do is take advantage of a certain property that matrix product states have, which is that when you think of them as probability distributions, you can do what's called perfect sampling of them, meaning that you can take one sample and then the next sample will be totally uncorrelated with the previous one, except through the correlations of the probability distribution itself. There won't be any autocorrelation; it's not a Markov chain at all. So you just draw independent samples as much as you want from the model. And that's very important in machine learning, because very often people encounter multimodal probability distributions, where you have one peak over here and then another peak way over there, far away. This gets around that directly. So the idea is that once they train their model to generate digits, they can just sample from it, and the first sample would be a nine, then the next sample would be a three, then the next two samples would be nines, then the next one would be a seven. But if you take other things like these Boltzmann machines and try to do this kind of block Gibbs sampling, you'll get a nine, a nine, a nine, a nine, a nine; you'll get like ten nines in a row, then you might get a two for a while, then you might switch to a one for a while. Here you just jump around to every digit totally randomly. So it's actually very powerful for the tasks people want to do. Okay, so another paper that had some aspects I thought were interesting was a paper by a group that studied supervised learning, but using these kinds of hierarchical tensor networks. These are like tree tensor networks that they studied. And what I thought was the neatest thing in the paper was that they looked at what the data looks like in its original form, when you just represent it by product states and do some kind of distance measure. 
And they did some kind of low-dimensional embedding of the distances. This is at layer zero. Then you go look at layer one: now what does the data look like? Two, three, four, five, six. And you see at the top that what the model has learned to do is disentangle the data. The model has pulled the data apart and gotten it into two clusters, so that at the very top it's ready to be labeled: those are A, those are B. So it gives some insight into what the model is doing. And I like this, and I would like to see more work along these lines. I think there's a lot more one could do to really understand precisely what each of these tensors is doing, and more so than, say, in a neural network, because these are all linear maps. Because they're linear, there's a lot of analysis one can do. In fact, it's even better than that: they're not just linear, they're actually isometric maps, meaning they're unitary in one direction. So if you call one of these things U, then they have the property that U dagger U is the identity. So they can all be interpreted as nothing but rotations and projections. So you can ask questions like: if I pick that tensor, what subspace does it project onto, and what subspace does it project out? And then you could know that there are certain inputs that the model at the top simply cannot see whatsoever, and other inputs it is very sensitive to, and you could study this all very carefully. So I think that's where this could be pushed a lot more. This is another interesting paper by this machine learning group at Hebrew University. They studied tree tensor networks as well, and they were trying to argue that different choices of what they call the ranks, which we would call the bond dimensions, of these tensor networks give different so-called inductive biases. And that means they're better or worse suited to different kinds of data. 
So they did these two tasks, which they called a global task and a local task, where they had digits that were randomly moved around in the plane. But in one case the digits were very large and took up a lot of the field of vision; in the other case they were very small. And they found that in the one case, a model where the bond dimensions started out small and then got bigger worked better; in the other case, a model where the bond dimensions started big and then got smaller worked better. So basically, just by choosing different bond dimensions, the models can be better or worse suited to different tasks. And what they didn't mention in the paper, which is nice to know, is that this can all be done automatically. Instead of having to pick the bond dimensions to be larger at the bottom and smaller at the top, or vice versa, you can just do DMRG on a tree straightforwardly, and that will figure out those bond dimensions for you. So that's great, because tied with this paper, it means that DMRG is figuring out the correct architecture for the task. That's what you could conclude, or at least suspect could happen. So that should be tried. Now, something that was a bit different, but that I quite liked, was this paper just from this year using a generalization of matrix product states, these things called matrix product operators. I've mentioned them a few times now: it's basically where you have two legs on every tensor, going up and down, not just one leg. They used this for what's called sequence-to-sequence modeling, where you input one sequence, like some kind of time sequence or word sequence, and you want the output to be another sequence. So the input could be, say, a question, and the output could be an answer. This could be like talking to Siri, or the input could be historical stock price data and the output could be future prices, something like that. 
So what they did is they said, let's assume our model is an MPO tensor network, so it's like a double-legged MPS with legs going up and down, and let's train it by attaching inputs to the bottom and outputs to the top, both in the form of product states. We want the overlap to be big when we put the correct output on top for a given input on the bottom. Then when you use the model, you put new inputs on the bottom, and the output is something that looks like an MPS; it's an entangled output. What this group did is they just thinned it back down to a product state and said that's the output. And they got really good results: they were able to beat state-of-the-art recurrent neural network approaches, even after they partnered with some practitioners in that field who tried to help beat it. What I think they should have done, though, which they didn't do but I mentioned to them, is that instead of thinning this output MPS down to a product state so they could interpret it as one of their training outputs, they should have sampled from it. You could think of the output as a probability distribution, and think of multiple outputs from a given input, each with different probabilities. I think that would be really interesting: you ask a question, and then you get different answers depending on which samples you take, so maybe you get slightly different wordings of an answer or something like that. So it should be investigated more. 
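The basic contraction behind this setup, attaching a product-state input to the bottom legs of an MPO, is easy to sketch. This is a minimal illustration with shape conventions of my own choosing, not the paper's code; the point is just that the result has the form of an MPS over the output legs, which is why the output comes out entangled.

```python
import numpy as np

def mpo_times_product_state(mpo, phis):
    """Attach a product-state input to the lower legs of an MPO.
    Each MPO tensor has shape (d_out, d_in, Dl, Dr); contracting the
    d_in leg with the local input vector phi leaves an MPS tensor of
    shape (d_out, Dl, Dr): together, the (generally entangled) output."""
    return [np.einsum('i,oiab->oab', phi, W) for W, phi in zip(mpo, phis)]

rng = np.random.default_rng(5)
mpo = [rng.normal(size=(2, 2, 1, 3)),   # boundary bond dims are 1
       rng.normal(size=(2, 2, 3, 3)),
       rng.normal(size=(2, 2, 3, 1))]
phis = [rng.normal(size=2) for _ in range(3)]
out_mps = mpo_times_product_state(mpo, phis)
print([A.shape for A in out_mps])  # [(2, 1, 3), (2, 3, 3), (2, 3, 1)]
```

From here one can either compress the output MPS down to a product state, as the paper did, or treat it as a Born-machine distribution and sample from it.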
Here's the thing I may have time to talk about in more detail at the end; we'll see how the time is going. This is a paper I wrote in January called Learning Relevant Features of Data with Hierarchical Tensor Networks, and the idea here is: can we do other things, rather different from supervised learning, with tensor networks? Instead of just saying, here's a model with some weights and we train the weights, can we take other interesting ideas from physics and import them wholesale into machine learning? And one of these ideas is real-space RG, a kind of principled, reversible real-space RG with tensor networks. So the idea is that I take some data as product states, just in the form from the earlier slides on supervised learning, but instead of doing supervised learning on it, I take the data and bundle it up into something like a many-body density matrix. That's this thing here, and if I have time I'll talk more about it. You can actually justify this entirely outside of physics, because this thing is related to what you could call the feature space covariance matrix, and it's related to the technique called PCA. So you can justify this construction totally independently of any physics motivation, but from the point of view of physics it looks like a many-body density matrix of a mixed state that's a sum of outer products of product states. But again, you can totally justify this mathematically as well. Then what you do is a real-space RG that coarse-grains this density matrix in a way that generates all these tree tensor network layers. So a tree tensor network is basically this construction where you have three-index tensors, two indices that attach to the physical sites and one that goes up, at the first level; then at the next level you think of these outgoing indices as new physical sites and you put another layer of 
tensors on top, and you keep going. So you can basically create this tree tensor network, and you can think of it as taking the raw data mapped into feature space, coarse-graining it a certain number of steps, and then having a reduced representation of the data up here. Then you can do whatever you want with that reduced representation. What I did mostly is put an MPS on top and train that MPS to do supervised learning, but everything in the middle was trained without any knowledge of the labels. So it was automatically just studying the data, basically projecting out directions in feature space that were not useful for understanding the data and only keeping directions that were useful (that's why I said relevant features), and then using those relevant features to do supervised learning. And I got very good performance on a much more challenging task than that handwriting task: this other task called Fashion-MNIST, where you actually have to recognize images of clothing. Most of the state-of-the-art methods get around 89%, and I got 89% also, so it's very comparable to things like convolutional neural networks and decision trees and other methods. So that was encouraging, and there are some other little tricks that I put in as well. Okay, one other paper, kind of building off this result on Fashion-MNIST. So on this task, typical good performance is 89%; if you look at most of the methods, they're all in that pack, but there are one or two outliers that do better. And one of those outliers is another tensor network, actually. This is a very recent paper by the group of Ignacio Cirac, with Ivan Glasser as the lead author. They studied generalizations of tensor networks known as string bond states and entangled plaquette states, and some of these, by the way, get very close to Boltzmann machines; some are very similar to Boltzmann machines. But the way they were thinking about them is more like 
groups of tensor networks that can be kind of folded together, and by using stochastic optimization on them, they were able to tackle these very challenging-to-train tensor networks. So these are really starting to approach training things like PEPS, even. And they got state-of-the-art performance on this challenging Fashion-MNIST recognition task. The only other model I know of that beats theirs is one particular deep neural network that has something like a hundred layers, this thing called GoogLeNet, and that gets like 93% while theirs gets 92.3%, so it's very, very close. Another thing I liked in this paper is that they actually did some learning of the local features from the data. They said, let's let the data tell us how to map it into a product state, based on what gives the best performance. And interestingly, what they found is very close to the thing that me and David Schwab picked in our paper from 2016. So that's the thing we picked, and that's what the data wanted anyway, so somehow we had already guessed a good choice. I just thought that was kind of fun. Okay, so that was just some different works people have done. Any questions about those before I switch gears to quantum computing? 
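Going back for a moment to the feature-space covariance matrix mentioned a couple of slides ago, here is a dense, non-tensor-network sketch of that idea: build rho from the feature-mapped data, diagonalize it, and keep only enough eigenvectors to capture most of its trace. In the paper this is done locally, two sites at a time, which is what generates the tree layers; the sketch below is the flat analogue (essentially PCA without mean subtraction), with names of my own choosing.

```python
import numpy as np

def relevant_features(Phi, trunc=0.999999):
    """Diagonalize the feature-space covariance matrix
    rho = (1/N) * sum_n Phi_n Phi_n^T and keep just enough leading
    eigenvectors to capture a fraction `trunc` of its trace.
    Returns the reduced data and the isometry U (U.T @ U = identity)."""
    rho = Phi.T @ Phi / Phi.shape[0]
    evals, evecs = np.linalg.eigh(rho)
    evals, evecs = evals[::-1], evecs[:, ::-1]     # sort descending
    cumfrac = np.cumsum(evals) / evals.sum()
    keep = int(np.searchsorted(cumfrac, trunc)) + 1
    U = evecs[:, :keep]
    return Phi @ U, U

rng = np.random.default_rng(6)
# data that secretly lives in a 3-dimensional subspace of a 10-dim space
Phi = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 10))
reduced, U = relevant_features(Phi)
print(reduced.shape, np.allclose(U.T @ U, np.eye(U.shape[1])))
```

The isometry property of U is the same rotations-and-projections structure discussed earlier for the tree tensors.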
Yep, so I actually just let the norm float. So it is an MPS, but it's not a wave function per se, so it doesn't have to be normalized in the case of supervised learning. If you're doing generative modeling you might want to normalize it, because that's like setting the total probability to one, but if it's just supervised learning, it's just empirical weights that are there to dot with your data and give you some output, so the norm can be different from one. But in practice you don't want the norm to float too much, because what it likes to do is blow up: it'll basically grow and grow the more steps you do, up to huge numbers, and that's not actually doing anything for the performance, it's just an artifact of the training. So what we do is put a weight penalty in the cost function, so that if the norm gets too big it starts to increase the cost function, and that goes into the gradient as well. Yeah, that's the idea. Oh yeah, in that example, that's right. So in that example you can always, at every step, work within the norm-one manifold: you can make your gradient orthogonal to whatever tensor you're working with, and then it won't change the norm, at least to leading order; it'll change it a little bit, but then you can correct it by hand. There are other things you could do too: you could just let the norm drift and then normalize it at the end if you want; that might also work. So it's not the easiest thing to handle, it is a bit of a headache, but there are good ways to handle it, so it's a good question. Okay, yes, that's a really good question, thanks. So yes, I would say that question has some important aspects, in the sense that although I'm making these analogies to physics, or we want to make these analogies because we're used to tensor networks being wave functions or other things, one key thing to keep in mind is that some 
of the machine learning stuff is much easier than physics, and that's good. It means that things that have been only working okay in physics might be super powerful in machine learning. Like PEPS, for example. PEPS tensor networks are a bit challenging to use in physics; they have these issues, but part of the problem is that you're trying to optimize over a Hamiltonian, which has to go in between two copies of your PEPS. But in the supervised learning setting, all you want to do is attach a product state onto the PEPS, and that erases the physical indices. I should draw this. So you have a PEPS, which I'll just draw as a grid; I'm not going to draw the circles or anything, so everywhere the lines cross, it's understood that there's a tensor, and then it has these physical indices coming up. But that's just a weight layer, and what I do is attach a product state to it. This may not be exactly your question, but I just wanted to mention it. So now I attach a product state onto it, and when I do, I have a single-layer tensor network. That's it. I just get some kind of thing like this, and this looks like one of those classical partition function networks that I mentioned briefly at the beginning. It's just a number that I have to calculate, and we have all these different strategies to calculate it. One way is to view this row as an MPS and that row as one of these MPOs, one of these matrix product operators, and you just multiply this times that and march it through, and you get the number. So there are things that are much easier to do in machine learning, say with PEPS, than they would be in physics. Now, in terms of, I think your question was more like, what is the analogy between what's going on here and something in physics. There doesn't have to be one exactly. It's a good question, but maybe the best thing to say is that there 
doesn't have to be that much of an analogy; it's more like repurposing a powerful tool from physics. On one level it's just different: finding the ground state of a Hamiltonian is an eigenvalue problem, while these are just some minimization problem — a different problem. Of course, you could minimize something like an energy; there's not a Hamiltonian here per se, but there are structures like in physics. There is a density matrix in some cases, or you can make up a thing like a density matrix with this, like I mentioned; it can be perfectly justified. And now this does get into some interesting things. When I say tensor networks have a lot of interesting theory — I haven't had time in these lectures to go into that — one of the bits of theory is that tensor networks let you discover what are called parent Hamiltonians. This is the idea that you give me a tensor network and I can constructively produce a Hamiltonian for which that tensor network is the ground state, and in fact a frustration-free Hamiltonian for which it's the ground state. So if I wanted to, I could train one of these tensor networks for some machine learning task, then construct the Hamiltonian for which it's the ground state and look at that Hamiltonian. Maybe I would learn something from doing that, I'm not sure, but it's interesting to think about. But maybe the most important analogy to physics is in interpretability: we could use physics ideas to look into what the tensor networks are doing. One key example — again, I didn't have time to get into all this, but I think I mentioned it to a few people — is that if you have a translation-invariant MPS that you've trained, this is coming from the physics literature, one that goes
on forever, you can evaluate the correlation length precisely. You pick out one of the MPS tensors — maybe I did mention this yesterday — and you make this thing called the MPS transfer matrix, that object, and then you diagonalize it. You look for its dominant eigenvectors, which would be some other tensors like this: call them vj, so you're looking for the lambda_j and vj, the eigenvalues and eigenvectors of that thing. You can think of this as something like the result of putting an operator into two copies of the MPS and rolling it up into a blob, and you want that blob to transport through the MPS in some simple way. By getting those numbers lambda_j, you can actually bound the correlation length of the MPS. So it would be very interesting to do that for data: take some kind of time-series data — historical temperatures, or prices, or language, something with a 1D character — train a translation-invariant MPS to model it, whether by generative modeling or supervised learning, then extract the correlation length from the MPS directly. You'd have learned something about the data that would otherwise be pretty hard to calculate. So that's physics techniques coming over that you could use, and I think there are a lot of things like that; I think it could be interesting. So let me now switch, unless there's a pressing question, to the part about quantum machine learning. The idea of this part is to say that tensor networks are quantum circuits, and people are trying to figure out good uses of quantum computers, and machine learning might be one of those uses. So there's a nice fit, I think, between tensor networks, quantum computing, and machine learning. Credit to my co-author, the first author on our paper, Bill Huggins, who kind of pushed me to work
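The transfer-matrix recipe above can be sketched in a few lines. The MPS tensor here is random, purely to illustrate the mechanics; index conventions (`A[left, right, phys]`) are assumptions for the sketch:

```python
import numpy as np

# Extract a correlation length from a translation-invariant MPS tensor
# via the transfer matrix, as described above.
rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4, 2))               # bond dim 4, physical dim 2

def correlation_length(A):
    D = A.shape[0]
    # Transfer matrix E[(a,a'),(b,b')] = sum_s A[a,b,s] * conj(A[a',b',s]).
    E = np.einsum('abs,cds->acbd', A, A.conj()).reshape(D * D, D * D)
    lam = np.linalg.eigvals(E)
    lam = lam[np.argsort(-np.abs(lam))]      # sort by magnitude
    # Connected correlations decay like (|lam_2|/|lam_1|)^n, so the
    # correlation length is xi = -1 / log(|lam_2| / |lam_1|).
    return -1.0 / np.log(np.abs(lam[1]) / np.abs(lam[0]))

xi = correlation_length(A)
```

The ratio of the two largest transfer-matrix eigenvalues is what bounds how fast correlations can decay, which is the quantity the speaker suggests extracting from trained models of time-series data.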
on this. I was pretty reluctant, because I was like, I don't know much about quantum computing, what's going on there, and he said, let's just look into this. Then around the same time that we published, this group at University College London also published very related work, so it was nice to see that this idea was in the air. Also — I'm not sure if I have it on the slide, but I wanted to mention it; I might have a slide later — it was really directly inspired by a work by Brian Swingle and Isaac Kim, who had some other ideas about using tensor networks for quantum computing. Their ideas are very similar to the ones I'm going to show, because ours were inspired by theirs, but their idea was to use tensor networks to find ground states on a quantum computer: if you propose the ground state to be in the form of a MERA, you can then exploit the form of the MERA to write better quantum algorithms. So I encourage you to take a look at their paper, and I'll try to provide a reference — you can email me and I can send it to you. But then I was thinking, okay, that's great, but I think machine learning is an even better fit, for reasons that I'll say. Basically, some of those reasons include that quantum computers are very noisy and machine learning is very tolerant of noise, and also that it's just a pressing application that people would want to use quantum computers for. Okay, so a bit of background: what is a quantum computer?
This may not be the most authoritative definition, but just for our purposes: what I'm thinking about is gate-based quantum computing. There are other schemes too — things like annealers — but I'm thinking of gate-based. So you can think of a quantum computer as a set of coherent qubits — just some collection of qubits, which here are these circles — for which one can do the following things efficiently. First, prepare certain initial states: you can prepare them into some specific product state, so here it could be that they're all in the zero state; that's really what those zeros are. Second, apply unitary operations. Usually these take the form of one- and two-qubit unitaries. The one-qubit ones are very flexible — you can apply arbitrary one-qubit operations — while the two-qubit ones are more limited; usually it's just something like a controlled-Z, these very specific two-qubit unitaries, though it depends a lot on the hardware which ones you can do. And I say usually because it's actually very interesting to me, talking to some of the people in these labs, to find out that they can often do multi-qubit unitaries, but they don't always advertise this that much. I talked to one of the IBM experimentalists, Hanhee Paik, about this, and she was mentioning this one unitary she has where they basically blast all of their superconducting qubits with microwaves, and this has some kind of effect on all the qubits at once: it applies a different phase to each one, where the phase you get depends on the particular qubit. I went up to her after her talk and said, that's awesome that you can do that, and she started apologizing, saying, oh, I'm so sorry, it's not a two-qubit unitary — theorists only want two-qubit unitaries — and I'm like, no, no, I want more than two
qubits, actually, and you'll see why in a few minutes. So that's something, by the way, for people who are thinking about this field: we should really be pressing the experimentalists to tell us about all the unitaries they can do past two qubits, and we should try to use all that capability. The last thing is performing measurements, and this is interesting because — well, let me go through the slides. So these are the qubits, and time is notated with these lines; it's kind of like a musical stanza. You put the unitaries on in time — one-qubit, two-qubit, in some pattern — then you do measurements. On some hardware you can actually measure just a single qubit, or a subset of qubits, without disturbing the others, at least in principle; on other hardware you have to measure all the qubits — it's all or nothing. Some companies are working to put in this single-qubit measurement, which is something I'd like to have, so it'll be nice when that comes online more, but it depends on the particular hardware. Okay, so now to the idea of doing machine learning with quantum computing. This is an idea that went in an interesting direction just this year. There had been some works in past years about this, but all of a sudden, starting in January and February, a bunch of groups all had the same idea basically at once, which happens a lot in physics. First of all, let me say what the broader picture is. These two slides will be about this broad idea that Maria Schuld called circuit-centric machine learning, and the idea there is: instead of thinking about how people do machine learning classically and trying to imitate that on a quantum computer, let's just let a quantum computer do whatever it does best and then try to let that do machine
learning. So it's like you just let the quantum circuit be the model. Then there are two modes of doing this. One of them is supervised learning, or discriminative learning, and the idea is: here we have data, which we will represent as a product state — that should sound familiar from the earlier slides with the tensor networks, but this was just some other groups thinking about it; this is a group at Google, Farhi and Neven, and this is a group in Toronto. They were saying: prepare the data as a product state, then have some quantum circuit that acts on that product state. Here I've been very vague as to what quantum circuit it is, and some of these papers are equally vague about it. The idea is that this whole bubble here is just a giant quantum circuit — it's one big unitary, though in practice it would have to be lots of smaller unitaries acting one after another. Whatever this big unitary is, let's go and find it: that's our model, that's like our adjustable weights. Once we've found this unitary, we apply it and then just measure one of the output qubits — we could measure them all, but we only care about one — and the output tells us what class the data is in. It's either class zero or class one, like A or B; that's it. If you want more labels, you can measure two or three qubits, and you'll get exponentially many labels to pick from. So that's that idea — we had the same idea, but we were still working on our paper at the time. Then there's another mode, which is basically flipping this around: generative modeling. Let's say we want to generate data that's similar to some other data that we've seen. This was proposed back in December, and also by this group of Benedetti et al. in January. They said: okay, start with qubits that are just prepared in some arbitrary reference state, like all zeros — I'm using triangles here
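The discriminative mode just described can be sketched as a small statevector simulation. This is my own toy, not any of the papers' code: the feature map is a common choice assumed for illustration, and a random unitary stands in for the trained circuit:

```python
import numpy as np

# Circuit-centric discriminative learning, simulated classically:
# encode data as a product state, apply one big unitary (the "model"),
# and read the label from the probability that a designated qubit is 1.
rng = np.random.default_rng(3)
n = 4                                        # number of qubits

def encode(x):
    # Map each pixel x in [0,1] to the qubit state [cos(pi x/2), sin(pi x/2)].
    psi = np.ones(1)
    for xi in x:
        psi = np.kron(psi, np.array([np.cos(np.pi * xi / 2),
                                     np.sin(np.pi * xi / 2)]))
    return psi

def random_unitary(dim, rng):
    q, r = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q * np.sign(np.diag(r))           # fix column signs

U = random_unitary(2 ** n, rng)              # stand-in for the trained circuit
x = rng.uniform(size=n)                      # stand-in for one data example
out = U @ encode(x)
# P(first qubit = 1): total weight of basis states whose top bit is set.
p1 = np.sum(np.abs(out[2 ** (n - 1):]) ** 2)
label = int(p1 > 0.5)
```

On real hardware, `p1` would be estimated by repeating the circuit and counting outcomes rather than read off the statevector.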
for some reason, but that just means the zero state. Now apply some trained unitary — we trained it somehow; I'll tell you in later slides how you could train it, and a bit more about how this group trained it — some big unitary, which could be an entire circuit, acting on this reference state. Then again, at the end we measure the qubits, but here we don't just care about one or two of them, we care about all of them, and the output is a binary string: zero zero one zero one one, or something. That's one sample from our model. So we sample from the model, and that could generate things like images that look like the images we trained on. This is really just basic quantum mechanics: prepare a wave function, collapse it, and then use that for machine learning somehow. Yeah, I'll definitely get to how you do the updates, but basically this is just the part about evaluating your function. You can evaluate different functions and see how they do, and then the question is whether you can do that in some smart way: do I have to search over all the different unitaries, or can I follow a gradient or something and get there faster? I'll mention a little about how to do that, so if you still have that question in a few slides, let me know. Okay, but even before we get into the issue of what circuit should go here — or maybe that is what I'm about to say — there are issues with these proposals. I like these proposals, because some of the earlier proposals that I'm not going to talk about much were saying other things, like, let's put a neural network onto a quantum computer. That might work great, but I'm not sure for
myself if a quantum computer really wants to be a neural network; a quantum computer just wants to be a quantum computer, right? These ideas are saying: let a quantum computer do what it does — prepare states, apply unitaries, do measurements — but try to make those unitaries and measurements accomplish interesting tasks. That's the idea. However, the way these proposals were framed, these four papers I mentioned left some things unsaid, or on the table, and there are less-than-ideal things that they even acknowledge in their papers. One of these is: how do we parameterize these circuits? What numbers should go in here? We can't really make a completely arbitrary n-qubit unitary — that's not feasible — so we have to break this up into more specific unitaries, either two-qubit ones or those very special multi-qubit ones that I mentioned they have in the labs. And if you do try to just say, let it be a random circuit, you find massive problems with that. Another paper by the Google group came out in March, and they showed that if you try to train a random circuit for machine learning, the gradient is very, very flat: you're basically on this nearly infinitely flat plateau in parameter space and you can't go anywhere. So it's really bad, and you need some kind of smart circuit, some better circuit to put here — you can probably guess what I'm going to say that should be. The other problem is that you need too many qubits, especially in this generative approach. Say I want to generate realistic images, so I want to output pixels, but an image might be thousands of pixels. Am I going to need thousands of qubits to do that? That's not going to fly — we're not going to have thousands of coherent qubits for a while, so we need something we can do with tens of qubits
in the near term. So now in come tensor networks. The thinking here is that tensor networks are equivalent — one-to-one — with quantum circuits. If I take a particular tensor network — this is a tree tensor network — I can always read it as a quantum circuit, and this is not just an analogy; it's a precise map. What I can do is say: these lines at the bottom, let's say they're spin-one-half spins. That's already a qubit — just a two-level system. Then if I merge them through this tensor into a four-dimensional index, I can always think of that as a product of two two-dimensional spaces; I can always factorize it. So that's these two lines going up here, continuing up to the top level, like this index. There's always some mapping you can do between these tensor networks and a quantum circuit, especially the ones that are trees — and MPSs are a special case of trees. So here, in fact, is the quantum circuit for a particular matrix product state. If I have a bond dimension four matrix product state — meaning the matrices that connect one site to the next have an index size of four — I can always write it as this circuit, and this is a totally faithful mapping, one to the other. That's the first tensor, that's the second, that's the next, and so on. You can see that any external line here, any physical line, goes and touches one of the tensors, and this tensor is joined to this one by two lines — that's a dimension of four — and by two lines to this one — again dimension four. So you can see here it is: two lines going to the left, two lines going to the right, and one connecting to the physical space, just like here — dimension-four space, dimension-four space, dimension-two space. Hopefully you can see that's an MPS, if you learn how to read it that way. So that's an MPS as a quantum circuit, and that suggests — well, we already know
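One way to see that this tensor-network-to-circuit map is exact, at the level of a single tensor: an orthogonalized MPS tensor is an isometry, and any isometry can be completed to a unitary, which is the circuit gate. A minimal sketch under my own shape assumptions (bond dimension four means two qubits passing through):

```python
import numpy as np

# One MPS site as a circuit gate: orthogonalize a random tensor so it is
# an isometry, then complete its columns to a full unitary.
rng = np.random.default_rng(4)
Dl, Dr, d = 4, 4, 2          # bond dim 4 = two qubits, physical dim 2

# Random site tensor viewed as a map (right bond) -> (left bond x phys);
# QR makes its columns orthonormal, i.e. an isometry.
A = rng.normal(size=(Dl * d, Dr))
V, _ = np.linalg.qr(A)

# Complete V to a unitary: take an orthonormal basis of the complement
# of its column space and append it as extra columns.
P = np.eye(Dl * d) - V @ V.conj().T          # projector onto the complement
w, vecs = np.linalg.eigh(P)
comp = vecs[:, w > 0.5]                      # eigenvectors with eigenvalue ~1
U = np.hstack([V, comp])                     # this U is the circuit gate
```

The extra columns of `U` act on the "fresh" reference-state inputs of the circuit, which is why the circuit picture has those additional zero-state wires coming in.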
that matrix product states are useful for machine learning; I already mentioned all these slides with all these other works people have done. One of them was this group from Beijing that actually trained a matrix product state, sampled from it, and got good-quality images out — they got good-quality handwriting samples out of a matrix product state. So you could do the exact same thing here: interpret the matrix product state as a quantum circuit — it could be that exact same matrix product state that group trained — sample from it on a quantum computer, and the output would be the same outputs they had. The idea is that this gives you some motivation for how to pick your circuit: instead of picking some messy random circuit, you could pick a matrix product state circuit, and you know it'll do well, because we already have these classical experiments showing that. Best of all, you can connect with those classical approaches — by classical I mean all the earlier slides where people are using regular computers to train these. You can optimize a matrix product state, or some other tensor network, as far as you can push it classically, then take those parameters and put them onto a quantum computer — you're guaranteed to get at least the same performance, apart from important issues about the noisiness of the computer — and then continue to optimize it more on the quantum computer. Yes — well, because you may not be happy with the performance you're getting: you might get up to a certain performance and say, I wish I had more resources, I wish I could do a larger MPS. By going onto a quantum computer, in principle you can. The gate generation process — oh, you mean like applying unitaries? Yes. The advantage is that you can manipulate very high-dimensional spaces directly. The idea is — and I'll get to this in a minute — but the idea is
that this is a bond dimension four MPS, because two qubits go through, but if three qubits went through it'd be bond dimension eight; four, it'd be sixteen. It grows exponentially with however many qubits I let go between this unitary and that one. So if I can let something like sixteen qubits pass between this set of operations and that set of operations in time, then I can parametrize a matrix product state whose bond dimension is something like 200,000 — way out of bounds for what we can do classically. So — that is your question, and it's helpful that you're asking it, thank you — the idea is that here this is just a bond dimension four MPS, but if we can work with something like 16 or 17 or 18 qubits, which are the numbers these groups are working toward, we can work with matrix product states with bond dimensions in the hundreds of thousands, in principle. So we could have extremely powerful models; that's the idea. And I'll show you in a minute how that can work even with a small number of qubits. So that's the idea for a generative model based on matrix product states, but written in a form that is nothing but a program that can be run on a quantum computer: if I tell you exactly what these unitaries are, you can carry this out on quantum hardware. I'm glossing over some details — namely that you don't have three-qubit unitaries, but there are ways to so-called compile them into two-qubit unitaries, meaning find two- and one-qubit unitaries that approximate the three-qubit ones, or you can use those multi-qubit unitary capabilities. There are other things you can do as well: you could not propose this model at all and instead propose one that only has two-qubit unitaries, and train it that way. You can also do discriminative learning by flipping it around: you prepare your data as a product state, then apply another kind of tensor-network-motivated circuit, or tensor network
equivalent circuit, then read off one designated qubit to get the label. You measure it, and you do that a few times — this is probabilistic — to estimate which label the model wants most often, and that's your output. How do you actually train it? This comes back to your question. The idea is that you take a given set of gates that you think will do a good job — these could be initialized from some classical pre-training, like the stuff I've been doing, or you just start with some random gates — then you run the program multiple times on your data to estimate the output well, because it's probabilistic. Now you have an idea of how the model performs, and you feed the result to a classical algorithm whose job is simply to propose new gates. It's that simple. This is the idea behind things like QAOA or the variational quantum eigensolver: the hybrid quantum-classical algorithm, which is very popular right now in quantum computing. The idea is — wait, we don't have to do everything on the quantum computer; we can think of the quantum computer as something that's good at one part of the algorithm, and all the other parts can be done classically without too much cost. That sounds very simple, and obviously there are some steps I'm leaving out, but it really can be that simple. So one thing you can do — let's see, did I put a slide on this? I can't remember. Okay, not the thing I want to say right now. One thing people do is very simple; it's called gradient-free optimization, and the idea is you really do just propose a circuit, try it, then propose another circuit and try that, and so on, and then just try to sort
of prune the circuits that perform poorly and advance the ones that perform well. These are things like genetic algorithms or particle swarm algorithms: you have a population of circuits, you evaluate all of them and see how they did; the ones that did well advance to the next step, the ones that did poorly get killed, and then maybe you crossbreed the ones that did well in some way. That's why they're called genetic algorithms — it's like evolution; evolutionary algorithms. And you can actually do that, and it works. That paper I mentioned earlier, by Benedetti and Perdomo-Ortiz, did exactly this: they used a Python particle swarm package that they didn't write, fed their unitary angle parameters into the particle swarm optimization — which proposes lots of circuits — kept testing and re-optimizing, and got very good performance. So you can do this. But there are other things you can do too: you can stick in specially designed unitaries that change the meaning of the output of the model, so that instead of the output being the label, the output is the gradient with respect to a particular unitary. You can get the gradient out by putting specially designed unitaries in and estimating the output of certain qubits — so you can get the gradient in another scheme. So you can do various things to train these; it's pretty interesting, and I'll mention one more on the next slide. So, heroically, Bill, my colleague at Berkeley, took this on. He took on the very challenging task of training a discriminative machine learning model as a quantum circuit while restricting to operations you could actually do on quantum hardware. This is tough, because he didn't actually calculate gradients
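The gradient-free, population-based training described above can be sketched in a few lines. The "circuit" here is just a stand-in cost function over angle parameters, invented for illustration:

```python
import numpy as np

# Toy evolutionary search over circuit angles: score a population,
# keep the best half, mutate, repeat.
rng = np.random.default_rng(5)
target = rng.uniform(0, 2 * np.pi, size=6)       # pretend optimal angles

def cost(theta):
    # Stand-in for "run the circuit and estimate the loss".
    return np.sum(1 - np.cos(theta - target))

pop = rng.uniform(0, 2 * np.pi, size=(20, 6))    # population of circuits
for generation in range(200):
    scores = np.array([cost(t) for t in pop])
    survivors = pop[np.argsort(scores)[:10]]     # keep the best half
    children = survivors + 0.1 * rng.normal(size=survivors.shape)
    pop = np.vstack([survivors, children])       # advance + mutate

best = pop[np.argmin([cost(t) for t in pop])]
```

A particle swarm optimizer, as used in the paper mentioned, follows the same propose-evaluate-update loop, just with a different update rule for the population.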
because, as I mentioned, you can do that but it takes special techniques. He did a simpler thing that's harder to get working, called SPSA, and I'll describe that on the next slide, but first let me show you the results he got. This is showing how the algorithm does over up to 5,000 steps. He kept looking at the test set — which you're not really supposed to do, but he didn't use it as training data; he just wanted to see how the model was actually going to perform. So he's training on the training data and keeps taking a peek at the test accuracy to see whether it's going to work, and it just keeps getting better and better, up to very high accuracy. And here it is over the last few tens of thousands of steps — lots of steps, because this is a very tricky training technique — getting up to 99% accuracy. This is distinguishing two kinds of handwritten digits from one another using this tree tensor network applied to the data. So what is the algorithm Bill used? This thing called SPSA. It's very simple: go through the circuit and pick one angle that you want to improve — say, angle number three from unitary number 80. Then, how do you improve that angle? Prepare two new circuits, one where that angle is slightly bigger and one where it's slightly smaller, evaluate both circuits and see which one does better, then either pick the better of the two or move a little bit in the direction that improved things, with some empirical step size, which could maybe be a random step size. So you take these tiny steps carefully, one parameter at a time, and that's the scheme. You can do this on quantum hardware, because all it really requires is that you can
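A minimal sketch of the two-point update just described. One caveat on naming: SPSA, strictly speaking, perturbs all angles simultaneously with a random direction; what follows is the simpler per-angle variant as described here, with a toy cost standing in for "run the circuit and estimate the loss":

```python
import numpy as np

# Per-angle two-point update: nudge one angle up and down, evaluate both
# circuits, and step toward whichever did better.
rng = np.random.default_rng(6)
target = rng.uniform(0, 2 * np.pi, size=8)

def cost(theta):
    # Stand-in for "run the circuit a few times and estimate the loss".
    return np.sum((np.sin(theta) - np.sin(target)) ** 2)

theta = rng.uniform(0, 2 * np.pi, size=8)
eps, step = 0.05, 0.1
start = cost(theta)
for _ in range(2000):
    k = rng.integers(len(theta))             # pick one angle
    up, down = theta.copy(), theta.copy()
    up[k] += eps
    down[k] -= eps
    # Move in whichever direction lowered the cost.
    theta[k] += step if cost(up) < cost(down) else -step
```

Nothing here requires gradients of the circuit itself — only the ability to run it and score the result, which is why this style of update works on real hardware.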
evaluate whatever circuit you're testing — that's all it requires. So he did it that way and got this good performance. Another thing we worked on was the idea that these circuits might be robust to noise. It could be robust because of the way we trained it, but we think it also has to do with the tensor network structure we picked: because it has this tree structure, even if one branch of the tree gets corrupted by some random noise, other branches can reach the top and still give a good output. So as we feed the data through a tree — I should have drawn the tree in one of the slides, but the data comes in as a product state and is fed upward through a tree tensor network — there's some noisy map that gets applied to the qubits, simulating the actual noise that could be present in a quantum device. Applying this noisy map has the effect of corrupting the information in the qubits going up. This axis is increasing noise; red means things that are dangerously close to being misclassified, and gray means they have been misclassified. You can see that for quite a lot of noise — this is a realistic level of noise, and this is getting even bigger than a realistic level — you can still recover the right labels. These red ones tend to sometimes give the wrong labels, but they can still give the correct labels if you just poll the output qubit enough times to really pin down the right distribution. If they're gray, they just give the wrong labels — no matter how many times you measure, you'll get the wrong thing. But it held on for a very long time, and only really starts to get corrupted down here. So that was encouraging to see. So in the last few minutes, let me just say a few more ideas
we had. One idea that I'm pretty excited about is: what can tensor networks bring to quantum machine learning besides maybe a better choice of circuit and nice properties like robustness to noise? Is there anything bigger they can bring, other than "I like this circuit, someone else likes that circuit"? I think they can bring some pretty big things, because they can get you around the bottleneck that you only have a very small number of qubits in near-term hardware. This is an actual quantum computer — one of the IBM devices — and it has just five qubits. This is just to motivate that the really high-quality devices right now don't have a lot of qubits; there are some that have a lot of qubits, but the quality is maybe not as high. So can we still do interesting things with a small number of qubits? You can, if you use tensor network thinking. Let me show you how to use tensor network ideas to sample a very large output — much larger than the number of qubits. Here's how it works. Let's say I have four qubits; that's all I get to use. First I entangle those qubits with a bunch of unitaries — this blob means a bunch of two- and one-qubit unitaries that collectively act like a four-qubit unitary. After that round of entangling, I measure just one of the qubits: I collapse that one and record what I get — say zero, or maybe one. The other ones are still, in principle, entangled with each other, so they go on in time, and now I re-prepare the one that I measured. I start entangling them all again — this unitary could be the same as that one, or different; interesting things can happen either way — then I measure that first one again, record the result, reset it, entangle, measure, reset, entangle, and go on, more and
more and more. In principle I can do this forever — of course, in practice there's a coherence time and it's going to decohere — but what I'm doing is trading circuit width for circuit depth. That's the idea. I can record all these outputs, and you see that here I have six outputs from only four qubits, so I can get many more outputs than I have qubits. And what am I really doing here? Are these going to be interesting outputs, and what kind of properties can this output data have? The way to understand it is to think of the qubits that I don't measure as a kind of hidden state, some virtual space, and that hidden state's size grows exponentially in the number of qubits. Here this is an eight-dimensional space, but if I add one more qubit it's a sixteen-dimensional space, and so on. If I had 17 or 18 qubits, this would be a really big hidden space transmitting the information about what measurement I got here to the next one, and the next, and the next. You can think of a flow of information: once I collapse this qubit, that tells me something about the state of the other three, which influences the next measurement, and so on. And we can connect this to tensor networks as follows. That's just some weird circuit, but if you interpret these circuit elements as tensors — which you can — and absorb anything fixed, these little fixed starting vectors and reset vectors, into the tensors, then just reshape them — that's all I'm doing from this picture to this one to this one, just redrawing, no change — then on the very last line, redrawing collections of indices as one big index, that is a matrix product state. So you can actually show that this process of sampling, where I take
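The MPS side of this equivalence can be sketched directly: sequential, site-by-site sampling from a matrix product state, which is exactly what the measure-reset-entangle loop implements on hardware. This is my own illustration under assumed index conventions `A[left, right, phys]`:

```python
import numpy as np

# Sequential (perfect) sampling from an MPS. Putting the tensors in
# right-canonical form first makes each site's conditional probability
# easy to compute, since the environment to the right becomes trivial.

def right_canonicalize(tensors):
    # Sweep right to left so that, at every site past the first,
    # sum_s A[:, :, s] @ A[:, :, s].conj().T = identity.
    tensors = [t.copy() for t in tensors]
    for i in range(len(tensors) - 1, 0, -1):
        Dl, Dr, d = tensors[i].shape
        M = tensors[i].transpose(0, 2, 1).reshape(Dl, d * Dr)
        q, r = np.linalg.qr(M.conj().T)          # LQ decomposition via QR
        k = q.shape[1]
        tensors[i] = q.conj().T.reshape(k, d, Dr).transpose(0, 2, 1)
        # Absorb the triangular factor into the neighbor to the left.
        tensors[i - 1] = np.einsum('abs,bk->aks', tensors[i - 1], r.conj().T)
    return tensors

def sample(tensors, rng):
    v = np.ones(1)                               # left boundary vector
    bits = []
    for A in tensors:
        # Conditional probability of each outcome at this site.
        branches = [v @ A[:, :, s] for s in range(A.shape[2])]
        p = np.array([np.linalg.norm(b) ** 2 for b in branches])
        p = p / p.sum()
        s = int(rng.choice(len(p), p=p))
        bits.append(s)
        v = branches[s] / np.linalg.norm(branches[s])
    return bits

rng = np.random.default_rng(7)
mps = [rng.normal(size=(1, 2, 2)), rng.normal(size=(2, 2, 2)),
       rng.normal(size=(2, 1, 2))]
bits = sample(right_canonicalize(mps), rng)
```

The running vector `v` plays exactly the role of the unmeasured qubits in the circuit: it is the hidden state that carries information from each collapsed measurement to the next.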
four qubits and sample, sample, sample, and then finally at the end measure them all, is precisely equivalent to preparing a matrix product state and sampling from it. So if I can understand what spaces matrix product states cover — what kind of samples I can expect to get from a matrix product state — that's exactly the space of samples I'll be getting from this algorithm. And if I can use more and more qubits, I can sample from matrix product states of increasingly large bond dimension. That's an idea we're pursuing right now. The last slide is to say that these ideas have been tried on actual quantum hardware. This is from that other paper that came out around the same time as ours, by the group in London. They tried a little toy data set called the iris data set and got very reasonable results, and this is the circuit they used — it looks like a tree — and they actually carried it out on an IBM computer. So it's really been done, just on toy data so far, but I think this has some interesting ideas that might keep pushing what we can do with quantum hardware. Okay, so I had another part of the talk prepared in case there was time, but there isn't, and that's okay. If you want to check out this other part, I'll post the slides, and take a look at this paper — it'll be in the slides. This is the work I did about turning data into a density matrix and then growing a tree tensor network: that's what the density matrix looks like after you explode it into a tree tensor network, and you can actually calculate that — you basically diagonalize this exponentially big matrix that way — and then you can put things on top to do machine learning and get good performance, do a partial number of layers and so on, and get state-of-the-art performance on these challenging tasks. I was trying to get to the conclusion slide — all right, so thanks for your attention.