I have the pleasure to welcome Miles. He'll talk about something which is a nice complement to the lecture that you had in the morning: tensor networks for machine learning applications. Thanks, Miles.

Great, okay. Hi, I'm Miles. Glad to be here with you all. So I'm gonna be telling you today and tomorrow morning about tensor networks for machine learning applications. So you've been hearing about neural networks, which are something that came up in the machine learning field and have been making their way more and more into physics applications. These talks will be about sort of the complementary activity, which is that tools mostly from quantum physics — quantum many-body physics — have some chance, or some opportunity, to make their way into broader fields: machine learning and other areas of the sciences that I'll probably touch on tomorrow, things like solving differential equations or optimizing functions. So that's what it'll be about.

Let me briefly tell you a little bit about where I'm coming from. I work at a place called the Flatiron Institute, and I thought I would just flash one slide about that. So Flatiron, we're located in New York City, right in Manhattan. And we're kind of unique in a certain way, which is that our focus is on computing and computational sciences. You can see all of our centers have the word "computational" in the name somewhere, so they all have a letter C somewhere in them. And we cover a lot of areas by now. We started out with just four centers, but now we have quite a few more. So we have centers doing astrophysics and biology; I'm in the one for quantum physics, and there are some people here from that center attending. There's also computational mathematics, and then the newer ones are neuroscience and this new thing called the Initiative for Computational Catalysis. That one's really new — they're actually hiring postdocs and faculty right now at ICC. And it's gonna be like a half-size center, running for 10 years, doing quantum chemistry and applications to catalysis. So some of you may be interested to learn more about that.

So let me start by talking about what I'm gonna cover today and tomorrow. Today I'll go through the basics of tensor networks — what are tensors, what are tensor networks — and then some kind of broad introduction to how I could see them being used in machine learning, how they're starting to be used, what the potential is, and what some of the drawbacks of the current approaches are. Because as of right now, it's not as broadly applicable as, say, neural networks are to all kinds of data; it's a bit more niche. So we'll unpack a little bit why. But then tomorrow I'm gonna challenge you all a little bit by going really deep into two algorithms for training tensor networks — in one case from data, and in the other case from a function that the machine learning algorithm is able to call. So we'll see some really novel things. For example, neither algorithm will have a gradient in it anywhere — there's no gradient descent in these algorithms; it's all linear algebra. That's one novel thing. The other novel thing is that in the second algorithm, the one here at the bottom called the tensor cross interpolation algorithm, there's no data. So it's more like reinforcement learning, where you just have the ability to explore an environment.
In that case, the environment is like a function — like a piece of code that the algorithm can call — and it's able to query this function or code and learn all the outputs of the function just from a few queries. So that's an active learning type of idea. We'll get into those more tomorrow; today's gonna be a bit more broad.

I wanted to start by asking: how many of you have heard of a tensor network before? Okay, most of you, right? What do you associate those with when you hear about tensor networks? What are some concepts that you're familiar with? [Audience: DMRG?] Yeah, I think someone said DMRG. That's the algorithm that really launched that field. What are some other concepts you may have heard of? Low-rank approximation? Very good, yeah. That's a very good mathematical way to place tensor networks — that's really what they are. A lot of people think of them as wave functions; that's how I learned them. But really, they're a mathematical technique based on low-rank approximation for working in high-dimensional spaces. So that's what today is gonna be about.

And then, just to convince you that this is a real thing — that I'm not just talking about concepts — let me give you a quick sneak preview of what is gonna be the last talk tomorrow, on this algorithm called tensor cross interpolation. Let me just show you this demo of what that algorithm can do, and I'll explain in the second lecture tomorrow how it works. One of the things this algorithm can do is: you can give it some function, like a piece of code for a pretty complicated function, and it can machine-learn it into a tensor network. And I'll tell you in a minute what a tensor network is and what this diagram means, but for now just think of it as some kind of architecture with parameters. So here's the function: it's 40 Gaussians at random locations, with random widths and random heights that can be positive or negative. And then, just for good measure, a sharp step at the location x = 0.4. So a kind of hard, or complicated-ish, function, right? And let's see what we can do.

So here's the code for that — nothing too special; this is in the Julia language. This is just saying: 40 Gaussians; make random widths and heights as arrays; here's the step information; make a function that evaluates all of that at a point; and throw it into this tensor cross interpolation code, which for now is just a black box. And here's the code running — it already ran; that's how fast that went. And there's the function. The actual function, evaluated on every possible point, is these thin lines, and the blue points are where the TCI algorithm evaluated the function. You can see that in some places it evaluates more densely and in other places more sparsely. It does look like a lot of points, but there are some statistics here: on the last pass over the 1D space it only did something like 1,300 queries of the function, versus the potential 262,000 grid points that the function is living on. Once it's represented as a tensor network, that's how many grid points you can actually ask for the value of. So it was able to fill in that many grid points just by evaluating the function about 1,300 times — only about half a percent of the possible points. So we can do this again; it's just generating a random function each time. Okay, so there it learned that function too, and you can see there's this sharp step, and it has no problem with that either.
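For reference, here's a rough sketch of what that demo code might look like, assuming the open-source TensorCrossInterpolation.jl package and its crossinterpolate2 entry point. This is not the actual demo code — all the parameter values and helper names here are made up to mirror the description above:

```julia
import TensorCrossInterpolation as TCI

# 40 Gaussians with random centers, widths, and signed heights,
# plus a sharp step at x = 0.4 (mirroring the demo described above)
nterms  = 40
centers = rand(nterms)
widths  = 0.005 .+ 0.05 .* rand(nterms)
heights = randn(nterms)
g(x) = sum(heights[k] * exp(-((x - centers[k]) / widths[k])^2) for k in 1:nterms) +
       (x > 0.4 ? 1.0 : 0.0)

# Put the function on a grid of 2^18 = 262,144 points in [0, 1),
# with one size-2 tensor index per bit of the grid coordinate
n = 18
x_of_bits(bits) = sum((bits[k] - 1) / 2^k for k in 1:n)   # each index runs over 1:2
fI(bits) = g(x_of_bits(bits))

# The black box: learn the function into a tensor train from few queries
tci, ranks, errors = TCI.crossinterpolate2(Float64, fI, fill(2, n); tolerance=1e-8)
```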
Okay, so that was just a quick demo of what some of these algorithms can do, just to give you a feel for their power.

Okay, great. So let's start at the beginning, with tensors and tensor networks, and then why I personally think tensor networks have some promise for machine learning and could be quite interesting. Today I'll use the inspiration of the DMRG algorithm for that. And I'll end today by discussing two really contrasting ways to represent data as tensors and the advantages, disadvantages, and potential of each — one I'm calling high dimensional, one low dimensional, and they're both very interesting — and then some example applications of what has been done so far, and a bit of a setup for tomorrow.

Okay, so what is a tensor, first of all, starting at the basics? You may have heard a lot of things about tensors. People will write an expression that has a lot of indices, but then they'll say, "but it's not a tensor," and you're thinking, "I thought it was, because I see a lot of indices — what's going on?" Especially if you've studied GR, people like to say that not everything with indices is a tensor. Or you'll show someone some data and say, "here's a tensor," and they'll say, "no, a tensor is not the data — it's how it transforms." You've probably heard that, right? These are all true statements, but the problem is they usually overcomplicate the story. The point is: once you have a basis, and you know what basis you're in, then in that basis a tensor can be represented as a multi-dimensional array. As long as you know what basis these numbers refer to, that's a tensor. So in this point of view — a very practical, engineering point of view — a tensor is just an extension of the idea of a vector or matrix up to higher dimensions. A vector is a one-dimensional array of numbers; a matrix is a two-dimensional array of numbers. We can talk about the components: we can say the second component of the vector v is the number three, or the (1,2) component of this matrix here is seven, and so on. An order-three tensor would have three dimensions, right? But you can see it's impractical even at order three to try to view this as a multi-dimensional array — you can't really write it down; you have to use 3D drawings, and order four would already be very hard to draw this way.

So a much better notation was invented by Roger Penrose, and it's called Penrose diagram notation. It may look like neural network notation, where there are circles and lines connected — it's sort of related, but very different in an important way. With neural networks, when you see the lines, each of those lines is an entry of a matrix; here, the lines are indices. The other difference is that here, when you put things next to each other, they're implicitly in a product with each other, whereas in a neural network, when you put them next to each other, they're more in a direct sum with each other. So this notation is sort of exponentially higher dimensional in what it's expressing, but you can still roughly see the two as related, if you know how to think about lifting direct sums to direct products.
That's just a bit of theory, but more practically, the idea is this: in traditional math you'll see a tensor notated as a letter with a bunch of indices. In the diagram notation, you replace the letter by a shape, and the indices become these lines. And even though you can put the names of the indices on the lines, you don't have to — and that's very nice, because you can express ideas by saying "this is connected to that" without having to think of names for everything, with subscripts, and sometimes subscripts with subscripts; it can get very heavy in traditional math.

To give you some examples of this notation for low-order tensors: a vector is a shape with one line; a scalar would just be a shape with no lines. A matrix is a shape with two lines coming out — the orientation doesn't matter for most applications, so you just put the lines coming out whichever direction you want. An order-three tensor has three lines, and you can already see the advantage: instead of having to make a cube of numbers, you just draw three lines coming out. So you can easily do order four — four lines — or order n, with n lines. You should think of these shapes as like a trunk full of numbers: you can reach into the trunk and pull out a number, and they're all kind of in there. Or like a cauldron — all the numbers are just in there, and you can reach in and access any of them. So if you want to notate how you get a particular number — say the (1,2,2) component of this tensor T — you can think of setting the lines temporarily to fixed values: you set this line to one, these lines to two, and out comes the number seven that's in that array, or tensor. And then the main other rule of this notation is kind of like going one step beyond Einstein summation notation: when you join two index lines, there's a contraction — a sum running over that index.
So on the right is the traditional math notation: we have this matrix with two indices, i and j, summed over the index j — you can see j connecting to j here — and then j is gone, it's summed out, and i carries over to the other side, and we see that M times v equals w. On the left is the exact same expression; I've just dropped all the letters, but you can still see what's going on: here's a thing with two indices connecting to something with one index, and what's left over still has one dangling index. What's nice about this notation — first of all, you don't have to think of the letters, but also you can do these kind of visual proofs. Even if you didn't know this was a matrix and a vector, you could just see that the result has one line left over. So it's nice to visualize what's going on. For example, you can see that the next expression is a scalar, because there are no lines left over — again, these kind of visual proofs. Or the last one is a vector: one line left over. And here's the traditional notation for these, which is pretty nice in this case, but I kind of like the left-hand side better, and again, you don't have to think of all these letter names.

This really pays off when you work with tensor networks. It's not so important for small things like this, but when you really have a lot of tensors it saves you a lot of writing, and I find it more intuitive to see the topology of the network — how things are connected. So on the right is the traditional notation for what's called a matrix product state, or tensor train, tensor network, which I'll introduce in more depth in a minute. There are a lot of papers where you'll only see this expression for it, especially in the applied math literature, and when you see that for the first time — at least for me — it's not so obvious what's going on; you really have to stare at it a while, locate the names of the indices, and think about how they're connected. Whereas on the left you see it immediately: it's a chain of tensors, each one has three indices, and they have a linear, or tree-like, topology — tree in the sense that there are no loops; it's just a chain.

Okay, so now's a chance to test whether you got the last two slides. Just think for a minute about how, in your own words, you would describe each of these three operations. Let's start with this one: what's the typical math name for this operation? Does someone want to say? Yeah, dot product, or inner product, right: we have two vectors — two things with one index each — and that index is summed over, so it's v dot w, a vector inner product. How about the second one? Okay, yeah, trace of a matrix, right: it's a matrix because it has two indices, that index is summed over, and you can see the result is a scalar because there are no dangling indices. Great. And someone else for the last one — this one doesn't have a common name, but just in words, how would you describe what operation is going on? Let's see, in the back. Yeah: contraction of a rank-three, or order-three, tensor with a vector. Great. I typically say "order three" these days, because "rank" is also used to mean that a matrix can be factorized with a smaller sum in the middle, but a lot of people say rank for the number of indices, so that's perfectly fine. Exactly — an order-three tensor contracted with a vector. And you can see these notations on the right are pretty heavy; on the left, you can write the letters if it helps, but you can also leave them off. Okay, great.
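To make those contraction rules concrete, here's a small sketch in Julia — sizes and values are arbitrary, just illustrating that joined lines mean summed indices:

```julia
using LinearAlgebra

v, w = randn(3), randn(3)
M = randn(3, 3)
T = randn(3, 3, 3)

dot(v, w)   # two one-index tensors joined on their index: inner product, a scalar
tr(M)       # a matrix's two indices joined to each other: trace, a scalar
M * v       # matrix joined to a vector on one index: one dangling index left, a vector

# Order-3 tensor contracted with a vector on its third index:
# two dangling indices remain, so the result is a matrix.
A = [sum(T[i, j, k] * v[k] for k in 1:3) for i in 1:3, j in 1:3]
```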
So now, pushing ahead toward tensor networks, one last thing about tensors is that any function of discrete variables can be represented as a tensor. And I think that's, in a way, a very straightforward and obvious statement: you take the function, imagine the variables going in are discrete integer variables — like s1 through s6, each running over d integer values, or let's say two values, one and two — and all you do is plug in all possible values into the function and put the outputs into this tensor. Just fill up the array with all the outputs of f, and clearly it can represent that function, and that's true for any function whatsoever. I wanted to point this out to say that there are no nonlinearities here, and yet it can represent any function whatsoever. I'm gonna make this point a few times in a few different ways: if you know neural networks, you hear very often that you only have a powerful function approximator when you have nonlinearities; here you have none, and yet you can represent any function at all. However, you pay an exponential price in this form. We'll see how to break that exponential later.

Yes — how to do continuous variables? Yeah, I'll show you a really neat way to do that later. For now: if you have a continuous variable, one thing people actually do is just finely discretize it. But that, as you might imagine, is kind of inefficient. Say s1 was continuous — call it x1 — and it goes from zero to one. You might chop it into 100 steps, and then you'd have a 100-dimensional index. People do that, and you can do that; however, 100 is a lot, and especially if you have three or four such indices, then it's gonna be 100 to the fourth numbers inside. Also, you then have a grid approximation, which may not be such a good approximation. Another thing you can do is attach a basis — a basis of functions — and that can work better. Or, purely mathematically, you can say, "I'm not worried about cost, I just want this to be continuous," and that's okay, and you can prove a lot of things that way; but I'm often thinking about how you store it on a computer. So we'll see toward the end of the talk that a really neat thing you can do is use groups of indices to represent the bits of a number, which is an exponentially efficient way of inputting a continuous variable. That's the idea called quantics tensors, or quantics tensor trains, that some of you may have heard of, but I'll introduce it. So it's a really good question; we can discuss more.

Here's just me running through a more intuitive picture of what it's like to fill up a tensor with all the values of a function: imagine we run over all the values of the function — all exponentially many of them — and fill them in. Okay, great.

What are some examples of big tensors — where do tensors pop up? One argument I would make is that they're actually everywhere once you start looking for them. One example is quantum many-body physics, or quantum computing: any time you have a many-body state that you can't represent using tricks, like Clifford tricks or free-fermion tricks, once you're really faced with the full many-body Hilbert space, then the general wave function actually is a big tensor.
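As a concrete version of that "fill up the array with all the outputs" picture from a moment ago, here's a minimal sketch — the function f here is made up purely for illustration:

```julia
# Any function of n discrete variables is a tensor: just tabulate it.
d, n = 2, 6
f(s) = cos(sum(s)) + prod(s .- 1.5)   # arbitrary function of (s1, ..., s6)

T = [f((s1, s2, s3, s4, s5, s6))
     for s1 in 1:d, s2 in 1:d, s3 in 1:d, s4 in 1:d, s5 in 1:d, s6 in 1:d]

size(T)   # (2, 2, 2, 2, 2, 2): d^n = 64 entries -- exact, no nonlinearities,
          # but the storage is exponential in n
```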
It's very straightforward to see: you pick a basis — let's say these are qubits, or spin-halves, so each index s runs over two values, say zero and one — and once the basis is picked, all the information about the state is in the amplitudes, the coefficients in front. That's all the information, right there. And then just by inspection: some numbers indexed by a bunch of integers — that's a tensor, right? So every many-body wave function is a big tensor, whether you like tensors or not. If you're doing quantum mechanics, tensors are there.

So we've seen general functions of many discrete variables, general wave functions — they're kind of everywhere. But why are they challenging to work with? Why do we need tools like tensor networks — why don't we just work directly with tensors? There's a very simple reason: it's called the curse of dimensionality in math, and in physics we call it the many-body problem. The parameters just grow too fast. Say all the indices are size two. A vector with one size-two index: two parameters. A matrix: four parameters. A three-index tensor: eight parameters. But once you go to N indices — think of a wave function with N sites, something like that — it's two to the N parameters. So if you have 50 indices, you'd have 2^50 parameters, which is about 10^15. It's really hopeless, right? We can't store tensors with too many indices — maybe up to 10 is okay, or 20, but after that it's really impossible.

There are some ways out of this, though — quite a few — and I'll focus especially on the last one. There's a whole field, in fact, of people who study sparse tensors. You could say: what if we only know some of the entries? Maybe we just collect some entries, in the sense of collecting data, or maybe most of the entries really are zero — that's one way out, sparsity. Another is to say maybe they're all non-zero, but I'll pick the most important ones more often than the less important ones — so you can do sampling; that's another way out. But the most interesting, personally, I think is low-rank structure, and I'll unpack in a minute what I mean by that.

For the case of a matrix, finding low-rank structure is basically an elegantly solved problem, although there are many different ways to solve it — we'll see a really neat one in the third lecture. The most fundamental way is to compute what's called the singular value decomposition of a matrix. First of all, "low rank": what that means is you have some matrix M — I'm being not too specific here — and you ask: what if we could factorize it into two other matrices, where the outer dimensions are the same as M's (those have to match, these big dimensions), but the inner index that gets introduced between them is small? What if it's just two or four, whereas the outer dimension could be a thousand? That's what's meant by low rank: that small inner index. And finding the optimal A and B with the smallest r is a solved problem — solved by the singular value decomposition. So let's see how the singular value decomposition works. I'm not going to go through the algorithm for how you actually compute it, but just show you what it gives you when you run it on the computer.
So you get some matrix, and here you may see there's already some structure in it — some numbers are repeating; these two columns look related to each other — so we already have some intuition that there might be low-rank structure. So we call the SVD. Basically it squares the matrix on both sides, computes the eigenvalue decomposition of each of those, and puts it all back together — that's roughly how it works, if you want to know; there are other ways to do it. When all that's done and the dust settles, you get three matrices out: U on the left, V on the right, and S in the middle. The idea is that U and V have orthonormal columns — so when they're square, they're unitary — and the S matrix is diagonal. Its entries are called singular values, and they can always, without loss of generality — even with complex numbers in the original matrix — be chosen real, non-negative, and decreasing. So it's really nice, because it's kind of ranking the columns of U and the columns of V — most important, second most important, third most important — and it tells you by how much, giving you a sense of how much each one contributes to the left and right spaces of the matrix.

What's neat about this decomposition is that it lets you do a controlled approximation. First take it exactly: we just do the SVD and multiply back, and there's no error — I'm showing at the bottom the error between the original matrix and the reconstructed one. But what if we throw away the smallest singular value? This is the matrix you get — that basically gets rid of the third column in this example. And what you can do is collect the singular values you throw away and add up their squares, and that actually gives you an error report of how far the reconstructed matrix will be from the original one. So it's a controlled thing you can do: throw away the smallest ones, collect the error you made, and you know exactly how much error you're making. Then throw away the next one, and you see the error is again given exactly by the sum of squares of the singular values you threw away. And what's the point of throwing them away? What's nice is that now this whole block down here is zero, and those zeros multiply into V and into U, so we don't need those parts of V and U anymore — we can keep only the columns corresponding to the singular values we kept. Sometimes singular values are actually exactly zero, and those you can throw away immediately; others are small, and you throw them away knowing you're making an approximation. When you've done this, you get smaller matrices, and everywhere you used to use the big ones you can use these smaller ones — so you use less memory and your code runs faster. That's what's nice about it. Also you can get insights, it's more interpretable, all these kinds of benefits.

So, having done this, how is it related to what I said on the other slide? The SVD spits out three matrices, and if you want to go back to the structure with two, one way is to take the square root of the S matrix and push one factor into U and one into V, and call those A and B. So all I'm saying on this slide is: if you want to think of low rank as factorizing into two matrices, the SVD also solves that — just push S into one or the other, or symmetrically into both. Okay, any questions about SVD and about low rank? That's kind of important.
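Here's a minimal sketch of that controlled truncation, using Julia's standard-library SVD — the matrix here is made up for illustration:

```julia
using LinearAlgebra

M = randn(8, 4) * randn(4, 8)   # deliberately at most rank 4
F = svd(M)                       # M == F.U * Diagonal(F.S) * F.Vt, up to roundoff

r = 2                            # keep only the two largest singular values
M2 = F.U[:, 1:r] * Diagonal(F.S[1:r]) * F.Vt[1:r, :]

# The squared (Frobenius-norm) error is exactly the sum of squares
# of the singular values we threw away:
norm(M - M2)^2
sum(F.S[r+1:end] .^ 2)
```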
Yes — good question. If the matrix is square, the cost is just the cube of its linear dimension. But say the matrix is m by n: the complexity is actually min(m·n², m²·n). That's a nice thing about it: if the dimensions are the same or close, it's just the cube, but if they're different, you can pick the smaller of the two. Say the matrix is 10 by 1000, with n really big — then the dominant cost scales only linearly in the big dimension, times an overhead of 10 squared. Okay, any other questions?

So, okay — I already mentioned that you can do a low-rank factorization of a matrix. SVD is just one way; there are others, but SVD has a kind of provable optimality that we could discuss. Let's just say you have some way of doing a low-rank decomposition of a matrix. That's nice because it reduces your memory footprint and your compute cost. And for the rest of the talk, when I mention the rank, I'm not gonna call it r — I'm gonna call it chi, because that's what's more standard these days in the matrix product state and tensor train literature.

So that's the picture for a matrix — a solved problem. What about tensors? Can we find low-rank decompositions there? Now there's a whole zoo of possibilities. There are whole other types of low-rank structure for tensors that I'm barely gonna mention: there's one called canonical polyadic (CP), with a whole subfield of people studying it; there are many others, like Tucker, these kinds of things. I'm gonna talk about matrix product states almost exclusively in these talks. So how do they work? The idea is that they're more opinionated about the order of the indices. Something like CP tries to keep the indices symmetric, but MPS, or tensor train, says: we either guess an ordering of the indices, or we already know one. Sometimes we know a good one — there might be structure in space or in time, things like that.

So we can do the following procedure. Think of this huge tensor as a matrix that's very lopsided: put one index on the left and all the other indices on the right, making a very rectangular matrix — it might be a 2 by 2^(n-1) matrix. We can always do that; it's like calling reshape on a big tensor in Python. Then we throw that really big matrix into the SVD, out come U, S, and V, and we keep U out on the left — a low-rank factorization. Then we take what's left behind, cut again here, and do another SVD, grouping the indices differently this time. And then another, and another: every time we have this leftover tensor down here, we can cut again and reveal another bond, like that. Okay, so that's a lightning-quick introduction to how that works — I didn't go through all the algorithmic steps, but it's a motivating algorithm for what a matrix product state is. Once we've done that, we call the form at the bottom a matrix product state, and in the math community they call the same idea a tensor train — which is really a better, more general name: it's like a train where the carriages are the tensors. That's really where they got that name.
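Here's a rough sketch of that repeated reshape-and-SVD procedure — a minimal "TT-SVD" in plain Julia. The function name and structure are mine; a real implementation would be more careful about truncation and normalization:

```julia
using LinearAlgebra

# Peel one index off at a time: reshape, SVD, keep U as a core,
# and carry S*Vt along to the next step.
function tt_svd(T::AbstractArray; cutoff=1e-12)
    n = ndims(T)
    dims = size(T)
    cores = Vector{Array{Float64,3}}()
    R = reshape(Array(T), dims[1], :)     # lopsided matrix: first index vs the rest
    chi_left = 1
    for k in 1:n-1
        F = svd(reshape(R, chi_left * dims[k], :))
        chi = count(s -> s > cutoff, F.S) # the rank revealed by the SVD
        push!(cores, reshape(F.U[:, 1:chi], chi_left, dims[k], chi))
        R = Diagonal(F.S[1:chi]) * F.Vt[1:chi, :]
        chi_left = chi
    end
    push!(cores, reshape(R, chi_left, dims[n], 1))
    return cores
end

T = randn(2, 2, 2, 2, 2, 2)
cores = tt_svd(T)
[size(c, 3) for c in cores]   # the bond dimensions chi; if T secretly has low
                              # tensor-train rank, these come out small
```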
So — any questions about what I mean by that? We're not going to use that algorithm; it's more of a motivating idea for why this could be a good form in which to approximate a tensor. Okay, great. So you think about doing this if you want to motivate the form, but in practice you more or less just start with the form. You say: we imagine some big tensor that's so big we could never store it — we don't have access to the whole thing, right? Remember, big tensors can have something like 2^50, about 10^15, numbers inside. So we imagine it, but we never really work with tensors that big; we try to approximate them in some form. We just write down an approximation that's motivated somehow — and this is a particularly good one in many cases — and the approximation is that every external index of the original tensor goes onto just one of these smaller core tensors, each of which has three indices.

Yes — that's a good question. Yes, that's the case: you can prove that if this form exists, you can always find the optimal ranks, the optimal chis. One way to think about that: suppose I took this form at the bottom and put random numbers into it, but chose the chis to be small — say I choose them all to be 4 — and then I mash all these tensors back together, contracting them all to get the big tensor above, and I don't tell you how I made the numbers; I just give you those numbers. Then when I run this process and do the SVDs from left to right, it actually will give me, each time, four, four, four, four — it'll rediscover those ranks. Thanks, yeah, good question. Any other questions about that? I'll say a few more things about this, but that's just so you understand what I mean by a matrix product state.

So the advantage — the hope — is that whatever data lives up here, whatever tensor you're trying to represent, whether it's a wave function that could be the ground state of some quantum system, or maybe the state of a quantum computer as you're evolving a circuit, you can try to capture it in this form with these ranks, or bond dimensions — these numbers chi, which tell you how big the internal indices are — being kind of on the small side. You want them to be like two, or four, or ten; maybe a hundred; maybe a thousand or ten thousand is okay, but after that it gets too expensive. So that's what you're hoping for: to balance accuracy against keeping these ranks, or bond dimensions, small. One neat fact about the theory of these is that if the chis are big enough, you can represent any tensor whatsoever — but they have to be exponentially big for that to be true. The bonds near the left grow by powers of two, or powers of your external index dimensions, and by the time you get to the middle, if you take the bond dimension there to be about 2^(n/2), you can represent any tensor whatsoever. But again, you hope they don't actually have to be that big. So say you're in some application where you get away with these being small and you have a good approximation. Then what's nice is that most of the algorithms you wind up running on these tensor networks require only polynomial cost in chi, in computation and in memory: most algorithms require about chi-cubed computation and chi-squared memory.
Right — so if chi is like ten or a hundred, these are not such big numbers, and you can do it. So what kinds of algorithms can you run? Some I'll just flash very briefly, for lack of time, and then I'll walk you through DMRG a little more, because that's the seminal one we use in physics a lot. By now there are quite a few algorithms — and I'll really dive into two tomorrow in some depth — but just to give you a flavor of how big the toolbox is, here are things you can do. You can sum two MPS, or more than two, always staying in the compressed form: if you have two MPS with small ranks, or bond dimensions, you can sum them with an efficient algorithm that recompresses the sum on the fly and dynamically figures out the ranks of the result. The rank might grow — if the two MPS are very different, the rank afterward might be the sum of the original two — but if, say, they're secretly the same, the same tensor added to itself, the rank stays exactly the same afterward, and the algorithm figures that out on its own. You can also multiply tensor networks by other tensor networks: if you have an MPS, you can multiply it by another kind of tensor network called an MPO, or tensor train matrix. If you think of an MPS as a high-dimensional vector, an MPO is like a high-dimensional matrix. So you can multiply these together, and again there's a controlled algorithm — actually quite a few different ones — that runs across, compresses the network down, and produces another MPS. The bond dimensions might shrink, grow, or stay the same, and the algorithm figures that out for you, all with polynomial cost — and in fact the cost is always just linear in the length in these algorithms.

One really neat thing you can do — I don't have slides on it, but it's good to know, especially in the context of machine learning — is that when you have a tensor network without loops (in this case it's just a line), you can do perfect sampling; these networks are autoregressive. The idea is that you draw a sample, treating the entries of the tensor as probabilities, or their squares as probabilities — there are different versions of it — and when you want the next sample, there's no Markov chain running: you just start the algorithm over and get a new sample. Just by how the algorithm works, it's provably perfect sampling, in the sense that there's no autocorrelation time or anything like that: you draw a sample, then another one fresh, then another, and each sample is really drawn from the true distribution defined by the tensor network. So these are just nice algorithms to have.

And there are other kinds of tensor networks, too, besides matrix product states. There are more general trees — any kind of loopless tensor network; they go by different names, tree tensor networks, hierarchical Tucker networks. And there are tensor networks with loops, the most famous being the 2D one called PEPS — I also wrote "tensor grid," which might be a more generic name; they're sometimes called tensor product states. These are quite popular in physics for studying things like 2D frustrated magnets, but they're much more challenging to work with and optimize: once there are loops, the algorithms get a lot tougher.
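As one concrete example of the toolbox, here's roughly what summing two MPS looks like with the ITensors.jl / ITensorMPS.jl packages — assuming their current high-level API; the sizes and cutoff here are made up:

```julia
using ITensors, ITensorMPS

sites = siteinds("S=1/2", 30)
psi1 = random_mps(sites; linkdims=8)
psi2 = random_mps(sites; linkdims=8)

phi = +(psi1, psi2; cutoff=1e-10)   # sum, recompressed on the fly
maxlinkdim(phi)                      # can grow up to 8 + 8 = 16

same = +(psi1, psi1; cutoff=1e-10)  # a state added to itself...
maxlinkdim(same)                     # ...the algorithm finds the rank stays 8
```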
In a lot of parts of machine learning — and even in physics, even in tensor networks — I've noticed people often like to emphasize representational power. They like to focus on statements about how you can represent anything, or they talk about the area law of entanglement, which I don't have slides on. How many people have heard of the area law of entanglement? Okay, right — so this is very commonly associated with tensor networks. It says that if the entanglement of your system obeys an area law, and the tensor network is your wave function, then these chis can be bounded and remain small. Those statements are all about representational power. But I personally think the real power of tensor networks is actually in the algorithms — that's really what I think is so powerful. The algorithms are based on linear algebra, and linear algebra is precise, fast, stable — a very nice set of algorithms to build more powerful algorithms on top of.

And the most seminal of these — the one that really kicked off the field — is the density matrix renormalization group, DMRG. So I'll give you a quick, lightning introduction to how that works. What it's really doing is solving a minimization problem, equivalent to finding a dominant eigenvector of a huge matrix; in physics language, it finds the ground state energy and the ground state of a Hamiltonian. Let's assume we can write the Hamiltonian itself as a tensor network — for common Hamiltonians, like local 1D Hamiltonians, even 2D ones, there are quite a few nice ways to do that. I won't go through how, but there are closed-form expressions you can write down for these tensors, and you can also use numerical methods to find them. So let's say the Hamiltonian is already in that form, which will be convenient. Then DMRG runs, and it finds the dominant eigenvector — the ground state — as a matrix product state tensor network.

First you form the expression for the energy — here I'm taking the state to always be normalized, for convenience — and it's just this sandwich, ⟨psi|H|psi⟩. That's a number; there are no indices sticking out. You can see how helpful the diagram notation is here, too — I wouldn't want to write this all out with letters. And what DMRG does is use an alternating strategy to optimize. In math this is called an alternating least squares strategy: you freeze some of the parameters — the ones in blue are frozen — and the ones in red are unfrozen, thawed, and we optimize them, then re-freeze them and thaw some other ones. We'll thaw these, then next we thaw these, and we go back and forth — that's called sweeping in the physics jargon, but in math they'd just say it's an alternating strategy. Okay — and then what do you do at each step?
What you do at each step is factor off, or pull out, the tensors that are unfrozen — the ones you're trying to optimize — and then view this as a miniature exact diagonalization problem to solve. So here comes the linear algebra, in a sense. You take all the other tensors, and even though this looks like a lot of computation, all this rest, there's actually an efficient pattern you can use to contract them all down, and it's not too bad. (You actually leave them in three pieces — I'm not gonna go into that level of detail — so let's say for convenience I put them all together into one tensor.) What you do next is throw this into a Krylov eigensolver, which does matrix multiplications and some smart linear algebra reasoning and figures out a better approximation to put back for this tensor. So you don't do gradient descent — you do something much, much faster that converges very quickly. And there's a little detail, for the experts: you don't fully converge this local problem; you just run a few Lanczos, or Krylov, steps, just to push yourself closer to the ground state. Then you get a better tensor, put it back, and move over to the next place. There's also a detail about doing two tensors at a time and factorizing, to make the bond dimension grow and shrink, that I'm not gonna get into right now. This is just a lightning explanation of DMRG — this version is called one-site DMRG.
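To give a feel for how little code a DMRG run takes in practice, here's a sketch of the 100-site Heisenberg example using the ITensors.jl / ITensorMPS.jl packages — assuming their current API; the sweep parameters are made up but typical:

```julia
using ITensors, ITensorMPS

N = 100
sites = siteinds("S=1/2", N)

# Spin-half Heisenberg Hamiltonian, built directly as an MPO
# (a tensor network form of H, as assumed in the talk)
os = OpSum()
for j in 1:(N - 1)
  os += "Sz", j, "Sz", j + 1
  os += 1/2, "S+", j, "S-", j + 1
  os += 1/2, "S-", j, "S+", j + 1
end
H = MPO(os, sites)

psi0 = random_mps(sites; linkdims=10)

# Alternating ("sweeping") optimization; the bond dimension is
# allowed to grow sweep by sweep up to the given maxima
energy, psi = dmrg(H, psi0; nsweeps=10,
                   maxdim=[10, 20, 100, 200, 400],
                   cutoff=1e-10)
```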
So let's just see how it works. This is a movie I made of DMRG running on a 100-site spin-half Heisenberg model, and the speed of the movie is about real time on the computer — pretty close to how fast it actually runs. The heights I'm plotting are the local energies — in the current wave function, the expected value of S·S on each bond — shown as the movie runs. So here it goes, right? That's sweep one, and you can see these quantities settling down very nicely. Okay, great, there you go. No — sorry, I think I mislabeled what's plotted: these are the magnetizations, the expected values of S — that's why they go to zero — and the energy is at the top. A little mistake in my slides there. But I'll run it again, and it's fun to watch the energy in the corner: it shows the total energy, and you can see it coming down very fast. Watch — once it gets the first few digits, it really starts putting on the correct energy digit by digit by digit. So these are very, very fast and precise algorithms when they work; in the best case, you can see many digits are frozen and it's just working on the last few. Okay — so that's about a real-time run of DMRG for that system.

So it's an extremely powerful algorithm. Here's a talk I just saw two weeks ago — I clipped the paper connected to it — where Ian McCulloch and Osborne are showing new methods for improving DMRG, making it even more stable and reliable, and you can see how precise this method is: the error in the energy at each moment of the algorithm's running time, compared to the exact energy, comes down on this log scale to very high precision. In the end you get 10^-9 error in about a hundred seconds. A very efficient algorithm.

So then the motivation, at least for me and I think for some other people, was: this is working so well in physics, and there are all these other areas of application we can envision. After all, tensor networks are basically just function approximators — what's so special about wave functions? Couldn't we use this for other things too? Maybe they only work for wave functions, but that would seem like a strange state of the world, especially when you see neural networks coming over and representing wave functions too. If tensor networks can represent wave functions, couldn't they represent functions in other areas? Could we harness the same power for something like machine learning? I would say the answer is yes — but then the question is how well it works and what's gonna happen in this field, and I'll try to unpack that for you today and tomorrow.

I would say there are three challenges in doing this. Not that they're insurmountable, but these are the three main things you have to think about when you attempt an application like this, applying tensor networks to machine learning. They are: representing the data; coming up with algorithms for training; and selecting good problems — problems where this tool is the most advantageous to use. Because it could be that 20 years from now this finds a lot of applications, but they might be really different applications from, say, neural networks. It may still be that neural networks are what you want for vision, and tensor networks will be better for other things — we just don't know right now. I'd say there hasn't been enough research to really know; we're still figuring it out, and some of these algorithms are still being discovered right now. So it's a pretty active area, but it could use more people — maybe some of you — getting into it and helping to push it forward. Okay: today I'm mostly gonna focus on the representation of data, which is actually more interesting than it sounds, and then tomorrow I'll be talking more
about training algorithms and good problem selection. Okay — so, representation of data. Again, more interesting, I think, than it sounds. Say we're given a piece of data, a sample from some data set — here I'm thinking of images, just because it's more visual — and say each piece of data has N numbers in it, N pieces of information. What I mean is: say we reach into the MNIST handwriting data set and pull out one of the images, and it's an image of the number one. We count how many pixels there are, and there are N pixels — so that's N numbers we get when we read that image in, these N grayscale values. A very common thing to do in machine learning is to vectorize that: you unroll those pixels into a big vector, and this might be the thing you put in at the bottom of a neural network, at the beginning — when you see all those input neurons, they're just the numbers from this vector x. So we can view our data abstractly as a vector of length N. And this is a good part to stop me if you have any questions, if I go too fast through any part.

So I'm gonna talk about two basic representations. This is just my own take on this distinction — in the end we'll see there's kind of a continuum in between, ways of mixing the two ideas — but one representation I'm calling high dimensional and the other low dimensional. High dimensional: the idea is that if you have vectors of length N, you represent each one by a tensor with the same number of indices — N indices. I'm calling that high dimensional because this tensor lives, formally, in a 2^N-dimensional space, if these are indices of size two — if they're indices of size 255, it's a 255^N-dimensional space — but you're putting the data in a very tiny corner of that space. I'll give you a concrete example of what these vectors could be, but what I'm saying is: you take each pixel and map it to a vector, and then you formally put these vectors next to each other, as in an outer product. If you actually took that outer product, you would get a vector of length 2^784 or so — something enormous — but you don't really take the outer product; you just imagine taking it. You put the data in this little product-state corner of the high-dimensional space. So I'm calling that the high dimensional, or state encoding, or product encoding — I'll give you a really concrete formula in a minute.

The other one is the low dimensional representation. This one's been known for a long time and rediscovered in many fields, so it goes by other names, like amplitude encoding, or quantics tensor train. It's a very common idea in quantum computing — if you know about the HHL algorithm, or PDE solving with quantum computers, or Grover's algorithm — and actually the number system we use every day is this encoding: the idea of having a base, as in decimals. I'll go into more detail about that in a minute too, but just to say what I mean by low dimensional: it may look really similar to the other one — there's a bunch of indices — so what's different? Here you only use log N indices. So if, say, there are 64 pixels, you only use six indices to represent the data.
So it's very different — very low dimensional — even though it's a tensor. And what's interesting, too, is that generally this tensor is entangled: the bond dimension is not one, whereas over here it is one. That may seem kind of paradoxical, but we'll unpack what I mean.

Okay, so let's go into a little more detail about the high dimensional representation. Yes — well, that's a good question. Really, it's not a hard and fast rule that this one is a product state and this one's an MPS. The distinction I'm really making is how many indices there are and how each setting of the indices corresponds to the data, and I'll go through that in more detail — I think the examples will help. It's just easiest, for this one, to choose forms where it's not an MPS; you could make it an MPS, and that's more like projecting the data, kind of excluding certain patterns. It's a little hard to cover all of this, but we could talk more about the advantages of that. Here, it's more necessarily entangled, because the indices work cooperatively — I have a slide explaining what I mean by that, but the indices cooperatively walk you through the data in this form. We can be really concrete about that.

So: the high dimensional representation. This one is motivated more by second quantization, for those of you thinking like quantum physicists, and it's the one that was proposed in 2016 by a paper I was on, and also by this other paper from the math community. You can think of it — in the case of images, it's a fanciful way to put it — as taking each pixel of an image and promoting it to a spin, or a qubit. That's the physics language; in math language, you just map it to a small vector. Then you take the outer product of all these vectors — a formal outer product, meaning you just write it down and stop. You don't actually multiply the vectors together, because then you'd end up with an exponentially big vector you wouldn't want to store. And there are different ways to do this. One is to map the x values — say they go from zero to one — into this vector of cosine and sine with a pi-over-two argument, so it rotates from pointing up to pointing sideways as x goes from zero to one. But another form — and I'm actually gonna use this one, because it's a little easier to explain some things with — is just to use one and x. That's another form, and we'll see in a minute why it's an interesting choice; you could pick many others. The name I like for these is "local feature map" — borrowing the term feature map from kernel learning. It's saying: how do I take a number and lift it — feature-map it — into a set of features? Here the features are cosine and sine; here they're one and x; you could pick x squared and e to the x, any features you want. It's a bit like kernel learning, where the kernel is a bit made up and you just try different things, or like neural networks, where you make up an initial weight layer that could have this width or that width. It's something you come up with and then think through the implications of.

Okay — so once you have this local feature map and this formal product of vectors, how do you turn it into an actual model you could use in machine learning? In the end, we're after a function representation with weights that we can train.
You do this by contracting it with a weight tensor. You say: here's this outer product of vectors; I just need a weight tensor with the same number of indices, so that when I contract everything together the result is a number — no indices left over; they're all contracted. And that's clearly a function of x1 through xn, because here are all the x's going in down at the bottom. I just picked six to be concrete, but it could be any n you want. But that would run into the curse of dimensionality — too many indices. If I'm doing MNIST, with its 784 or so pixels, I can't have 2^784 weights; way too many. So then the idea — I'll jump ahead and then jump back — is to use a tensor network to make this more efficient. What if we just guess, or hope, that the weights can be replaced by a tensor network? Now we have a chance of actually training this thing. And just as an MPS with a large enough bond dimension can represent any tensor, here, a higher bond dimension, or rank, gives more parameters, more representational power, in this model.

Okay, but what kind of model is this, at least for this example of this local feature map and this data encoding? This kind of model is actually just a very high order polynomial — and this gets into what motivates the choice in the first place. Think about ticking the indices through all their values — each index runs over two values here, because each vector has two entries — but in an organized way, maybe motivated by thinking of perturbation theory: you have a vacuum, and then one particle popping out of the vacuum, or two particles. You could have all the indices set to one; that picks out one times one times one, all the way across, so you just get a scalar — a constant shift. Or you could set one of the indices to two and all the rest to one. Say the first one is set to two: that picks out this x1 down here, and the rest are all ones, so you get the weight W_{211111} times the input x1 — or, if the two is in the second place, times x2, or in the third place, times x3. We could also have entries of the weight tensor where two of the indices are set to two, and then we get x1 times x2, or x1 times x3 — in kernel language they would say these are product features; they'd even say product "interactions," though I wouldn't call them interactions, since in physics that means something else. And so on: three twos, four twos, and finally, when all the indices are set to two, you have this huge product of all the x's. So this is actually a very nonlinear function — a very high dimensional, nonlinear function. Any questions about that?

Okay, great. Now, once you replace the weights with a tensor network — it could be an MPS, it could be something else — it's a little less clear what the function actually is. Is it still really this polynomial? Kind of, but now the W's are related to each other in some way; there's some extra structure you're assuming, or putting in. Maybe the structure is that the data is one dimensional — maybe these x's are amplitudes of sound, or something having to do with words — and then this one-dimensional ansatz makes more sense. If it's an image, maybe it makes less sense, but you could still just brute-force it, and it might still work.
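Here's a small sketch of that construction — the local feature map and the full (exponentially large) weight contraction — for a tiny n, just to pin the formula down. The names are mine, and the full weight tensor is only feasible because n is tiny:

```julia
# Local feature maps from the slides: pick one
phi(x) = [cos(pi/2 * x), sin(pi/2 * x)]   # or: phi(x) = [1.0, x]

# f(x) = sum over all index settings s of W[s] * prod_j phi(x_j)[s_j].
# With phi(x) = [1, x], this is exactly the multilinear polynomial above:
# all-ones picks the constant, one "2" picks x1 (or x2, ...), two "2"s
# pick products like x1*x2, and all "2"s pick x1*x2*...*xn.
function f(W, xs)
    total = 0.0
    for s in CartesianIndices(W)          # tick the indices through all values
        term = W[s]
        for j in eachindex(xs)
            term *= phi(xs[j])[s[j]]
        end
        total += term
    end
    return total
end

n = 6                                      # six inputs, as in the slide
W = randn(ntuple(_ -> 2, n)...)            # 2^6 = 64 weights; 2^784 would be hopeless
f(W, rand(n))
```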
So how do you actually use this to do machine learning? If you're doing a kind of machine learning where a scalar output is what you want, then this is all you need. If you want a multi-label, or vector, output, one way is to put on an extra index; another would be to use multiple MPS, one per label. You can try different things — it's a very flexible framework. So say you put on an extra index that runs over as many values as you have labels: with 10 labels, the extra index could be dimension 10. It's a bit arbitrary where you put it — on the left, somewhere in the middle; if you made a tree, it'd be more natural to put it at the top of the tree. I'm just being general here.

Then one way to train this model — and tomorrow I'll show you two much more detailed and interesting ways — a more basic way, is to have some cost function that defines your learning task. Here I'm thinking of supervised learning, which is equivalent to regression: you have data, these inputs x_j, and you also have target outputs y_j — what you wish the function would output for those inputs. So you have the x's and the y's — that's your data set — and you say: I wish f would output y_j for every x_j that I put in, so I'll penalize it when it doesn't, and try to minimize that cost. Then you use the chain rule: take the derivative of the cost back through until you hit the weights, where you actually have parameters you can optimize; work out the gradient; run the usual gradient descent algorithm — if some of you don't know it, we can discuss, but I think you've been hearing about it. And you alternate back and forth, updating these tensors one at a time, or maybe two at a time, until you've minimized the cost function. That's the most basic way to do this.

Question? Yes — the bonds sticking out at the bottom are always two for this kind of feature map; if you used more features, they could be more, but in this case it's two. Good question. And the one at the top could be 10, or 100 — as many labels, as many distinct discrete y's, as you're trying to output. If the y's are numbers, there's a slight mismatch here between the math at the bottom and at the top: really, this f should now have an index on it, and the y should be a little vector — I should have fixed that; here I'm having f output a number and the y's are numbers, but you can see how to generalize it.

That's right — or in this case, on that site, yeah. So I'm saying each of these circles — I kind of flipped the colors on this slide — but I unfreeze the one that's blue, the rest in red are frozen, and I optimize just the one I'm pointing at with that arrow, and then I go back and forth. You don't have to do it this way, but it's a pretty efficient way to do it. Good question.

Thank you — good question. I think so. This is where I think more research is needed, because right now this idea of using tensor networks is pretty ad hoc, I would say, in terms of theory for machine learning. But actually I think that's a real
Question? Yes, the indices sticking out at the bottom are always dimension two for this kind of feature map; if you used more features it could be more, but in this case it's two. It's a good question. And the one at the top could be 10 or 100; it's as many labels, as many different kinds of discrete y's, as you're trying to output. If the y's are vectors, then there's a slight mismatch here between the math at the bottom and at the top: really this f should now have an index on it and the y should be a little vector. I should have fixed that; here I'm having f output a number and the y's are numbers, but you can see how to generalize it. That's right, or in this case on that site, yeah. So I'm saying each of these circles, I unfreeze; I kind of flipped the colors in this slide, but I unfreeze the one that's blue, the rest in red are frozen, and I just optimize the one I'm pointing at with that arrow, and then I go back and forth. You don't have to do it this way, but this is a pretty efficient way to do it. Good question, thank you.
Good question. I think so. This is where I think more research is needed, because right now this idea of using tensor networks is pretty ad hoc, I would say, in terms of theory for machine learning. But actually I think that's a real opportunity too: there's been so much success in theory for tensor networks in physics, I don't see why we couldn't bring that success over. It might have to be rethought and redone, but I think the potential for theory is very high with tensor networks, and I'll make good on that a bit tomorrow. Some things we can feel pretty confident about: in this 1D kind of architecture, the larger you make the internal bond dimension, the bonds going through here, the longer-distance correlations you can capture. That's for sure, because you can show that correlations generically decay in these 1D networks; they eventually cross over to exponential decay, but over short distance scales they can reproduce power laws very well, and how far they can mimic a power law before crossing over to exponential is related to the bond dimension. So if it's bigger, you can mimic a power law out to longer distances, and in fact you can capture essentially arbitrary correlations over short distances as well. Hopefully that goes some way toward answering your question, but I would say it's mostly not something we have a very clear theory of right now; in fact, it's something we need more theory about.
Okay, so that was the most basic strategy for optimizing one of these tensor network machine learning models, and like I mentioned, tomorrow I'll show you two much more sophisticated strategies that don't involve gradients. So this was something I was really interested in back in 2016; we proposed this model and wanted to try it on something popular, so we tried the MNIST handwriting data set: supervised learning where you're given these images, labeled zero to nine, and you try to predict the labels by feeding the images into the model. We tried the simplest thing, basically this model with that cosine/sine feature map, did the gradient descent training, and trained to 99.95% accuracy on the training set, which is 60,000 images, and then got a little above 99% accuracy on the test set, which is only 97 of them incorrect. So we were happy with that. It was just to show that the idea can work; we didn't use all the best practices of machine learning. We kind of snooped on our data, if you know what that means: we were checking on the test set over and over again during training, doing all kinds of things you're not supposed to do. But the point of the experiment was just to show that there does exist some choice of parameters that makes this work. To go into a little more detail, we tried different bond dimensions: at bond dimension 10 we got a 5% test error; raising it to 20 brought that down to 2%; and we had to go to 120 to get below 1%. The thing is, 120 is not that big by physics standards: people routinely do DMRG with bond dimensions of 10,000 these days, and I think there's even a paper doing 60,000 using TPUs at Google. So 120 is not that big, and yet this calculation took quite a bit of time, because 60,000 images is a big prefactor when you're churning over all that data. This is why I was saying we need better algorithms, and I'll show you tomorrow what I mean. Okay, so that was a quick walk through the high dimensional story; now let me switch to the low dimensional story, which in some ways is even more interesting.
Some of this is about things I've only been learning about recently, and hopefully it'll be interesting to some of you too. In the high dimensional representation, each index corresponded to one pixel of the image, or for a sound signal, to one moment in time. In the low dimensional representation the indices are different: they work collectively to access each feature. If you want to say, give me feature number seven, all the indices work together to go get that feature out of the tensor for you. It's a bit like first quantization versus second quantization, where in second quantization the tensor indices are points in space, and in first quantization you're storing positions of particles: something very different. Let me be concrete about how this works. Say we have our data vector again, could be the same data vector, and here I'm zero-indexing it, so the first entry is x0 and it goes up to x_(N-1). What we do is pack that into a tensor using binary integers to do the encoding. We say: if all the indices are zero, the number inside the tensor should be x0. If we make the last one a one, that's x1; making it one-zero at the end gives x2, then x3, and so on. I'm just doing binary counting, with the binary digits going across the top. So whereas before we turned a list of N numbers into a tensor living in a 2^N dimensional space, here we have N entries and we only use log N two-valued indices. That's something quite different. And so on, until we set all the indices to one, which is the last entry of the vector.
Now, a really neat thing you can do with this low dimensional form is use it to represent a continuum. Same trick, but you re-imagine it a little differently: we can represent data living in a continuum, for example a function f evaluated on a very fine grid of spacing one over 2^n, where little n is how many of these indices we have. Say we have eight indices; then we have a grid of size 2^8, and the grid points are indexed by the same binary integers, but now the digits sit after the decimal point. So 0.0000 in binary is the number zero, and going all the way up, 0.1111 approximates something like 0.99 in decimal fractions. I'll go through that in more detail, but the idea is that we can take this continuum, turn it into a grid of N points, and then we only need log N bits to put the whole thing into a tensor. We just define the tensor so that its n bits are the n binary digits of the grid point, and as you all know, with a handful of digits, say 20 digits, you can represent a very fine grid; you can work to very high precision in a continuum.
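A minimal sketch of this re-indexing; nothing is approximated, it is pure bookkeeping. One caveat: Julia's reshape is column-major, so the first tensor index ends up being the least significant bit, the reverse of the left-to-right digit ordering on the slide.

```julia
n = 8
N = 2^n
x = randn(N)                            # data vector; x[1] holds x_0 (zero-indexed data)

# Pack the vector into a tensor with n two-valued indices: the indices are
# the binary digits of the position.
T = reshape(x, ntuple(_ -> 2, n))       # first index = least significant bit

bits_to_pos(bits) = sum(b << (k - 1) for (k, b) in enumerate(bits))  # 0-based position

bits = (1, 0, 1, 0, 0, 0, 0, 0)         # digits, least significant first: position 5
@assert T[(bits .+ 1)...] == x[bits_to_pos(bits) + 1]

# The same trick read as a function on a fine grid over [0, 1): the digits
# now sit after the binary point, with grid spacing 1 / 2^n.
grid  = (0:N-1) ./ N
fvals = cos.(2pi .* grid)               # f sampled on all 2^n grid points
F     = reshape(fvals, ntuple(_ -> 2, n))
```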
So let's see how to do this. For example, if we set all the bits to zero, that's f(0); we put that number in the tensor. If we set the first bit to one, and this is base two, we jump halfway through the continuum, so that's f of one half. If we then also set the next bit to one, we jump to three quarters, so that's f of three quarters, again in base two. And you also have very fine precision here: if you start toggling the last bit back and forth, you're moving by just one grid point, from three quarters to three quarters plus one over 2^n, for whatever precision you want to work at. Any questions about that? So this is actually a way of inputting a continuous variable into a tensor. Before, I was showing how tensors eat a bunch of discrete variables; here all these indices work collectively to swallow one continuous variable, so this is like f(x) as a tensor. This may go toward the question you were asking earlier, too. What's neat is that this is actually a hierarchical representation of data; it's like a multigrid method. But I also want to emphasize that, even though I'm saying fancy things about it, this is also just the regular number system we use every day, just in base two, and it would be the same in base 10. So even though it's something very old, it's something kind of profound: when we count, we're actually counting in a hierarchical, multi-scale representation. The idea is that we can locate numbers. How do we represent one half minus one over 2^n, the number just before a half? We write it as 0.0111... with a bunch of ones. That's like walking down a tree: take the left half, then take the right half of that, then the right half of that, and the right half of that again, so we zoom in exponentially toward a point. Or take this other point, which is just the number one quarter: go into the left half, then go to the right half of that, but then don't go any further. You walk down the tree, and that's how you locate the point, and so on; that one there is three quarters. So that's what I mean by this low dimensional representation: you can use it either as a way of unpacking a vector, or as a way of rolling up a whole continuum of points, so you can view it in either form.
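The tree-walking picture is easy to play with in code. A small illustrative sketch: reading the digits coarsest-first, each bit chooses the left or right half of the remaining interval.

```julia
# Locate the point in [0, 1) singled out by binary digits after the point,
# most significant (coarsest) digit first: 0 = take left half, 1 = right half.
function locate(bits)
    lo, hi = 0.0, 1.0
    for b in bits
        mid = (lo + hi) / 2
        b == 0 ? (hi = mid) : (lo = mid)
    end
    return lo
end

@show locate([0, 1])            # 0.25   : left half, then its right half
@show locate([1, 1])            # 0.75   : right half, then its right half
@show locate([0, 1, 1, 1, 1])   # 0.46875: zooming in toward 1/2 from below
```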
As an example application of this low dimensional idea, there's a paper from 2022 that I really like, where they did a mixed representation, which is a really neat idea. They took, again, MNIST, although they focused on Fashion MNIST, which is a much more challenging data set: grayscale images of clothing, like shoes, handbags, shirts, things like that. It's genuinely challenging; the highest accuracies you see reported are maybe only 90 to 92% on the test set, even from very state-of-the-art deep neural networks. They said: let's do the high dimensional encoding between patches of the image, so the patches go into a product with each other, but within each patch we'll do this zigzag or snake ordering and use the low dimensional encoding. So you see there are four things in a product with each other, but within each patch there's a small MPS; those are the different colored short MPSs at the bottom. Each of those is just that patch's vector unrolled in the hierarchical bit form, and then they product these together, more like the first thing I showed you. The idea is to balance the benefits of each one. When you do the low dimensional encoding, you're basically just working with the original vector you had, represented in a neat way; with the high dimensional encoding, you're really going up in dimension, so you have a much more powerful classifier. This is a way of getting some of that power without something so high dimensional that it's hard to train and too expensive. It's a really good idea, and when they did it, they only needed the weight MPS, the light purple one on top, to have a bond dimension of 10, and they got very close to state-of-the-art results, something like 90% on the test set, which is really good for Fashion MNIST. So I thought this was just a really great idea, and they also experimented with the bond dimension of the low dimensional data encoding on the bottom. Okay, great.
Yeah, except for what, sorry? We don't, no. Oh, a certain chi max? We don't. So it's a good question; I think right now it's more of an opportunity. These kinds of studies are mostly just showing that it can work; they're very empirical. So it already works, and then the question is, what makes it work? That's the kind of knowledge we have right now. Another way of asking the question would be: can we show ahead of time that it will work? We don't have that; that work has basically not been done, as far as I'm aware, for general things like images. Now, there has been some very deep work done for functions, and I'll show you some examples tomorrow. If you remember the sum of Gaussians I showed, there's some extremely nice work showing how tensor networks are able to capture functions like that. For example, you can rigorously prove that essentially any smooth function, when represented in this low dimensional form I mentioned, has a low bond dimension, and you can actually bound how low, in certain senses. For example, if a function can be fit by a polynomial of order p, the bond dimension is at most p plus one. You can prove things like that for functions. So that's a little of what I meant: right now we have some areas, like images, where we have very little theory, and other areas where we have a lot more. Tensor networks may also have more of a niche role to play: they might end up being quite good for machine learning math functions and maybe never very good at images compared to neural nets. But maybe not; I think it's just too early to say. It's a good question to ask, thanks.
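That last claim about polynomials is easy to check numerically, as a sanity check rather than a proof: sample a degree-p polynomial on a 2^n grid, reshape, and compute the rank of the matrix unfolding at each cut between bits. The bond dimension an exact MPS would need at that cut is that rank, and it should never exceed p + 1. The polynomial below is an arbitrary made-up example.

```julia
using LinearAlgebra

n, p = 12, 3
N = 2^n
grid = (0:N-1) ./ N
vals = @. 2grid^3 - grid + 0.5          # a degree-3 polynomial on the grid

for k in 1:n-1                          # cut between bit k and bit k+1
    M = reshape(vals, 2^k, 2^(n - k))   # unfolding of the quantics-style tensor
    r = rank(M; rtol = 1e-10)
    println("cut $k: rank = $r  (bound p + 1 = $(p + 1))")
end
```

Every printed rank should come out at most 4 here, matching the bound quoted above.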
Yes, you can; I just don't have that number available right now. But roughly, from my experience, the right comparison off the top of my head is that in that first study I showed, the MNIST one with gradient descent, it's a lot of parameters, a lot more than an equivalent neural net. And this is because, very roughly, a tensor network has to use its parameters for two things: both to be the weights and, so to speak, to do the non-linearities. In a neural network you've written the non-linearities down as a piece of code that runs on the computer; they're explicit. In a tensor network, the thing that's analogous to the non-linearities is also modeled by the parameters in the tensors. I can't prove this, but my intuition is that some of the parameters go into that part of the work, and some go into saying this weight is big, this one is small. So it's not as parameter efficient. That's why I think the real way to compare them, in the end, will be not by counting parameters, because that's a very gradient-descent-centric view of things, but to compare them in terms of the best algorithm you can think of for each one: how fast it runs, how reliable and reproducible the results are, how interpretable, how accurate, things like that. So in the end I think the algorithms are the key point: if we can come up with much better algorithms, ones that maybe don't touch the parameters as many times, for example, that's the way forward, I think. Did that answer your question? One other partial answer is that, just by a small tweak of the idea, these authors needed a much smaller bond dimension: the thing I did needed 120 on MNIST, and they only need 10 on Fashion MNIST, just by changing the architecture a little. So there's still a lot of exploration going on about how to best use these ideas; this is not that old a paper, and the whole thing is still pretty new.
You could, and that could be a neat thing to do, although I should emphasize more strongly, and I'll have another slide saying this a different way, that the non-linearity is already here. There's already an up-front non-linearity in the idea: just mapping the x into that tensor in the first place is non-linear. But it's more in the spirit of kernel learning, and I'll have a slide about this later too: in kernel learning the idea is that the non-linearity happens once, up front, on the data, and then all the weights enter linearly, which has a lot of theoretical advantages. But it has a big drawback too, which is a lot of weights. There's always been this back and forth in the field between those two approaches, and right now neural networks are ascendant, but I don't know if that has to be the case forever; we'll see where it goes. And in the end it doesn't have to be one versus the other; we might have hybrid architectures, and there are some papers about that. Maybe we use neural nets as initial layers and then switch to tensor networks, and I'll even mention that you
can flip that around too; I'll show you a slide on that in a second. So let's end with a few slides to round out the talk with a couple of more theoretical remarks about using tensor networks in machine learning; some of you have already asked a lot of the right questions about what can be proved, what's known, what's not known, how they compare. One thing that's nice, I think, is that it's actually very straightforward to show a kind of universal approximation theorem for tensor networks, even for continuous inputs. Here's the theorem, in roughly two slides, and it doesn't require a lot of heavy math; it's almost just an observation. The argument is this. Remember how I said you can represent any discrete function, meaning a function of discrete variables, as a tensor? We can handle continuous variables as well by blocking bits together: for this input x1 we use a group of n bits that collectively represent x1, and here's x2, here's x3. If an input is discrete, we give it a single index; if it's continuous, we give it a little group of n bits, and we can make that as precise as we want, exponentially precise: with 20 indices I can represent the continuum to one part in 2^20, extremely precise. And then the tensor has arbitrary entries, so I can store any function I want on this exponentially fine grid. So that's already a kind of universal approximation theorem for tensors, for functions of continuous variables. Then, to try to make this efficient, I can replace that tensor by a network, a matrix product state or some other kind, and it's easy to prove that for a large enough bond dimension an MPS can represent any tensor. So there you go, that's the universal approximation theorem; the point is that tensor networks have one too. That was the whole proof, basically. And again, there are no explicit non-linearities in the tensor network; all the non-linearities are in the representation, in how you put the data in in the first place.
And then, I already said this in words, but here's one slide to say that this idea of tensor network machine learning is really a form of kernel learning. If you're familiar with kernel learning, or if you run across it later, this slide connects the two. The concept, from my point of view, is basically this formula. A linear model is what you'd have without this phi; imagine phi is covered up, and a linear model is just w dot x. Kernel learning says: what if we supercharge that model? x is too low dimensional, so we make x bigger by putting it through a feature map phi. The feature map could be this kind of idea; there are many choices, you could put it into some Gaussian functions or something. Now we've made x into phi of x, this bigger, higher dimensional, feature-mapped version of x. I don't know if you've ever seen David Byrne wearing the big suit, from Talking Heads; it's like putting x into a bigger suit. Now you have more weights you can train, because phi is much bigger than x, and then the weights enter linearly; it's still a linear classifier, for that part of the model.
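Written out, the three models being compared look like this; this is the same content as the slide formula, just collected in one place, with the MPS form of W as in the earlier diagrams:

```latex
\begin{align}
  f_{\mathrm{linear}}(\mathbf{x}) &= W \cdot \mathbf{x} \\
  f_{\mathrm{kernel}}(\mathbf{x}) &= W \cdot \Phi(\mathbf{x}),
    \qquad \Phi(\mathbf{x}) = \phi(x_1) \otimes \phi(x_2) \otimes \cdots \otimes \phi(x_n) \\
  W_{s_1 s_2 \cdots s_n} &\approx \sum_{\{\alpha\}}
    A^{s_1}_{\alpha_1}\, A^{s_2}_{\alpha_1 \alpha_2} \cdots A^{s_n}_{\alpha_{n-1}}
\end{align}
```

The weights enter linearly in the second line; the third line is the tensor network compression that makes them storable at all.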
And you can see that tensor network machine learning is the same idea. We feed x into this feature map; it's high dimensional, although somehow efficient, a product, rank one, something we can work with. The weights are also high dimensional, but then the other idea is that we compress the weights. If you know about kernel learning, and maybe some of you don't, and that's okay, but if you read that literature, they have this idea called the kernel trick, where they say the weights are too high dimensional, we could never see them or even imagine them; you don't even want to look at the weights, your eyes would melt out of your head, weights are bad. They kind of have that view, and they're right, if you really try to work with all the weights. So they do this thing called the kernel trick, where they pass to a dual representation and train dual weights, which are fewer; their number is connected to the size of your training set, but then you have bad scaling with the size of your training set. The idea of this program is to flip that around and say: hold on, maybe we can look at the weights, if we look at them as a tensor network. Then we get good scaling with the training set; all the algorithms I'm going to mention scale linearly with the training set size, which is the best you can do. Maybe then we'd be back in the game compared to kernel learning, but we have to think harder, because having solved those problems up front, we now have to optimize a tensor network, which has a lot of parameters, so we have to think of better algorithms. That's my one-slide thesis on that connection.
And to wrap up, many other papers have now been written about this basic idea of mixing machine learning and tensor networks; here are two or three papers to show you how diverse this topic is already becoming. One that I find really nice actually precedes the work I showed by quite a bit and came out of the math community, and I still see it being used; every month I'll see a paper about this in different forms. It's the idea of using tensor trains, tensor train matrices, tensor networks, inside of neural networks. This is a really neat idea; there are even some papers, I think from last year, about using them inside large language models. The idea is to reduce the number of parameters: what if, instead of an arbitrary weight layer, this weight layer were an MPO, a tensor train matrix, that kind of thing, some kind of tensor network? Then we don't have to store all the weights; we store them in a compressed form, and when you do the gradient descent, you do it through these tensor network weights, but still with the non-linearities and all the other parts of a neural network, which could also be a transformer or any other kind of network. Then can we get some advantage in training time and cost, and maybe not lose any accuracy? And indeed, they find they don't lose very much accuracy when they do that.
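The parameter savings are easy to see with back-of-envelope arithmetic. A hypothetical sketch (the sizes are made up, and 1024 = 4^5 is just one possible reshaping): a dense 1024 x 1024 layer versus the same layer stored as a five-core tensor-train matrix with mode sizes 4 and bond dimension r.

```julia
dense_params = 1024 * 1024               # 1,048,576 weights in the dense layer

# A tensor-train (MPO) layer: ncores cores, each of shape (rl, din, dout, rr),
# with bond dimension 1 at the two ends and r in the interior.
function tt_params(r; ncores = 5, din = 4, dout = 4)
    total = 0
    for k in 1:ncores
        rl = k == 1      ? 1 : r         # left bond dimension
        rr = k == ncores ? 1 : r         # right bond dimension
        total += rl * din * dout * rr
    end
    return total
end

for r in (2, 4, 8, 16)
    println("bond dim $r: $(tt_params(r)) parameters vs $dense_params dense")
end
```

Even at bond dimension 16 this is under 13,000 parameters, roughly an 80-fold compression; whether accuracy survives the compression is the empirical question those papers answer.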
Another paper that's really nice, just from last year, by Thomas Barthel and collaborators, is more in the spirit of the things I did show you, this kernel learning with tensor network weights, but the idea is that they further factorize each tensor into a sum of outer products of vectors. They use a tree instead of an MPS, and in every node of the tree the tensor is written as a sum of outer products of three vectors, and then they train through that. They find the training is much faster, and they get really state-of-the-art accuracy on images and things. So there's some really neat work out there, I would say.
I'm running out of time for today, so let me just wrap up. I was outlining today a very broad framework for machine learning using tensor networks: the idea is that we could bring ideas and tools from physics, and now also applied math, over into machine learning. Ideas like: you give me a Hamiltonian, I can give you its dominant eigenvector, or ground state. Maybe now you could say: you give me data, and I can give you some weights that are compressed, and do machine learning that way, hopefully with some of the same advantages: adaptivity, speed, precision, linear algebra ideas. I'll show you more tomorrow about all that. Then the question is, does it really deliver? I'm sure a lot of you are very practical, and you may be thinking, should I use this or should I just ignore it? I would say, as of right now, 2024, it really depends. For images and computer vision there are some promising results, the ones I showed you, but it's not competitive in the simplest sense: you can't download a package right now and just beat something on ImageNet. It's not like that yet, somewhat because of software, and also because we just don't know the best ways to use it. For neural networks, the usual story you hear is that there were neural networks, and there were GPUs, and boom, it started working. That's the story people like to tell, but the real story is that there were these dozen or so people working on it since the 80s who got ignored and passed over for grants; you couldn't get a talk about neural networks into NeurIPS in the 90s, it was fringe. You can watch videos from as recently as 2009, I think, of a whole course on machine learning with just one lecture on neural networks, and the rest all kernel learning and things like that. And then it completely switched around in 2012. So you never know: right now this is a very small niche area; maybe it'll stay small, maybe it'll grow, we'll find out. As I was saying to some of the people asking questions, tensor networks have relatively many parameters compared to an equivalent neural network, and if you do gradient descent it's kind of slow and heavy and uses a lot of memory. But there are opportunities to go much, much faster than gradients: tensor networks, after all, are just linear algebra in high dimensions, period. That's really what they are, so if you like linear algebra and you think it's good, you kind of have to like tensor networks; that's sort of my thesis. And for low and medium dimensional functions this is already very powerful; people are doing some pretty incredible things right now in medium dimensions, with three-, six-, eight-dimensional functions, and I'll highlight those tomorrow: things like machine
learning orbitals for quantum chemistry, fluid flows and PDEs, optimizing landscapes, controlling robot arms. So I'll flash a few of those results for you tomorrow, and I'm going to focus on two novel algorithms. It's going to be a bit more technical tomorrow, so bring your thinking caps, and I'll show you how these algorithms work in more detail, and we'll think about what promise they could hold. So thanks a lot for your questions and for your time, thanks.
Thanks a lot for this very nice talk. I think we still have 10 minutes for questions, so we can go on with the questions, please. Let's use the microphone. Yeah, thank you for these very nice lectures, it was very interesting. I have a question regarding LLMs, because they are all the craze right now, so why not ask a question about them. For LLMs you have to work really hard to bring words, syllables, whatever you want, into a form that the LLM can understand; you have to do the tokenization and so on. Aren't tensor networks already in a kind of form like a transformer decoder? You already have sort of discretized inputs and probabilities as outputs. Wouldn't that be a kind of natural application? I think so. I mean, for a lot of questions people might ask, my answer might have to be: we don't know, it hasn't been tried enough. But I think it's really a nice idea. There's this really nice paper by Dario Palletti, I think from around 2017, where he had an idea something like this, and I'm still really curious about it; a lot of these papers are just out there, and as far as I know they've only been followed up on a little bit, because there just aren't enough people trying these things out at this time. Also, I think the people who have tried it have mostly used gradient methods, and as I mentioned, I think that's probably ultimately not the way to go; I think we have to use some other algorithms, and I'll mention those tomorrow too. But if I'm remembering correctly, the idea was to have this kind of MPO-type architecture, where you would input a string of characters and read something out on the other side, with some kind of internal state, some form like that. And there was actually some success with an idea like this, where you have an input part of the tensor network and then you switch to an output mode. I can give you this reference; it's by Palletti. Yes, more questions? So, if you imagine for simplicity a one-dimensional function, like the one being shown right now on the slide, and you do the thing you were explaining, so you have a grid of 2^n points and you do the low dimensional encoding: in that case, the Hamming distance between two bit strings doesn't have anything to do with the Euclidean distance on the line. Is this a problem, and is there a way out? And what about 3D? Yeah, it's a good question.
I think it's not immediately a problem; for 1D functions this still works super well, and when you study it you find it's encoding a lot more: it's a very multi-scale encoding. Something I like to do in this form is draw the tensors going downward to emphasize this, with the idea that this is the coarsest scale and this is the finest scale. You find things like: if you just leave off these bottom tensors, you still get a very good approximation, just a coarsening. So in this way of looking at it, the problem you mention, the bits changing very suddenly even though you move a small amount on the grid, doesn't really cause too much trouble, because these tensors at the bottom are almost like flat functions. Even though the bits change suddenly, the value doesn't change very much: all of these will be set to one and this will be zero, and all of a sudden these all flip to zero and that one switches to one, but things are very smooth down here, so everything stitches together nicely. But I think you're onto something when you start using these for two-dimensional functions. How does 2D work in this form? This becomes x and this becomes y, so you have x bits and y bits; it's not like a PEPS, you just have two indices on every tensor, and in 3D you have a third index on every tensor, the z bits. Then it can become more of a problem: if the function is smooth it still works great, but if it has sharp discontinuities, and they form a curve, like a circle or something, then the bond dimension can explode. That might have something to do with this Hamming distance issue, or not; I don't know that it's super well understood why it explodes in that case. Okay, thanks. Because there is a Gray code thing, but it only cures the nearest neighbors in 1D. There's only the what, sorry? Okay, maybe it's a little bit technical, but basically you can make sure that two consecutive points on a one-dimensional grid have Hamming distance one. Oh, okay, but it doesn't work if you take... I'd really like to learn more about this, because I think the people trying this out right now are only using base two because it connects to quantum information and quantum computing ideas. For example, it turns out that in this form the Fourier transform is just the quantum Fourier transform, and it's actually also an MPO, so there are some neat connections. But maybe we're leaving a lot on the table by choosing binary; maybe we should be using something different. So I'd like to hear more about that idea. Questions?
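As a footnote to that 2D discussion, here is a minimal sketch, with illustrative conventions rather than anyone's package, of storing f(x, y) on a 2^n by 2^n grid so that each scale carries one x bit and one y bit, which can then be fused into a single dimension-4 index per tensor site.

```julia
n = 6
N = 2^n
grid = (0:N-1) ./ N
vals = [exp(-((x - 0.5)^2 + (y - 0.5)^2) / 0.02) for x in grid, y in grid]

# Separate the x and y binary digits, then interleave them scale by scale.
# Column-major reshape puts the finest (least significant) bits first.
T = reshape(vals, ntuple(_ -> 2, 2n))       # dims 1..n are x bits, n+1..2n are y bits
perm = [isodd(k) ? (k + 1) ÷ 2 : n + k ÷ 2 for k in 1:2n]
Tzip = permutedims(T, perm)                 # index order (x1, y1, x2, y2, ...)

Tfused = reshape(Tzip, ntuple(_ -> 4, n))   # one dimension-4 index per scale
```

For a smooth function like this Gaussian bump, the unfoldings of Tzip stay low rank; a sharp feature along a curve is the case described above where the ranks blow up.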
So maybe I can ask one question. You made this connection with the kernel method, with this high dimensional representation, so I was wondering: can you use this high dimensional representation to tell what the relevant features of the system are, what the relevant products of features of the data set are? That's a good question. That's exactly where I think tensor networks could bring a lot more interpretability. However, sometimes when I hear the word interpretability, I think there are a lot of competing ideas out there of what interpretability might mean, or what the best thing we'd want would be, and I feel like there's still a lot of room for creativity about that. One idea off the top of my head: I mentioned that you can do perfect sampling of a tensor network, and the weights here, in this form, are an MPS on top, so I could perfect-sample them. If I wanted to know which weights are big, I could just start collecting samples from W and rank them, because I can also compute the weight when I get the sample. So I could make a plot to see how quickly they decay, how high order the big ones are, that kind of thing. That would be one thing that would be easy to do, but I don't know if it's the best way to analyze it. I think the possibilities for analysis are huge, but I don't think we know all the best ways right now. In one of the talks tomorrow I'll cover this thing called tensor cross interpolation. That one is really neat, because it's actually a completely new gauge for MPS that people didn't appreciate before in the physics community. It's not based on the SVD; it's based on another type of matrix factorization. What it does is tell you that, out of all the 2^n weights in the MPS, you only need a very small number of them, something like chi squared or so, and those tell you everything; the rest are interpolated. You keep a table of all those important ones, so you have a running memory: these are the important weights I have to store, and all the other ones are just some linear combination of those. So you really do have the explicit ones in your hands, these explicit W's. Just to follow up on the interpretability and explainability issue: have tensor networks been used in a Bayesian setting, where distinguishing between aleatoric and epistemic uncertainty becomes more important?
I don't think they have; I think it's something really worth looking into. I don't think I can comment on the Bayesian setting very well in terms of those concepts; I just know the basic ideas about priors and some of this. But one little side comment, probably not exactly answering your question, about something I would be really interested to try. I know a big activity in some of those fields is estimating the posterior using sampling methods; it's a really big activity, with whole software packages like Stan for doing it. One theme that's been happening a lot in the corner of condensed matter that I sit in, a very popular part of CCQ, is the idea that maybe wherever we see Monte Carlo being used, we could use tensor networks instead, and this has been getting used a lot more. There was a whole chain of thought that recently culminated, where some researchers started out by sampling Feynman diagrams with high dimensional Monte Carlo. Then they said, let's sample groups of Feynman diagrams. Then, let's not sample them at all, let's use these quasi-random methods, I forget the name, where you do the Monte Carlo with deterministic points. And then they said, why are we doing sampling at all? We're just doing sums, after all; we should use a tensor network. So they used this algorithm that I'll show you tomorrow to just machine-learn all the Feynman diagrams up to order 30. And now the Monte Carlo is gone: you do this machine learning, and you get all the Feynman diagrams up to order 30, this huge factorial number, to very high precision. It's kind of crazy. So I'm very curious whether this could be brought into that area where people are doing all this posterior sampling: could we do the same tricks over there and not have to do Monte Carlo? I think it could happen, possibly in a big way. All right, I think now it's time to close, so let's thank Miles again.