 to this special seminar. Today it's a great pleasure for us to have Lorenzo Rosasco as our guest. Lorenzo is from the University of Genoa, and he and his colleagues lead a very lively machine learning center called MaLGa. Lorenzo is also a wonderful teacher, so I guess you will all be able to enjoy this lecture, which is about learning functions, operators, and dynamical systems. If you have questions, please shout them and I will hand you the microphone. Thanks for setting the expectations low. So I'm going to tell you a bit about these three things. This is the latest bit of work that we did; it's also new for me. The point of view is that we are going to reduce learning dynamical systems to learning operators, linear operators in fact, and a lot of the tools we are going to use are basically the same ones you use when you learn functions. So the plan is to start from the basic tools and develop them throughout. As you see from the subtitle, we are going to use kernels. Kernel methods are closely related to their siblings, Gaussian processes. There will be no neural networks inside, which is unusual for a machine learning talk. This is not only my work; it's really the work of a lot of people, and I'm going to mention them throughout. Just in case you don't know what kernel methods are, or you hate them for some reason, let me tell you why I like them; I think there are at least two good reasons. There is an old one: they used to be the state-of-the-art algorithm. They're not anymore for everything, but there are subclasses of problems where they still shine: when your data is not super high-dimensional, what I call high but not too high dimensional, say 5 to 10, maybe 20 dimensions, or when you can engineer good features. That is typically the setting where these methods are still very useful, and whenever you compare them to neural networks there, they usually work about the same, but they're faster and they come with better guarantees. So they're still useful, not for everything, but for a subset of problems. The other reason is that, funnily enough, if you open a deep learning theory paper, half of the time it is about kernels, because neural networks are just hard to study, and the closest thing to a neural network is an infinite-dimensional linear model, which turns out to be exactly what kernels are. For this couple of reasons, I think this is still a topic worth investing in. One thing we felt, having worked on this for a while, is that some kind of modernization effort is needed. The narrative needs to change a little, and the tools need to be upgraded a little, because the data changed between, say, 2002 and 2022. So some help was needed to make this thing more modern, notably to make it run on GPUs. There is a running joke that, with these neural network libraries, it's much easier to run a transformer than to do an SVD on a million points, despite the fact that the SVD is a pretty basic operation, simply because the software libraries exist and are very easy to download. By "run a transformer" I just mean run the code without any help, not make it work well, okay? Anyway, this tells you where I'm coming from.
And before actually dealing with the kind of problem I want to discuss, I want to give you a cartoon of machine learning, just in case your background is not doing machine learning every day. What is machine learning? This is the one-slide cartoon version. You have inputs and outputs and you want to find a relationship between them. And this description is good enough that you can call it machine learning, physics, statistics, approximation theory, whatever you want. So how do you solve this problem, in any of these fields? The first thing you do is decide on some parametric version of your function. Functions are hard to compute with if you treat them as infinite-dimensional objects, so you typically parametrize them somehow. The classic way of building this parametrization is to put a lot of thought into it, for example doing some physics reasoning, and try to reduce the number of parameters to one or two or three. There is the von Neumann quote about fitting an elephant and making it wiggle its trunk: for a physicist, even a handful of parameters is already a lot, and you shouldn't do more. But in machine learning we like p to be a gazillion, and we also don't spend a lot of time thinking about how to build this parametrization. The one I like is this: a linear combination where the phi_j are your favorite basis functions, Fourier, wavelets, whatever; and maybe phi_j is itself parametrized, and then you call it a neural network. The point is not so much what I just wrote; the point is that p is huge. This is unusual compared to classical modeling. And then the next step is the obvious one: you have a parametrization, you try to find the parameters. How do you do it? You use the data, and this is just least squares; you can use whatever loss you want. Then you have to solve this problem, so typically you have to do some numerical work. But here is the catch: can you guess what the value of this training error is going to be? Zero. You can fit whatever you want. You have basically violated the rule of keeping your model simple, so you can always fit the past. So the next step is really the key point: you need to use some of the data to test how good you are, because if you don't, you will only predict the past, and with a model this large you can always predict the past. This paradigm of the train-test split, using some of the data to get the model and some of the data to test the model, is really the protocol you have to follow in machine learning. And as simple as it is, this is the difference between data-driven and non-data-driven modeling: in the latter, you make the representation step so stringent that if you're able to fit the data at all you're already happy, and you skip the testing step; here, the representation is so dumb and so large that fitting is trivial, and everything is in the testing. Does it make sense? So I call this representation, optimization, generalization. You'll see these three keywords popping up again and again, if my computer doesn't turn off too often. This is a problem, did you see it? It wants to go into low-power mode; it's going to happen again, okay? So that was the cartoon.
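To make the cartoon concrete, here is a minimal sketch of the three steps, representation, optimization, generalization, in code. Everything in it (the random Fourier-type features, the one-dimensional toy target, the 50/50 split) is an illustrative assumption, not anything from the talk:

```python
# Cartoon machine learning: parametrize f(x) = sum_j w_j * phi_j(x), fit the
# coefficients by least squares on a training split, judge only on a held-out split.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: noisy observations of an unknown 1-D function.
n = 200
x = rng.uniform(0, 1, size=n)
y = np.sin(6 * x) + 0.1 * rng.standard_normal(n)

# Representation: p random Fourier-type features phi_j(x) = cos(w_j x + b_j).
p = 50
w = rng.normal(0, 10, size=p)
b = rng.uniform(0, 2 * np.pi, size=p)
def features(x):
    return np.cos(np.outer(x, w) + b)          # shape (len(x), p)

# The train/test split: the key protocol of the slide.
idx = rng.permutation(n)
train, test = idx[: n // 2], idx[n // 2:]

# Optimization: plain least squares on the training half.
coef, *_ = np.linalg.lstsq(features(x[train]), y[train], rcond=None)

# Generalization: evaluate on the half the fit never saw.
train_err = np.mean((features(x[train]) @ coef - y[train]) ** 2)
test_err = np.mean((features(x[test]) @ coef - y[test]) ** 2)
print(f"train MSE: {train_err:.3f}   test MSE: {test_err:.3f}")
```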
Real life is hard, okay? So the kind of function class, the representation, I'm going to use is this one here, which has the advantage of being very general, but also abstract. Instead of simple linear functions, I'm going to consider so-called reproducing kernel Hilbert spaces. What are they? They are infinite-dimensional function spaces. They are Hilbert spaces, but they contain a special function K, a function of two variables, such that when you fix one of them and take the inner product with f, you evaluate the function. Pretty abstract. But intuitively, what does this buy you? That evaluating a function becomes a linear operation, so you can treat these functions pretty much like you treat linear functions. This is not true for all function spaces; notably L2 doesn't have this property. But these spaces have, basically by axiom, the property that you can linearize function evaluation. What are examples? If you're familiar with them, this is trivial; if you've never seen it, it's a bit brutal, but I'm pointing out what matters, and I'll give you two examples. One is linear functions: in this case the kernel is just the inner product. The second is the so-called Sobolev spaces. Does anybody know what those are? Instead of taking any function in L2, you take functions in L2 with a given polynomial decay s in Fourier, so they're smooth. How smooth? If you don't like Fourier: you take derivatives and require them to be square integrable; how many derivatives, s. In the particular case where you take s equal to d, or d over 2, I may be misremembering, you get this kernel, the exponential kernel, which is a very simple kernel. So what you see is that in both cases there's a very tangible object, K, attached to this weird abstract space; you can really write it down. So the first step, representation, in this talk is going to be this choice, which has the advantage of being an infinite-dimensional choice. There is some restriction, but it's not very restrictive. The second step was optimization. What we're going to do is just least squares plus a penalty that makes the problem well posed, so you have a unique minimizer. And then the question is: how do you compute the minimizer, since you're working in an infinite-dimensional space? That sounds hard. But the result, which is not exactly new, says that the minimizer always has a specific form, the one you see here: it's just a linear combination of kernels, for example the exponential one, the one you chose, centered at the training points, and the coefficients are the solution of a linear system defined by a matrix built from the kernel. So it's a relatively simple problem, just a big linear system. And the proof is one or two lines; it's a very simple result, basically just least squares going back to Gauss, nothing fancy. One problem with this, at least in its basic form, is that you have to form this matrix, store it, and then invert it. That's, worst case, basically n-cubed time, where n is the size of the data, and more importantly n-squared memory.
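A minimal sketch of the exact solver just described, kernel ridge regression: build the n-by-n kernel matrix, solve the regularized linear system for the coefficients, and predict with a combination of kernels centered at the training points. The Gaussian-type kernel, its width, and the toy data are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2))
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def krr_fit(X, y, lam, sigma=1.0):
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)                       # O(n^2) memory
    return np.linalg.solve(K + lam * n * np.eye(n), y)     # O(n^3) time

def krr_predict(X_train, alpha, X_new, sigma=1.0):
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

# Tiny usage example on synthetic data, with lambda ~ 1/sqrt(n) as in the talk.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = np.sin(3 * X[:, 0]) * X[:, 1] + 0.05 * rng.standard_normal(300)
alpha = krr_fit(X, y, lam=1.0 / np.sqrt(300))
print(krr_predict(X, alpha, X[:5]))
```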
So if you think about it, give or take, with 50,000 points this starts not to fit in your laptop, and if you take more it just doesn't fit. Maybe you can go to 200,000 or 300,000 points if you spend a lot of money on memory, but at some point it just doesn't fit. If you have a million points, this problem you just don't solve. So the issue is that in the modern era, where you start to have millions of points, it becomes tricky to solve this problem exactly. One reason why people like these methods is that you can give very... yes? You just said that we cannot solve this problem, which is order n-cubed in complexity; since this is a linear problem with just a lot of parameters, aren't GPUs well suited for precisely this kind of problem? Yes and no, because the point is that if you have a dense linear system which is a million by a million, you can attack it with an iterative solver, but in its basic form it is so big that you cannot fit it in memory. So what you can do is take this huge square matrix, read a slice of it, do a bunch of multiplications on the GPU, maybe parallelize the operation, and keep going, but it's just freakishly huge. Even if you do that, at every step you have to synchronize all the processors. So I don't mean that you cannot do it; by parallelizing you can always run an iterative solver like gradient descent; it's that the problem becomes ginormous and very, very inefficient. One thing which is nice about these methods is that you can give very precise guarantees on performance. In my cartoon version of machine learning I used half of the data to test my model, but in theory what you do is assume you have infinite data to test your model, and you try to figure out how good you are on that infinite data. This capital L is exactly that: it's like an infinite test set. And before you read what's written here, this theorem says how good the test error of my method is compared to the best possible test error. This is the best possible test error, if you could be God; this is the test error of your algorithm; and the theorem says that if you choose your magic parameter lambda to be one over square root of n, then you get this one over square root of n bound. How good is this bound? Under these assumptions, it is as good as it gets: there is a matching lower bound. It's a simple bound, not a fancy one; you can put stronger assumptions and get faster rates, or weaker assumptions and get slower rates, but this is going to be my cartoon result, to give you something to think about. It tells you that you can give precise guarantees on how well your algorithm is going to perform, and you can also figure out what this depends on. Are we good? Has anybody seen this kind of guarantee before? Yeah? The question is, what is n? n is the number of points in the training set, and the idea is that the more points you have, the better this gets, and the less you have to enforce this extra constraint. Okay? Other questions? Yes? Lastly, can I replace this thing with something else? The last what?
The first question is a bit complicated: how to prove that the exponential kernel has this property. I'm not going to prove it, I'm just telling you: believe it. Suck it up is the unfriendly version. It's a fact; you can prove it as an exercise, and I'm happy to show you how. It's just an example. The second question is the one that is relevant here, because you want to know how to use it. The point is that, intuitively, because there is this correspondence between this weird space and the kernel, and the exponential kernel has the property, you should use it somewhere, and I'm showing you here how. Suppose you pick the exponential kernel. I told you before that this is like using a space of functions where evaluation is linear and the functions have so many derivatives. But how do you use it? Like this: you take the inputs, you build the matrix, which is now e to the minus blah blah, you invert the matrix, you get a vector, you take this vector, you put it in front of these kernels, again the ones you chose, you combine them, and this is a function. Does that make sense? Does it? You wish? Close enough, right? But that's exactly the idea: you wish that this is a good approximation of whatever function generated your y's. Does that make sense? So again, this is abstract, but this is how you use it; it's two lines of code, if you wish. You specify the data, the kernel, lambda, and you get the function approximation. And the next question is: is this the function I wanted to approximate? One over square root of n away. Is this good? Pretty good; unbeatable, in fact. So this is not a bad algorithm for this. Yeah, stop me so that I can keep the right pace. Other questions? So here, x and y are one-dimensional? No, y is one-dimensional for now, and x is whatever you want; in my examples it's going to be a vector in R^d. What about y being multi-dimensional? You have to wait for the operators. Basically, the operator part is going to be about that: it could be vectors, matrices, but that's coming in a minute. For now the output is a number. We're good? One point I want to address is the fact that this big linear system we just talked about is big. As we were saying, you can swap things in and out and do GPU stuff, but it's still a big beast. So the question is, can we compress it? Can we compress this solver somehow? And I'm going to tell you about one way to compress it. The idea is: instead of working with the whole space, I'm just going to consider kernels at a subset of the points. The result we just discussed says I have a linear combination of kernels; how many, as many as the training points. What if you don't use them all, but just a subset of size m? Note, I'm not throwing away the data; I just use fewer points to write my solution. So I replace the full data set with a smaller set to build my functions. My problem looks the same as before, but now I just have a smaller space. I still use all the data, though; if you use less data, then things get tricky, but here you use all the data within a smaller space. This is an idea that has a bunch of different names, the Nyström method; these points are called centers or inducing points.
It's very much related to what people do in numerical analysis when they consider a discretization of infinite-dimensional objects, like Galerkin methods. But the key point here is that this is the idea we want to use to compress the algorithm, and we want to see the consequences of making this choice. Basically, what happens is that, as you'd expect, your function is a linear combination like before, but instead of n points you just have m; and which ones? The ones you chose. What about the coefficients? The coefficients are now given by a new linear system. Again, you have to do a bit of calculation to see this, but how does this linear system look? It looks like this. Remember that before I had this huge matrix; what I do now is pick a certain number of columns at random (here I drew the first few because I'm very bad at slides). You pick a bunch, you put them together, and you get a slice. What is its size? All the data on one side, and just m on the other, the number of centers you picked. This is your compressed linear system, and this is just a regularized version of it. Does it make sense? As you can imagine, you don't have to store the full beast anymore, and the key cost is going to be the inversion of a small m-by-m matrix; the rest is matrix multiplication. So you get a cubic dependence on m, but only a quadratic one on n; you go from the cubic dependence to something cheaper. Yes? The question is, let me give a simplified version of what you said and you tell me if that captures it: what is the effect of m? From a computational point of view, you save; that much is clear. But what about the statistical point of view? What is the effect of this m in terms of test accuracy; how much are you going to lose? Because that's what you expect: you made the problem simpler computationally, so you must pay a price, right? And this theorem is an upgraded version of the one I showed before, where you now see both lambda and m, and it says: no, you don't pay, as long as m is at least of order square root of n. If you compress more than that, you start to pay; but in this regime, m of order square root of n, you don't pay. So you have a lossless compression: you can compress computation without losing accuracy. And what is the intuition? That you don't really need all the data, because you have some kind of noise at the one over square root of n scale; as long as you keep resolution at the one over square root of n scale, you can compress without any loss. Again, take this result with a grain of salt; it's a bit of a cartoon. All these square roots of n are just byproducts of my assumptions. If I made the assumptions weaker or stronger, these square roots of n would be replaced by other power laws, but the relationship between lambda, m, and the error would stay the same. So yes: this is the worst case under this assumption. As I said, the bound can be much better or much worse, and the connection can also change depending on the assumptions. So this is not the full theory; it's an example of the theory for particular parameter values, and you can do the same for other parameter values. So you say, I have correlation in my data.
My function is much simpler; my function is much more complicated: how does it change? And you will get corollaries that give you all these different regimes. That's a good point, exactly related to what I was trying to say. The algorithm picks the centers uniformly at random, and you know that this is enough because of the basic result I showed you two slides ago: you say, well, I'm going to replace this guy with some other guy, and I'm going to take m of them. And how do you pick them? For this theorem to hold: uniformly at random from the original set. So you have a bunch of points and you pick m of them uniformly at random. Can you do something smarter? Absolutely, yes, but not that much smarter. So the questions were: how did I pick these tilde points for the theorem? Uniformly at random. Can I do better than that? In theory yes; the theory tells you there are gains to be had and you can do much better. In practice it's a bit trickier, especially in high dimension. In dimension less than five you can do a lot of stuff, basically use all kinds of quadrature ideas, and that's also why I wanted to keep it general. Proving results and seeing how large the improvement is in higher dimension is trickier, but you can definitely do it, and I'm happy to talk about it offline. Other questions? Yeah, kids first. That's a good point. The question is: is this equivalent to saying that I can find a rank square-root-of-n approximation of my kernel matrix? Yes and no, because we are not aiming at the matrix itself. Let me give a simplified version to catch your point: you have a vector of y's, and you want to project it onto the span of this matrix, and you want that projection to be captured well. So the original matrix could even be high rank, but my projection could be well spanned by the first square root of n columns. That's what's going on here: I'm asking for less, which is exactly what I want, approximating the y's, not the K. Are the results asymptotic? They're not, in the sense that there is an n_0 such that for n bigger than n_0 the result holds, and all the constants can be made explicit. So they're not for arbitrarily small n; you have a threshold, but it's not just "n large enough", you can quantify exactly what's going on. I don't remember off-hand the value of n_0 in the more general result; I'd have to think about it for a second. The constants appear in front everywhere, and they do change, but in a dimension-independent way. So, if I understand, let me redo this question: you have these results, and you have the same results without this guy, and I'm just showing you this one; what about the constants? The constants are everywhere here, and they will not match, but they will be dimension independent. So I'm just giving you the leading order in terms of rates in n, making sure that the dimension doesn't enter the picture; the constants in principle might change. In these kinds of nonparametric regimes you don't even know the optimal constants from the lower bounds, so you just try to check that you don't get anything too crazy. All right, so this is almost useful, okay?
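Here is a minimal sketch of the Nyström-compressed solver discussed above: all n points stay in the objective, but the solution is expanded only over m centers drawn uniformly at random, so only an m-by-m matrix is ever inverted. The choice m of order square root of n mirrors the regime in the theorem; the kernel, the toy data, and the small jitter for numerical stability are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def nystrom_krr_fit(X, y, lam, m, sigma=1.0, seed=0):
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(n, size=m, replace=False)]   # uniform sampling of centers
    Knm = gaussian_kernel(X, centers, sigma)             # n x m slice of the big matrix
    Kmm = gaussian_kernel(centers, centers, sigma)        # small m x m block
    # (Knm^T Knm + lam * n * Kmm) alpha = Knm^T y  --  only an m x m system to solve.
    A = Knm.T @ Knm + lam * n * Kmm
    alpha = np.linalg.solve(A + 1e-10 * np.eye(m), Knm.T @ y)
    return centers, alpha

def nystrom_krr_predict(centers, alpha, X_new, sigma=1.0):
    return gaussian_kernel(X_new, centers, sigma) @ alpha

# Usage: n points, m ~ sqrt(n) centers.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(2000, 3))
y = np.sin(X @ np.array([2.0, -1.0, 0.5])) + 0.05 * rng.standard_normal(2000)
centers, alpha = nystrom_krr_fit(X, y, lam=1.0 / np.sqrt(2000), m=int(np.sqrt(2000)))
print(nystrom_krr_predict(centers, alpha, X[:5]))
```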
It turns out that anyone who does numerics knows that this is not the way you want to solve the linear system: you never want to actually invert a matrix. What you really want is some kind of iterative technique, because you can use GPUs, because you can really parallelize, because you can make it faster in a gazillion ways. And you typically don't want plain gradient descent; you want stochastic gradient, conjugate gradient, all the fast ways to solve a linear system. This is roughly what you would get if you just applied the vanilla version as is. As it turns out, the convergence of this kind of method depends heavily on the condition number of your matrix, the ratio between the largest and the smallest eigenvalue of your system; and here I just remind you of the linear system we're trying to solve. It is well known, if you open a numerical linear algebra book, that what you can do is preconditioning, which is basically changing your problem slightly to make the condition number closer to one. The ideal preconditioner requires a transformation of your data built by inverting this matrix, but you don't want to do that, because doing that is like solving the original problem. So what we came up with is the idea of using compression again. Instead of inverting this matrix, we invert this other matrix: if you compare the two, you see that this part stays the same, but here, instead of the rectangular block times itself, we simply take the square of the small square block. The ideal preconditioner would take this times this; what we do is take this guy against itself. Of course, this is not exact anymore; it's compression again. So we do compression twice: first to go from the big problem to the small one, and then to go from this to this. And this is the actual algorithm we use. I'm going to skip the details, but we can also quantify the exact requirements on the size of everything in order to keep the same one over square root of n bound. So we can tell you exactly how many iterations you need, basically log n, and what the size of the matrices should be; all the algorithmic quantities can be given values that achieve a given accuracy. So we can track the statistical effect of the computational choices. We actually spent a bit of time trying to implement this decently on GPUs, trying to be smart about all the operations, checking which ones need extra care, which ones can be kept in core and which need to go out of core. So we ventured into this HPC stuff that we don't really know and did our best to produce something useful, and we came up with a library called Falkon. Here I just want to give you a feeling for what you get. Sometimes I call these data sets fake real data: they're real data, but you don't know what they are, so it could be anything. What really matters is that they're not cooked up by you, you have no control over how they're generated, and they're big. What you see here is that the sizes go from ten to the six, a million, up to ten to the nine points, so these are very large data sets. And what you see is a comparison of our approach against the other approaches available, of which there are two or three; there are not a lot of people trying to make this modernization effort.
And I think we do a bit better here, a bit worse there. But what you see is that this modern way of doing things can handle a million points in minutes, and if you go to a billion points, you can do it in hours, using these projection and compression techniques. So now you have techniques that allow you to use kernel methods on billions of points in an hour, and this means you might as well try them, because it's relatively simple, you have a nice theory, you can study and really make sense of all the things I just mentioned, it's all rigorously proved and based on principled stuff, and you get a toy you can play with. At the beginning of the talk you said that this kind of approach works for problems with dimensionality 10 or 20; do these data sets have that kind of dimensionality? So the dimensionality here is not n, it's the dimensionality of the data, and in all these data sets the input dimension, each x, is typically between 10 and 30. So even if you just take the plain exponential kernel, it still does something reasonable, and if you do a bit of feature engineering you get a bit of a boost, and you get these kinds of results. What I meant is that things are different when you go to higher dimensions: say images, which are, I don't know, 500- or 1,000-dimensional. If you just use the exponential kernel on pixels, you're crazy, because the Euclidean distance between two images doesn't mean anything at all. Here, the Euclidean distance between two of these taxi inputs means something, and that's what I was going to use. That's what I meant. The other example I want to show you is this one, which is real data for real, from high-energy physics, and maybe I'll take a minute to discuss it. The idea is this, and it's not our idea: you have a model, the Standard Model, that describes the data really well, and you can use it to generate data. Then you have real data, and you ask: does the Standard Model always describe the real data perfectly, or are there discrepancies? That's the game. More generally, you have a simulator and you want to know whether it describes the data well everywhere or not. How can you phrase this? It's kind of a testing problem. What do you do? You take the real data, you generate data with your simulator, with your model, and you set up a classification problem. The result is not going to be 90% accuracy; it's going to be almost chance, because the simulator is really good. So what do they suggest? Repeat this with the simulator played against itself a thousand times, because you know that that really is chance, and then see whether the differences you get for real data versus simulator are comparable to simulator versus itself, and you run a statistical test. It's not up to me to discuss whether this is a good idea or not, but it's an idea that has been proposed for model-independent discovery in high-energy physics. From a pure machine learning point of view, you understand that you have to train a model many times, because you have to do simulator against itself a whole bunch of times, plus the one against the real data.
And it turns out they had a neural-network-based model that, to reach the required precision, would run for more than a hundred hours, and with the techniques I just showed you it runs in 22 seconds. So literally, we are now trying to move this into the data-analysis pipeline at CERN, because this really made the difference: the original solution was basically unrunnable, and this one you can use to test real stuff. Okay, this is the end of part one. I should ask how much time I have, all right? Just to see how far I can go. Okay, anyway, these are really the main coordinates; let me summarize them a bit. My cartoon was: when you have a machine learning problem and you want to do data-driven stuff, you have to choose a model, optimize the model, and then check the model. For us, the model was a reproducing kernel Hilbert space with this kernel stuff; optimizing was solving linear systems efficiently with compression; and checking the properties was done in theory with guarantees and in practice with a set of basic experiments. If you want a take-home message here, it is that kernel methods can run on millions and millions of points in minutes or hours. One question is: this is just a regression problem, how far can you push it? And we've been pushing it quite a bit. This is a list of the projects we've been doing: you can take loss functions other than the square loss; you can use these ideas to do quadrature, using the kind of importance sampling we were discussing before; and you can move to other problems, like bandit problems, or unsupervised problems like k-means. Now I want to talk a bit about operator learning; these are the people who did this kind of work. Somebody asked before: what happens if the outputs are not real numbers but something else, for example vectors, or even more generally functions? So here, for example, this guy is a function and this one is also a function; they are elements of a Hilbert space. Why would you want to do this? Because you want to model things in a continuous fashion, because a lot of things come naturally that way before you discretize. Typically you have PDEs or integral operators, and sometimes you want the forward problem, solving the equation; other times you want to invert it, the inverse problem; and in other cases you have structural equations, differential operators and whatnot, that constrain how the solution should behave. In many of these cases, the object of interest is not a real-valued function, but a map from a function to another function: an operator. How much of what we just said can be recycled? Again, as before, this is the one annoying slide, because it's going to be even more abstract than the one before. This part is easy: instead of the real numbers, you have functions that go from X to Y, where Y is now a Hilbert space, as you'd expect. The really weird object is the kernel. The kernel before took two inputs and mapped them to a number; now it maps two inputs to an operator from Y to Y. So when you write K of x and x prime, let's make examples: when Y is equal to R, this is a number.
When Y is equal to R^T, and I call the kernel Gamma to avoid confusion, every value of the kernel is a T-by-T matrix. If Y is a genuine, possibly infinite-dimensional Hilbert space, then this guy is a linear operator on Y, which is what this notation stands for. So you can view such a function as a collection of real-valued functions; how many? The question is: what is the difference between something like this and simply putting a kernel on each coordinate? This is more general, because the coordinates can be combined; I'm going to give you an example in a minute where I do basically what you were saying. Before the example, to see the difference, let's see how the definition goes. It's a bit abstract, and it basically says that with this kernel you have a reproducing formula, but in a weak form: it holds whenever you test against a vector, essentially for technical reasons. But roughly speaking, you have exactly the same definition; it's more technical, but it's really the same. What are examples? Here we go a bit in the direction you were suggesting. You can say: I'm going to take as my kernel the inner product of the inputs times the identity. This is a little bit like taking one kernel per coordinate; in the infinite-dimensional setting, it's infinitely many kernels. If you do that, the functions you are looking at are linear operators, and the norm is what's called the Hilbert-Schmidt norm, the Frobenius norm for matrices; so you take linear models with bounded Frobenius norm. If, instead of the inner product, you put a kernel on the input space, you get the same kind of thing, which is basically what you were saying; in the finite-dimensional case, this is one vector per output dimension. That's the easy case. If you take the second factor to be something other than the identity, you start to mix the coordinates up. Anyway, this is not super important, it's somewhat technical; what you should see is that a very similar structure holds, up to this abstraction of the kernel. Mathematically, most things go through in a very similar way. You have a representer theorem very similar to the one I showed you before, although in general a discretization will be needed here because the objects can be infinite dimensional. And you can get basically the same theorem I showed you before: you now have to use the appropriate norm, and the rest is pretty much the same; the condition is now on the operator-valued kernel, and in this simple result we just assume its Hilbert-Schmidt norm is finite. This is really an interlude, and as you see I don't want to spend too many of your neurons on it. The message is just that the scalar case translates over nicely, particularly simply if you want to deal with linear operators, and both the theory and the computations go through pretty smoothly. The challenge, really, is to come up with good kernels: the exponential is not even a particularly smart choice in the scalar case, and here we lack good operator-valued kernels.
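Before moving on to dynamical systems, here is a minimal sketch of the separable case just mentioned, K(x, x') equal to k(x, x') times the identity: each (discretized) output coordinate is handled by the same scalar linear system, so vector- or function-valued outputs cost one solve with many right-hand sides. The data, the kernel, and the output grid are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def vv_krr_fit(X, Y, lam, sigma=1.0):
    # Y has shape (n, T): each row is a vector-valued (discretized function) output.
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), Y)      # coefficient matrix, n x T

def vv_krr_predict(X_train, coeffs, X_new, sigma=1.0):
    return gaussian_kernel(X_new, X_train, sigma) @ coeffs   # (n_new, T)

# Usage: each output is a coarse discretization of a function on [0, 1].
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
grid = np.linspace(0, 1, 30)
Y = np.sin(2 * np.pi * grid[None, :] * (1 + X[:, :1])) + 0.05 * rng.standard_normal((200, 30))
C = vv_krr_fit(X, Y, lam=1e-3)
print(vv_krr_predict(X, C, X[:2]).shape)   # (2, 30)
```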
Now let me tell you a bit about how to use this to learn dynamical systems. The first part hopefully was a bit clear; the second part was fast, but the message to take away is that whatever you do with functions, all the mumbo-jumbo, you can do for operators. We want to reduce the problem of learning a dynamical system to learning an operator. So let's see what a dynamical system is, first, in a simplified version. I have an iterative map which doesn't change over time: it takes me from step k to step k plus one. So you have the pairs, step k and step k plus one, and you want to do one-step-ahead prediction. This problem has been studied in a bunch of different settings. We give up the idea that the data don't change over time: I have a dynamical system that I want to identify, and I want to predict it, analyze it, maybe control it. And the issue is that in general you have nonlinear systems, which are notably hard to do much with. The easy case is when this is just a linear map; here F is nonlinear, though. There are many, many techniques, and I would actually need at least one more slide to list all the techniques you could use to solve this problem; but for nonlinear systems, a good balance between having practical techniques and having guarantees is not so easy. So I'm going to tell you about one particular approach, and it's going to work not for all nonlinear systems, but for those satisfying a bunch of nice assumptions. First of all, I'm going to talk not necessarily about an iterative map, but about a process, which in this case is a discrete-time process: X_t is a random variable, where t is just a natural number. There are a few properties I assume; let me cover them. First property, not exactly a surprise: the Markov property. I only use the last step to predict the next one. Then we assume time homogeneity, so that everything is described by a transition kernel, which gives me the conditional probability, if I'm in state x, of being in a given state at time t plus one. We assume there exists an invariant distribution in the limit: your trajectory will end up in some region described by some probability distribution. And finally, I assume my process is time reversible; this is needed for one property I'll show you in a minute. You can drop it, but things become more complicated. An example would be a stochastic dynamical system: a nonlinear map plus noise, where for example the noise is zero mean. Now, since I have a board and don't have to use my hands, let me do a drawing. You have a dynamical system, say a stochastic one, and this is just one trajectory; the transition from one point to another is not described by a linear map. How can you linearize it? Well, there are standard ways to linearize, for example a Taylor expansion or whatnot. But one idea is to say: what if, instead of looking at the state of the nonlinear system, I look at some function of it, an observable g, say the energy? Then I go to this new space and I look at the behavior not of x, but of g(x). Maybe you can hope that, if g is nonlinear and good enough, the behavior of g(x) will be more linear.
The other thing you can hope is that, if instead of one g I take many g's, at some point this becomes equivalent to the original system, because somehow I get enough views of my system to characterize it. Does it make sense? So instead of observing the nonlinear system in its natural state space, you map it into a space of observables. And this is basically the definition of the so-called Koopman operator. The idea, before I hit you with the equation, is: let's take infinitely many of these observables, and then basically by definition we get linear dynamics. That's what's written here. It's a map, defined as follows: when the Koopman operator acts on g and you evaluate the result at x, this is defined as the expectation of g one step later, given that at time t you are at x. You need the expectation because it is a random variable. It acts on observables, on any observable; any observable where? In the space of square integrable functions with respect to the invariant measure. So this is the operator; it's not really a matrix, I just draw it this way to make it more suggestive: it's the operator that tells you how all these observables evolve in this L2 space. So instead of the state space X, we now work in L2 of pi. This is an old idea, but it has become very popular in the last few years as a way to study nonlinear systems, because at the price of embedding your state space into an infinite-dimensional space, you basically find a linear representation of a nonlinear system. So you can now hope to use many of the ideas that are available for linear systems. One thing you can do, for example, is this: if you can find an eigendecomposition, a spectral theorem for your operator, you can decompose its action, and then you can separate the time component from the space component. So the fact that you can use the same left and right eigenvectors is a reflection of reversibility here? Yes, self-adjointness as a consequence of reversibility. I promised you that time reversibility was there for a technical reason, and that's because I want the operator to be self-adjoint, so that this is particularly simple; otherwise you have to deal with operators that are not self-adjoint, and it gets more complicated. So the point is that if you do this, it's kind of a Fourier analysis of your dynamical system, where you look at its behavior over some basis functions. And if you truncate and just look at a few modes, one, two, three, you can visualize: it's like doing a nonlinear embedding into one or two dimensions where you can visualize an approximately linear dynamic. So more than just predicting, oftentimes people are after this mode decomposition in order to visualize and interpret the result. These ideas have appeared under different names: initially the observables were just linear functions, the coordinates, and people called it dynamic mode decomposition; then people started to use richer basis functions, and it was called extended dynamic mode decomposition; then people realized this is all a special use of Koopman operator theory, an idea that goes back to the 1930s. And what's going to happen in a minute, surprise surprise, is that we make it into kernels, okay?
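Before the kernel version in the next part, here is a minimal sketch of extended DMD with a hand-picked finite dictionary, as just described: regress the dictionary evaluated at x_{t+1} onto the dictionary at x_t, and read approximate Koopman eigenvalues and eigenfunctions off the resulting small matrix. The toy linear-plus-noise system and the polynomial dictionary are illustrative assumptions:

```python
import numpy as np

def dictionary(x):
    # Simple polynomial dictionary psi(x) = (1, x, x^2, x^3) for a scalar state.
    return np.stack([np.ones_like(x), x, x**2, x**3], axis=1)

# Toy stochastic dynamics: x_{t+1} = 0.9 * x_t + small noise.
rng = np.random.default_rng(0)
x = np.zeros(2001)
for t in range(2000):
    x[t + 1] = 0.9 * x[t] + 0.05 * rng.standard_normal()

Psi0, Psi1 = dictionary(x[:-1]), dictionary(x[1:])
K = np.linalg.lstsq(Psi0, Psi1, rcond=None)[0]   # finite-dimensional Koopman approximation

evals, evecs = np.linalg.eig(K)
print("leading Koopman eigenvalues:", np.round(np.sort(np.abs(evals))[::-1], 3))

def eigenfunction(x_new, i):
    # Approximate Koopman eigenfunction: psi(x) . v_i, with v_i an eigenvector of K.
    return dictionary(np.atleast_1d(x_new)) @ evecs[:, i]
```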
So what we do is say: let's take these observables g in a reproducing kernel Hilbert space. Let me say one thing first. You could say: you just associated a linear operator to a dynamical system, so maybe we can directly use part two, which was about learning a linear operator. But in part two I needed to observe inputs and outputs of the linear operator, so that I could try to associate them; here I don't have that, because what I observe is the trajectory. I could take a bunch of g's and go on, but how do you choose them? So one idea is that maybe kernels can be helpful here. What we do is restrict attention to observables in a scalar reproducing kernel Hilbert space, defined by one of the simple kernels from before, denoted by a small k, and I just want to find an approximation. And then some magic happens. Here is the one line of proof that I'm going to show you in this whole presentation, which is almost finished, by the way. Sorry, there is a typo here, this should be g_j, but you get the idea. You would like to find the W that takes this observable at time t and sends it to the observable at time t plus one. And you don't want to do it for just one observable; you want to do it, for example, for a whole orthonormal basis, so that you have a full description of the action on the space. The good news is that, because evaluating g_j is just an inner product, remember g(x) equals the inner product of g with k_x, I can use this property once here and once there, move the W onto k_x, and get this formula. Then, using linearity and so on, you can show that you can get rid of the g_j, because they're just an orthonormal basis, and this is what you get. Forget the computation; what did I get? The intuition is: I want to know how the Koopman operator acts on an observable, say the energy; but perhaps I can get that for free if I know how it acts on k_x, which is a somewhat special function in this space, because then, by inner products, I can see how it acts on the energy. So what's written here is: study the evolution of k. Find the linear map that sends k at x_t to k at x_{t+1}. And now this is an operator learning problem of the kind we saw before, where I have this guy and this guy. Of course, you cannot do it exactly, because you have infinitely many of them; you have to discretize and regularize, and this is the kind of estimator you get. You typically also want to put an extra rank constraint, but that's not crucial. You can go through the same story I told you before: this is going to be expensive, so you typically compress with the kind of techniques from part one, and you can prove that under some assumptions you can compress and still get guarantees. Although this is a bit of cheating, because the guarantees assume the data are sampled from the invariant distribution, which is data you would never really be able to get; so these are fairly minimal guarantees, saying that with that kind of data you can compress without loss. The problem of understanding what price you pay for observing a trajectory instead of samples from the steady state, we don't know how to solve yet.
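A minimal sketch of the kernel reduction just described: estimate the linear map sending k(x_t, .) to k(x_{t+1}, .) by regularized least squares, which gives one-step forecasts of any observable via a kernel ridge solve, and approximate Koopman eigenvalues and eigenfunctions from an n-by-n matrix. This omits the rank constraint and the Nyström compression used in the actual estimator; the kernel, the toy dynamics, and lambda are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

# Toy trajectory of a noisy nonlinear system in R^2.
rng = np.random.default_rng(0)
traj = np.zeros((1001, 2))
for t in range(1000):
    drift = 0.95 * traj[t] + 0.1 * np.array([np.sin(traj[t, 1]), np.cos(traj[t, 0])])
    traj[t + 1] = drift + 0.02 * rng.standard_normal(2)
X, Y = traj[:-1], traj[1:]                    # pairs (x_t, x_{t+1})

n, lam = X.shape[0], 1e-3
Kxx = gaussian_kernel(X, X)
Kxy = gaussian_kernel(X, Y)
reg = Kxx + lam * n * np.eye(n)

# Forecast an observable g: E[g(x_{t+1}) | x_t = x] ~ k(x, X) (Kxx + n lam I)^{-1} g(Y).
def forecast(g_values_at_Y, x_new):
    return gaussian_kernel(np.atleast_2d(x_new), X) @ np.linalg.solve(reg, g_values_at_Y)

print(forecast(Y[:, 0], X[0]))                # predict first coordinate one step ahead

# Approximate Koopman eigenvalues/eigenfunctions from the n x n reduced matrix.
M = np.linalg.solve(reg, Kxy)
evals, evecs = np.linalg.eig(M)
order = np.argsort(-np.abs(evals))
print("leading eigenvalues:", np.round(evals[order][:3], 3))

def eigenfunction(x_new, i):
    # phi_i(x) = sum_j beta_j k(x, y_j), evaluated at new states x.
    return gaussian_kernel(np.atleast_2d(x_new), Y) @ evecs[:, order[i]]
```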
Okay, but the meat of this second part is really this reduction: lift the problem of learning a dynamical system to the observables, which is not our idea, and then the observation that in a kernel space some extra simplicity comes out. This is hard for me to explain quickly, but let me give a shot at an example of how you could use this kind of stuff. This is a molecule, and you have 45 distances between the atoms, so you can view the molecule as a 45-dimensional vector. The molecule evolves over time, and you want to understand how it arranges itself in space, how this set of atoms displaces itself. And this is a basic case where we know that the whole dynamics is actually governed by two angles. The red part is just the raw data, time versus the 45 dimensions, and the other part appearing here, with two colors, is the value of those two angles. So what did we do in this case? We took this raw data, half a million points of this 45-dimensional data, so half a million of these trajectory pairs, and we computed the first two eigenfunctions of the Koopman operator using the exponential kernel. This is the projection onto the first eigenfunction, this the projection onto the second, and this is what you get. How do you interpret this? The eigenfunctions describe the configurations, and what you see is that, roughly speaking, there are three of them, three clusters described by your eigenfunctions, which suggests there are three metastable states, three stable configurations of this set of atoms. Of course, these results then need to be validated outside the machine learning framework to see whether they make sense, but this is just to give you a flavor of how the thing can be used. The point most relevant to my presentation is that we could run it on half a million points, which is a fairly large-scale data set. All right, this is more or less everything I wanted to tell you. The three take-home messages, if you wish, are: you can modernize kernel methods and make them efficient, with guarantees; you can adapt these techniques to learn operators; and through Koopman operator theory you can try to bring the techniques you use for scalar functions to dynamical systems. What are we doing right now? A few things. One is the problem of learning the kernel, which is really crucial: everything is hidden in the kernel. If the kernel is good, you can do a lot of stuff; if the kernel is not good, you can't. That's the big gap with neural networks, so how can we bridge it, or at least improve on what we have so far? Then, look more at the operator scenario, because I think there is much work to be done there. And the last thing, which we have more or less started doing, is to apply this Koopman operator theory with kernels and approximations not only to analysis and identification of dynamical systems, but to control. That's more or less what we're doing. All right, I'm a bit over time, so thanks for your attention. Yeah. You could do that, that's forecasting. So this plot is more about visualization, if you wish.
So in this case, the Koopman operator can be used to do forecasting: I give you an x you have not seen, tell me the next step. Or it can be used to visualize the system and analyze it, to try to understand its main features; this plot is about the second use. For forecasting, what you would do is take this data, use part of it to train and learn the Koopman operator, and then use it to predict the next step. If you do that, you see it works decently well, but if you look far enough ahead you lose precision; my co-authors are working on that, okay? No, both, both, because you can predict observables, and you can also pick as observables the coordinates of your system, and then you can also predict the evolution of the states. If the space of functions you consider is rich enough that the coordinate functions live in it, you can just specialize to them; there is a nice equation showing that if you have the kernel you can use it directly, and if you have a sequence of observables you can predict the next observables. Yeah. Thanks for the really clear talk. I have a question regarding this compression, the second compression that you drew over there. As far as I understood, this is done in the preconditioning part of the problem. So are you aware of something like that being done for artificial neural networks? Because I think a similar trick could be applied there. So there was a premise, which is almost correct, and then a question. The premise was that we only use compression for the preconditioning: no, we use it twice. We use it to go from the big system to the small system, and then for the small system to compress its preconditioner. So we compress the model and then we compress the preconditioner. Now, the actual question was: what about neural networks? I think these ideas really dig their hands deep into the linear structure of everything, the Hilbert space structure. The closest thing people do is sketching, where you use random matrices: you have this big matrix, and rather than sub-selecting columns you hit it with a random matrix, a matrix whose entries are random Gaussians, for example. That's what's called sketching, and you can prove similar results for it. And that kind of thing is sometimes used with neural networks, to reduce the size of the attention and find a more compact representation of it within backprop. It's also used in transformers, because transformers have a very similar shape to kernel methods, so it's used to compress, before the softmax, when you have to compute that similarity, whatever it's called. That's basically where I know this stuff gets used a little. Here it is provably crucial; in those places, I'm not so sure. Thanks. That was a great talk. Thanks, Lorenzo.
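For what it's worth, here is a minimal illustration of the sketching idea mentioned in the answer above: compress a tall least-squares problem by hitting it with a random Gaussian matrix and solving the smaller sketched problem. The sizes and the plain Gaussian sketch are illustrative assumptions; in practice structured sketches are used for speed:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 20000, 50, 500                       # n rows, p unknowns, k << n sketched rows

A = rng.standard_normal((n, p))
x_true = rng.standard_normal(p)
b = A @ x_true + 0.01 * rng.standard_normal(n)

S = rng.standard_normal((k, n)) / np.sqrt(k)   # random Gaussian sketch
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)   # solve the small problem
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)            # full solve, for comparison

print("relative difference:", np.linalg.norm(x_sketch - x_exact) / np.linalg.norm(x_exact))
```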
When I played with random features a while ago, they seemed to have a bit of a curse of dimensionality: on one- or two-dimensional regression problems you got really good results really fast with very few basis functions, but in dimension 10 you already needed lots and lots of basis functions to get the desired accuracy. So I didn't quite get where the dimensionality of the input space enters your guarantees. So, the setting I'm considering is more general than that; you basically just assume the kernel is bounded. But the one example to keep in mind is smooth functions in R^d with smoothness s, which is basically this. When you derive the bounds, they look like this, and everything depends on the dimension: this is the error you get instead of one over the square root of n, your lambda should be n to the minus one over 2s plus d, and m is one over lambda. These are the real results. You don't see them in my talk because I wrote them down somewhere, I'm stupid and I already forgot where, and because I took s equal to d over 2, so this becomes one, this becomes one, and this becomes one half, and I got square root of n. But if you do the analysis in the more general setting, you see that the dimension enters, and it enters here. So s was the number of times you can differentiate the function? s is the number of times you can differentiate the function, and this is the dimension. So basically, if your problem is smooth enough, you get these bounds. And this goes back to the question from the person behind you: is this m optimal? No, it's not. In fact, if you pick the points not uniformly at random but in a better way, you can put an exponent here that is much better than that. So that answers the technical question hidden in what you were saying, namely where is the dimension: it's there, I just didn't write it, because I put myself in a setting where I could remove it surgically. The practical question is whether this stuff would work in higher dimensions, beyond regression problems that are one- or two-dimensional. My experience is that in one, two, three dimensions, instead of picking points uniformly at random you can do much better, because you can use good quadrature and interpolation techniques; those will work well and this stuff will not work very well. If you go to a setting where the dimensions are high, but not too high, then all the quadrature techniques, which typically rely on some notion of partitioning or at least good coverage of your space, will be terrible, and this stuff gives you a compromise. In those regimes you don't have the luxury of dictating the accuracy you want, meaning you cannot run the exact algorithm anyway, so this gives you the compromise you can actually get. And in my experience, most of the time the issue is not so much m as the kernel: you often see that performance saturates. The issue is not that you need a lot of centers, but that at some point your model is too simple somehow. I'm happy to discuss it, but these are more heuristic comments; the first question is easier, that's just the math.