Okay. So yeah, particle physics, for example, evaluation of particle accelerator sensor data, and up to the largest scales, for example, simulation of galaxies. And you have use of machine learning in all these fields, from the very pure, like helping to solve differential equations in mathematics, to very applied engineering. So today, here, this workshop is on materials, right, so we are in the middle row here: physics, chemistry, materials science, everything on the scale of atoms, basically. And since this is a different talk, let me skip that slide here. So, what is machine learning? How do you define machine learning? That's difficult of course, it's a big field. Here are two definitions, one by Tom Mitchell in his 1997 book. He says, in a few more words, that machine learning is about algorithms whose performance improves with data. I like this definition because it captures the learning effect: if you think about points in a plane and you fit a line through them, well, the more points you have, the better your line fit will be, right, learning from data, from experience. The definition by Arthur Samuel, and you see it's from 1959, so quite a while ago, is that machine learning is about algorithms that solve problems without having explicit, task-specific solutions. So that means, instead of providing, for example, an algorithm to find the shortest path in a graph, you explicitly say: look, these are graphs, these are examples of shortest paths in those graphs, now figure out what characterizes them. So no problem-specific algorithms. Okay, now, in a slightly more abstract way, machine learning is about the systematic identification of regularities or correlations in data, so finding correlations in data sets. And machine learning is often used to predict something, to analyze data, or to control a system. Now at this point, at this abstract level already, there should be some questions, like: okay, but what are the errors that these methods make, do we have error bars or not? And are these models interpretable, do we know why a prediction was made, can we trust these predictions, how reliable are they? These are good and important questions. And here's just one example, a well-known one. This is from an image classification neural network: the network was trained on a database of images, which were classified according to what they showed, for example this image shows a horse, and the network had to predict the appropriate class for new images. It was performing very well, and researchers then analyzed how the network made its decisions. They wanted to know which features the network was using. Is it looking for edges, maybe, or certain colors, or combinations of shapes and colors? And when they analyzed the network, they found out that it had indeed found a very good correlation: it turned out that most of the horse images in that database had a copyright notice at the bottom. So what the neural network did was realize: if this copyright notice is there, the image is showing a horse. So the network did exactly what it was told, right, it identified horses based on a correlation in the data, but it did so in a way that was very different from what was expected, and this is, I think, an important lesson when you use machine learning in your work.
Right, try to make sure that the network is really doing what you want it to do, and for the right reasons. There are countless other examples like this. It's clear there have to be, because if you ask a machine to do a task but you don't tell it how to do it, you're bound for surprises like this one. Okay, that was my general, very brief introduction to machine learning, and the caveat that goes with it. Now I want to very briefly show some of our work, and then we'll go to the kernel learning. So we are interested in the exploration of large molecular and materials spaces, that's the right-hand column. I think it is clear that molecular and materials spaces grow combinatorially: if you think of molecules as labeled graphs and you count the number of possible labeled graphs, you see that this grows extremely quickly, and for materials it's the same thing. To be able to explore such spaces and find materials with certain desirable properties, or optimize existing ones, you need as a starting point the ability to predict properties of materials, like the stability in the previous talk, and that's the middle column. So in the midterm we are interested in accurate and precise property predictions for materials; this sketch shows a molecule, but it's the same for materials. This requires, at least for the properties we're interested in, and I will show an example on the next slides, the ability to run long dynamics simulations of large supercells. So we want to be able to simulate large atomistic systems over long time scales. This is possible with classical force fields, but unfortunately you either have to parameterize the force field for a specific system to get reasonable accuracy, which can be quite difficult and tricky, or you don't have the necessary accuracy, and even if you parameterize them as well as you can, they may still not be flexible enough. So either one of these difficulties, or all of them combined. What we want to do is use machine learning to construct so-called machine learning potentials that, given the positions of the atoms, but not the electrons, just the atoms, predict the energy, and the negative gradient gives you the forces acting on the atoms. These are machine learning potentials. Let me be extremely brief here, maybe you know some of these methods: the current workhorse method, density functional theory, scales, let's say, roughly cubically, depending on which functional you use and what exactly you do. I do see it, okay, sure, please do ask questions. I just saw a question, I don't see the chat, but a window popped up here: what is the accuracy? In this context, when I say accuracy, I mean the error in the energies and forces, measured for example as the mean absolute error or the root mean squared error, in whatever units you prefer, like electron volts for the energy per atom, or for the forces, evaluated on atomistic systems that were generated according to the same distribution as your training data. This is the classic machine learning thing to do, but it's a good question still, because this is certainly not enough. I'm digressing a little bit, but the question is, I think, how do you judge how good a machine learning potential is, and that's a difficult question.
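For reference, the hold-out errors just mentioned are usually written as below; the per-atom energy notation is mine, not from the slides, and the last relation is the force as the negative gradient of the predicted energy mentioned earlier.

    % Hold-out test errors and forces (notation mine, not from the slides).
    \mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{E}_i - E_i\right|, \qquad
    \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{E}_i - E_i\right)^{2}}, \qquad
    \mathbf{F}_j = -\nabla_{\mathbf{R}_j} \hat{E}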
But I think what everyone can agree on is perhaps that once you have shown that you have low retrospective errors in energies and forces, you will use the machine learning potential to run a simulation, and if your simulation crashes, then your potential was not so good, right. I'm simplifying here, but for high-dimensional machine learning potentials, so-called holes in the potential energy surface, regions where there is just not enough training data to make a good prediction, are an actual problem. It can be taken care of, but it has to be taken care of. So use the potential, and then you will see how good it is; that is my answer to how you measure that. And here, in this slide, let's not spend too much time. The point is that compared to frequently used DFT parameterizations, current state-of-the-art machine learning potentials are about three to four orders of magnitude faster, which is a substantial improvement, otherwise this whole thing wouldn't be worth it. And if you really go for speed, you can gain another three to four orders of magnitude; I'll come back to that. Let me demonstrate the idea and then I will show some work from our group. This is again the idea of learning the potential energy surface. In this sketch the horizontal axis is your atomistic system, so the coordinates of your atoms and the chemical species and whatever else information you have there, like spins for example, or excitations, or charges, but let's not get into that. So let's just say the horizontal axis is positions and chemical species, encoded somehow in a high-dimensional vector space. The vertical axis is what you want to predict, let's say the energy; you can also try to predict tensor-valued properties, but let's not go into that either. The black line would be what you would get if you could use your ab initio reference, like DFT or quantum Monte Carlo or coupled cluster, to calculate the solution for many, many points in this input space, but you can't, because it's too expensive. What you can do is run the reference calculations for some points, the red points, and then train a machine learning model on these selected reference calculations. Then you get a fit, the blue dashed line, and if you have done your job well, it will be close to the ab initio reference, and then you can use the machine learning potential instead of the ab initio potential, getting the speed-up for the larger system sizes. There's a lot more to be said about this, but let's not do that. Maybe just one point that I think is important: if you have a finite number of training points and you fit a very high-dimensional, flexible function through them, well, of course there are many possible solutions, so you have to make additional assumptions. It turns out that the usual machine learning technique of adding a so-called regularizer, which controls the smoothness or complexity of your fitted surface, goes along well with ab initio potential energy surfaces, which tend to be smooth, and we can exploit that here: we find the smoothest function through our training data. There are many other points like this, but let's not go through them today. I basically want to show one example in this general introduction.
So this is work from our group. As mentioned, we are interested in predicting properties of materials, and here we are trying to predict the thermal conductivity of a material, so how well it transports heat. We want to use the so-called Green-Kubo formalism for this, which requires us to run long simulations of large supercells. We will not go into the Green-Kubo method here, but if you're a physicist you will have an easy time figuring it out; I just want to say we need long simulation times and large supercells. Here you see a preliminary result figure. On the horizontal axis you see the simulation time in nanoseconds, and on the vertical axis you see the estimated thermal conductivity kappa. The different lines are different supercell sizes, and this is for zirconia, pristine zirconia, no defects or anything, just pristine. We chose zirconia because we had reference data from a previous, purely ab initio study. You see the black line here, and that was a substantial computational effort at the time. And you see how far it got: with the black line you don't know whether you're converged in time, and indeed you're not, and you don't know whether you're converged in size, this was about 100 atoms, and indeed you're not. But if you run a machine learning potential, in this case we used a neural network potential, SchNet, whose author, Kristof Schütt, sits here, hi, and I'm sure he will talk about it too, or perhaps about its successor, PaiNN, then you see that we can easily run long simulations with a thousand or a few thousand atoms, and then we see: okay, if you run for half a nanosecond with a thousand atoms, we're approximately converged. The graph itself would need a lot more explanation, but this is the gist of it. Here's a comparison with experimental data and also other machine learning models, but here too I think the statement is just: if you compare the red diamonds with the black crosses, you see that we are roughly in the right range compared to experiment. So that's what I wanted to show as an example. Now the other thing I want to show today is a machine learning potential that we are currently developing. I should say as a prelude that at the moment people mostly use three different types of machine learning approaches for machine learning potentials, namely linear models, kernel models, and neural networks, and there are different types of neural networks; maybe I can come back to that, I'm not sure. And this is an attempt, well, maybe at the other end of the spectrum from neural networks, whatever that means, let me show you. What we tried to do here is we said, okay, let's not jump on any hype train right now, but let's take the concepts that we think will lead to a machine learning potential that is both extremely fast, as fast as classical potentials like the Lennard-Jones, Stillinger-Weber, or Morse potentials, so extremely fast, I don't think you can be much faster without coarse-graining, and that at the same time will be robust and interpretable. So we didn't aim for the highest accuracy, and indeed the accuracy of this model will not be comparable to the accuracy of, for example, a large neural network potential.
Because it has many fewer parameters, right, but it's robust, it doesn't have holes, and it's interpretable, you can plot the different components and look at them. So how did we do that? Okay, one trick that basically every potential uses to be able to scale up to large systems is that you don't predict the energy of the whole system at once. With machine learning you could do that, take the coordinates of all the atoms at once and predict the energy, but that's problematic for larger systems, you have too many degrees of freedom. So what you do is assume that your energy can be written as a sum of atomic energy contributions. To be clear, this is not exactly the case, this is an approximation; the energy is a many-body function of the atomic coordinates, but you can approximate it this way, and in practice this often works quite well. So you predict an atomic energy contribution for each atom and sum them up. That's what we use. And it's a much easier problem, because you only need to consider the neighbors around each atom, within a radius of maybe three to ten angstroms, on that scale. So we use this standard trick, and then our model is a many-body expansion: we have one-body terms, one constant offset per chemical element species; we have two-body terms, that is, energy contributions from each pair of atoms, from the central atom in the local environment to each of its neighbors, but separately; and we have three-body terms, and one could go to higher-order terms. A very well-known thing to do. And it's different from most of the current machine learning potentials, which have higher fitting capacity because they look at the local atomic environment as a whole, whereas we just look at all the pairs and all the triplets; our model never sees all of them together. We pay with accuracy, but we gain in computational efficiency and robustness. So how do we represent these one-, two-, and three-body terms? For the two-body terms it's easiest to see: a two-body term is just a function, the atomic energy contribution as a function of the distance between two atoms, and we learn such functions using splines. Splines are piecewise low-order polynomials with finite support, and finite support is important. Imagine you have the distance on one axis and the potential on the other, and you place small polynomial basis functions along the distance axis; then you fit their coefficients, and that gives you the curve, the pair energy as a function of the distance. And since these little basis functions, the splines, have local support, when you evaluate the potential, in our case, we only ever evaluate at most four of these low-order polynomials, and that is extremely fast to do. Splines are a well-known technique; they get rediscovered every generation, basically. We do take some ideas from more modern machine learning, for example for the regularization: we don't use the usual norm regularizer, but we regularize the curvature. So we are saying: if you have data here, then the data stops, and there you have data again, then choose the coefficients such that in between the function changes as little as possible. This is very easy to do in our setup; it can also be done for high-dimensional potentials, but there it's a bit more complicated.
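To make the two-body spline idea concrete, here is a minimal sketch using SciPy's CubicSpline on a made-up Morse-like dimer curve; the data, distance range, and spline type are illustrative assumptions and not the parameterization or curvature regularization used in the actual model.

    # Minimal sketch of the two-body idea: fit a smooth spline to pair energies as a
    # function of interatomic distance. This is NOT the actual implementation from the
    # talk (no curvature regularization, no three-body terms); the Morse-like toy data
    # and all parameter values are made up for illustration.
    import numpy as np
    from scipy.interpolate import CubicSpline

    # Hypothetical reference pair energies, e.g. extracted from ab initio dimer curves.
    r_train = np.linspace(1.5, 6.0, 12)                      # distances in angstrom
    e_train = (1.0 - np.exp(-1.5 * (r_train - 2.2)))**2 - 1  # toy Morse-like energy

    pair_energy = CubicSpline(r_train, e_train, bc_type="natural")

    # Evaluating the spline touches only the few local polynomial pieces around r,
    # which is why this is so cheap compared with a full ab initio call.
    r = 2.75
    print("E_pair(%.2f A) = %.4f" % (r, pair_energy(r)))
    print("F_pair = %.4f" % (-pair_energy(r, 1)))  # force = -dE/dr from the spline derivative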
Okay, of course, in these five minutes I can only give you some keywords that you can look up; it's on the arXiv, you can read the whole preprint if you would like. Here's a question: does the machine learning potential have to be trained on density functional theory? No, you don't have to; usually it's done that way, but that's because DFT is so popular. You can train it on any reference you want, and indeed, for example, in my group we are currently building models based on quantum Monte Carlo. Most of the time it's DFT, but that's because DFT is popular, that's the main reason. Okay, so what do we do with this model? The right column here just shows some phonon spectra, to show you that it works; there's a lot more validation in the preprint. On the left, what you see on the horizontal axis is computational cost, so how fast it is. In the lower corner you have density functional theory, in the middle you have some state-of-the-art machine learning potentials, like GAP/SOAP for example; we have a newer version of this graph that also includes moment tensor potentials, so the current state of the art is there, not all of it of course, but some of it. On the very left you have classical potentials like Lennard-Jones (LJ) and Morse, all refitted to the same tungsten data in this case, and there you have, or will have, our potential in the two-body and two-plus-three-body parameterizations. And we're happy that it improves on the current Pareto frontier of error versus computational cost. So, finally, it works, and at the moment we are trying to apply it to as many systems as possible to learn about its limitations; it will fail eventually, because it has limited fitting capacity, but the question is when. But let's not get into this. We have some preliminary experiments on the zirconia that I mentioned, and you see basically the DFT energies versus our model's energies. There's a question: do we have any control over the errors of the model, or some systematic way to improve results? Yes, I'll come back to that in a moment. On the right-hand side you have the same for the forces, and sure, if you had a large data set and a high-capacity model, this would be almost a straight line for the energies, and very close to it for the forces. This here is just the two-body model, I think, and it still has errors, but errors that are good enough to run simulations, and we are working on improving them further; I think that's the message here. So, there was a question about error control. Sure, the primary thing to do would be to provide more training data: the more data you have for training, the more accurate your model will become. And I'm not sure I will be able to remember all these questions, but maybe Patrick can help with that. So the first thing you can do for a machine learning model is give it more data; as long as it has enough parameters, enough fitting capacity, it will get better the more data you throw at it. Here, though, we were looking for a model that we can train with as little data as possible, because we want to reduce the number of DFT calculations. And then, yes, the various models differ: some of them have predictive variance and therefore error bars, like Gaussian processes, which give you a measure of how reliable a prediction is, if all the assumptions of the Gaussian process model are fulfilled, which they usually are not.
But you can build ensemble models, things like that; you can analyze your errors. For example, instead of minimizing the root mean squared error, which is a form of average error, you can also try to minimize the maximum error; that's more tricky and more volatile and has its own problems, but you can. So there are a lot of things you can do technically to play with your model, understand your errors better, and improve them in whichever way is necessary for your particular application. For example, one thing that can be important for high-dimensional machine learning potentials is to recognize when your simulation has moved into a corner of phase space where you don't have enough training data, because there your predictions will be unreliable. This simple model here doesn't have that problem as much, because it doesn't have so much fitting capacity: these are all low-dimensional, like two- and three-dimensional fits, you don't need much data, you can look at them and check that they're roughly physical, and then you just run the simulation. The accuracy may not be the same, of course, but it provides a robust baseline. Let me go towards the end of this; there is more data on hydrogen under pressure, but let's skip all of that now. Okay, that's it. I just wanted to give a very short general introduction to machine learning and a spotlight on what we're doing in my group, and I would suggest that we now answer questions, and then we move to the second part, where I will talk a little bit about kernel learning. I see the first question. Thank you for a nice talk. Can you go back to those zirconia results? It's very loud. Here? No, yes, this one? No, the following one. This one. So, do you have an order of magnitude for the number of DFT calculations you had to produce for the training set that you show here, since you said this is much more efficient? Yes, okay, I can try. Take this, for example: on the left-hand side you have the two- plus three-body parameterization, and on the right you have a high-dimensional potential, SNAP, which is based on the bispectrum, a good one, let's say. On the horizontal axis we have the number of training configurations, and on the vertical axis we have errors, top row energies, bottom row forces. If you look at this, you will notice different things. The point here is that the high-dimensional models in the right column start out with higher errors for the same amount of training data, because they have more parameters and are harder to fit, but then they improve a lot more, and if you feed them more data they will improve further, whereas our model starts low but saturates relatively quickly. You asked for quantitative numbers: if you look at the horizontal axis you see that already in the top row, with something like ten training points, we're already not that bad. And those are random configurations, so that's being generous. We are currently setting this up; for a real application you wouldn't choose random structures, but you have to think about how to generate them. That's actually not a trivial point, because, for example, if you build a high-dimensional potential and then run a simulation, you might discover that you need more data in some region, and then you need to actively learn that.
So, my current idea of how I want to use this is: if you're given a set of data, you can use a simple heuristic like choosing k points that are farthest from each other, in a greedy way. You start with one point, choose the second one farthest from it, then choose the third one farthest from the first two, and so on, and that roughly covers the space (there is a short sketch of this heuristic after this exchange). And what we could also do is: with very little data, train this potential here, which can deal with little data, run a simulation, which gives you a large number of different configurations, use the k-farthest-points heuristic on those, do the DFT for some of them, retrain the model, and so on. Hopefully a low number of iterations suffices to set that up; we'll see. Okay, thanks. Question at the back. I was just wondering if you could comment a little on how errors in terms of RMSE map to errors in your observables, like your kappa, and whether you have systematically looked at tuning your machine learning model in terms of RMSE and then seeing how that affects the accuracy of what you really care about. Thank you, that is a very good question; I would also like to know. In principle one should, and I think someone has recently done this for machine learning potentials: you can propagate your uncertainties through to the final observable you're interested in, like the thermal conductivity, but we haven't done that in my group. So I cannot give you anything quantitative here, but I would say the general feeling is that once you have reached a sufficiently accurate model, whatever that means, it doesn't matter much whether you make it more accurate or not; it's probably more worth spending your time on getting the calculation of the observable right, or on fixing other parts of the model. We seem to observe that increasing the accuracy of the machine learning potential for its own sake is not necessary to get a reasonably converged observable, but I cannot give you any numbers right now; I might be able to in a few months. Thank you for this very nice talk. My question is: when calculating the thermal conductivity for the zirconium oxide, how do you define the heat flux? The heat flux. I'm not sure I can answer this fully right now, but what I can say is that we looked into different definitions of the heat flux. And for the machine learning potential that we chose, this SchNet model, which is a message-passing neural network, we saw that the original heat flux definition we wanted to use would not have been right, because there is information coming from outside the local cutoff. We adapted the heat flux; we're currently writing that up, so if you wait, I don't know how long, but hopefully less than three months, you can read the answer to this question. It was an important point of all of this: which heat flux definition to choose, which implementation, because for a while there was even a wrong implementation. So it was an important part of this study.
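Coming back to the greedy farthest-point heuristic described a moment ago, here is a minimal sketch; it assumes each configuration has already been encoded as a fixed-length descriptor vector and uses plain Euclidean distances, both of which are my illustrative choices, not details from the talk.

    # Minimal sketch of greedy farthest-point sampling, as described above. It assumes
    # each configuration is already encoded as a fixed-length descriptor vector and
    # uses plain Euclidean distances; both choices are illustrative assumptions.
    import numpy as np

    def farthest_point_sample(X, k, start=0):
        """Greedily pick k rows of X that are far apart from each other."""
        selected = [start]
        # distance of every point to the closest already-selected point
        d = np.linalg.norm(X - X[start], axis=1)
        for _ in range(k - 1):
            nxt = int(np.argmax(d))          # point farthest from the current selection
            selected.append(nxt)
            d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
        return selected

    # Toy usage: 1000 random "configurations" with 32-dimensional descriptors.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 32))
    print(farthest_point_sample(X, k=10))  # indices of configurations to run DFT on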
Thank you very much for the fantastic talk; I wanted to make a remark related to uncertainty propagation, which is something I've been working on in Michele Ceriotti's group. The idea is, you start by having an uncertainty on the energies, and you use reweighting and propagation schemes to account for this uncertainty on top of the statistical uncertainty of the observable. And what we did find is that if your uncertainty on the energy is around what we would accept in this community, then it turns out it is not particularly damaging for the convergence of the statistical observable itself. The math is a bit complicated, but we can discuss it further later; we recently put out a preprint on this. And sorry, one more thing: very recently, together with Aldo, Andrea, and Claudio, we also published a comparison between uncertainties on energies, calculated for example through committee models, and distances with respect to the sampling density, and I think that's something Aldo Glielmo will also touch upon in a tutorial next week. So thank you very much for the assist. Yeah, you're welcome, I think it's an important topic. I think we're moving on to expert questions here, so maybe one more. One more question: in the slide where you show the computational cost versus the error of the model, you probably said it already but I missed it, what data did you use? This was a tungsten data set by Gábor Csányi, so a data set of pure elemental tungsten, just one component, but with a lot of variety in the structures: bulk, surfaces, defects and so on. By now we have also applied the model to other data sets, of course. There might also be questions in the chat now. Yeah, we're looking at it now. Which one? So: as a learner of ML, I would like to first verify the ML method by calculating some electronic property of a set of DFT-optimized materials available in the literature; what should be the procedure? Well, there's a lot of benchmark data in the literature by now, but of course it's important to understand what you can demonstrate with a given data set. For example, and I think Felix mentioned this, there is the QM9 data set, which has about 130,000 small organic molecules with some properties calculated for them at the DFT level, but these are all in their ground-state minimum geometry. So if you train on part of this data set and predict the rest, then you know how well your model can predict these properties when the structure is in the ground state, but it doesn't tell you whether you can take a random starting structure and relax it into the ground state with your model, because the model has never seen anything like that. So be careful: that is what you can learn from a given data set. That said, there are lots of freely available data sets, which is great, all kinds of materials data sets. Oh, and you also have to pay a bit of attention to what you're doing with the machine learning. If you're doing something like a machine learning potential, it's important that your data set is homogeneous, in the sense that, for example, you cannot just pull down data from the Materials Project and use it directly; you can, of course, but your error will not be as low as it could be, because some of the materials have been computed with different settings, as far as I remember, for example the k-point density. Imagine, for example, that half of your data set was computed with one set of settings for your DFT method and the other half with slightly different settings.
From the point of view of the machine learning model, these are two different functions, and it tries to learn two different functions at the same time, which will only work up to the difference between the two functions. That can happen accidentally, too. For example, I know a story where someone was running DFT calculations on his cluster, and when he trained the machine learning model, it didn't work as well as he expected. It turned out that on half of the machines in the cluster there was a constant offset in the energy; for whatever reason, no one figured out why, but it was there. So this episode is maybe also interesting in that it shows that training a machine learning model can be used to identify problems in the data; this aspect is not investigated so often, but I think it's an interesting one. But yes, again, look at quantum-machine.org, for example, I think that is the address, or qmml.org, or just look at the tons of literature: lots of data available, freely. So I think there are no more questions from the chat that we haven't already addressed, and since I don't see any other questions, we can probably move on. Okay, great, let's change gears now, and let's find out if I can use my iPad. Yes, let's switch to learning with kernels. As I mentioned, there are three major model classes around, for machine learning potentials but also for many other tasks in the materials community: linear models, kernel models, and neural networks. There are other methods, like random forests, for example, but they tend to be used less often, for whatever reason, and then there are of course special methods developed for particular applications. But linear, kernel, and neural network models are the main ones, and here I would like to talk a bit about kernel learning. I cannot do all of this now, we already spent some time, but I certainly want to cover the introduction and the key ideas behind kernel learning. Let's see in how much detail we can do the regression, but I at least want to show a little bit about how to do regression with kernels; then let's take a look at a few standard kernels and briefly address how kernels are characterized mathematically. I think that's how much we can do, so I'll be very selective in the content here because of time. So, what is kernel learning, which problem does kernel learning address? Well, if you have the choice and a linear model works for you, then absolutely take the linear model. It's much easier to train, it's more robust, it's faster than all the others, and we understand it well: least squares regression, principal component analysis, there are literally dozens of linear algorithms. But data is often nonlinear, and if your data is nonlinear, you have to do something about it. So what can you do? And by nonlinear I mean, let's look at the regression problem: you have a set of inputs, let's call them x_i, where x could for example be a description of an atomistic system; you have your outputs, your labels, y_i, for example the energy of that configuration; and you want to learn the map from the x to the y. That's supervised learning, regression basically. So what can you do if the function you want to learn is nonlinear? Well, you can use a nonlinear algorithm, a neural network for example, so you have a nonlinear model and try to capture the nonlinearity with that.
You can also try to change your input features. For example, if your inputs are described by, let's say, three dimensions a, b, c, then just use more features: the original a, b, c, but also a squared, ab, b squared, c squared, bc, abc, products of features, for example. What you get is polynomial regression, a lot of new features. And since these new features are nonlinear functions of the original input features, maybe a linear model in these nonlinear features can capture your nonlinearity. Or you can try to do this systematically and implicitly, working in such new nonlinear features without constructing them; that wasn't said entirely precisely, so let me show you kernel methods on the next slide. So let's look at an example, that's maybe better. Here you see an example of a classification problem, so our labels y are either orange or blue, and you have one input dimension x. The task would be to separate the two classes. If you use a linear model, that means you want to find a dividing hyperplane, and a hyperplane is an object of dimension one less than the enclosing space. In this example you have one input dimension, so one dimension less is zero, which means you need to find a zero-dimensional object, a point on this line, that separates the two classes. That is not possible in this case: it's a nonlinear problem, and your linear model doesn't work. But what you can do is transform your features, for example by adding a second feature, which in this case is the sine of the first one. And now, again with a linear model, our input space is two-dimensional, so one dimension less is a one-dimensional object, a line, that should separate the two classes, and the x axis here neatly separates them. So what has happened here is that a nonlinear problem in a low-dimensional space became a linear problem in a higher-dimensional space by applying the right transformation. And that is what kernel learning is about, half of it, let's say: you want to somehow transform your input features such that your problem becomes linear in the new features. The other half is that we want to do this implicitly. Here I can do it explicitly: I take my input points x and I just add a second dimension, sin(x), and that's it. And you can do that. But in a real problem you might have hundreds or thousands of input dimensions, and if you then add products like on the slide before, you add polynomially many dimensions, and very soon you have too many features to do your linear regression explicitly. It also turns out that there are interesting spaces you can do regression in that are infinite-dimensional, and how would you do that explicitly? It's a bit tricky. So you want to avoid that. And how can you do that? The second part is the observation that there is a class of functions, the kernels, which you evaluate in your original input space, here on the left. The kernel function takes two inputs from your original space, so two points here on the line, and it returns a number, and the property of kernels is that this number is the same number you would have gotten if you had calculated the inner product in the high-dimensional space with the new features, the transformed space.
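A tiny numerical check of this property, with the second-degree polynomial kernel standing in for whatever kernel is on the slide (that specific choice is mine): for two-dimensional inputs, k(x, z) = (x·z)² returns exactly the inner product of the explicit three-dimensional features φ(x) = (x₁², √2·x₁x₂, x₂²), which the kernel never has to construct.

    # Tiny check of the kernel property: k(x, z) equals the inner product of explicit
    # transformed features phi(x), phi(z), without ever constructing them in general.
    # The polynomial kernel k(x, z) = (x . z)^2 is used here as an illustrative choice.
    import numpy as np

    def k(x, z):
        return float(np.dot(x, z)) ** 2            # kernel, evaluated in the original 2D space

    def phi(x):
        # explicit feature map for this particular kernel: a 3D feature space
        return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])
    print(k(x, z), np.dot(phi(x), phi(z)))         # the same number twice (here 1.0)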
So instead of having to explicitly compute inner products in that high-dimensional space, where it would not be practical, you can calculate the kernel function in the original space, and you get the same numbers. So kernels are a way to compute inner products in a high-dimensional space without ever going there; you stay in your original low-dimensional space. And these two things together, the nonlinear transformation plus implicitly working in that space through kernel functions, is what is called the kernel trick. Okay, so that was the gist of it in a nutshell; let's take a more careful look at this. Maybe one more thing before that: why inner products? I said you can calculate inner products in your high-dimensional space, but why is that useful? Well, if you think about it, inner products contain a lot of geometric information. You can see here, for example, that if you know the inner products between points, you can compute the distance, or squared distance, between them: ||x - z||^2 = <x, x> - 2<x, z> + <z, z>. So inner products can tell you something about distances between points. They also contain angular information: the cosine of this angle theta here is the inner product divided by the two lengths, which again you can compute using inner products. So if you can compute inner products, you can get angular information. That means: imagine we have n training points x_1 to x_n, and you know all the pairwise kernel evaluations, all the pairwise inner products. Then you know all the distances between these points and all the angles, which is a lot of geometric information: you know how every point relates to all the other points. So the inner products convey all the information you usually need to run a linear algorithm, like a regression or a dimensionality reduction, for example. There is some information that is not contained in the inner products, for example you don't have an absolute origin, but usually you don't need that. Okay, let's ignore the rest; that is the important part about why inner products. So what you have seen so far is the key idea, and now let's take a look at an example, at regression. My original plan was to use the blackboard, but the iPad has a notes feature, so let's see how well that goes. I'll try to keep it a lot shorter than I planned; I wanted to do the complete derivation with you, but let's do just the important steps. So what's the situation? Let me also switch the microphone, I need to be more stationary now; let's see how that goes. We want to do regression. Remember, we have n pairs: inputs x_i and corresponding labels y_i. This is our data, and we want to learn the function that maps x_i to y_i. Specifically, our assumption will be that y_i is some true function f, which we don't know, evaluated at x_i, plus some noise epsilon, where this noise is distributed according to a normal distribution with zero mean and some variance, and, importantly, epsilon is independent and identically distributed. So we have some noise on each label, but this noise is independent for each training example (x_i, y_i). That's the situation. Let's also introduce some notation: we put the x's into a matrix, capital X, containing x_1 to x_n, so this matrix is real, we have n training data, and each input vector has d dimensions. Everything I show here will be for real numbers, but you can do kernel methods with complex numbers as well if that is what you need.
And we'll have a shorthand: the vector y, which contains all the y_1 to y_n. Okay, so what's the model? Let's start; I think we can do at least this. Let's take a look at the linear model, and then switch from that to the kernel model. For linear least squares regression our model looks like this; we write f hat for the estimator. Of course, notation between machine learning and physics as a whole is a complete mess, as you will find out, but there's nothing to be done about that. So, our model looks like this: if you get a new input, let's call it z, then our prediction is a constant offset b, plus each component of the vector z multiplied by a parameter beta_i; these are the regression coefficients. So beta is a d-dimensional vector, and this is what we want to learn with the help of the training data: the constant offset b and the vector beta. The first thing, for this pedagogical exposition, is that we will get rid of the b, because we don't want unnecessary complexity here. There are various tricks you can play, and one trick that also works for kernel methods is the following: imagine you have one input dimension, here is your input x and here your output y, and you need the b because the line does not pass through the origin. Now, if you center both your x's and your y's, that is, you subtract the mean from the x's and from the y's, you shift the whole data cloud, let's use some colors, you shift the whole data here, and suddenly you just need to fit a line through the origin, you don't need the b anymore. That's how you can get rid of the b. There are many other ways, but this is one that also works for kernels. Okay, so we have gotten rid of the b. Good. Now, how do we get the beta? Well, learning algorithms are usually optimization problems: you define your predictor, like here, and then some optimization problem for how to find the beta; you solve that using the data, and that's it. So let's see how that works here. We want to find the best beta, let's call it beta star, out of all possible betas, so argmin over all possible betas. And what are we minimizing? Ideally, what we want to minimize is the error on future data, so data coming from the same distribution as the training data that we have observed, but we can't, because in practice you don't know which distribution generates your data, and you only have a finite data set. So instead we minimize the error on the data that we have, which is called empirical risk minimization. What that is, is a sum over the n training points: for those we know the label y_i, and we subtract our prediction f hat of x_i; that's the squared error. And if I put a one over n and a square root here, we have the root mean squared error. Now, I would like to very briefly digress and say: if you're a machine learner, you can ignore the square root and the one over n, because they change the value at the optimum but not its position, so you will still get the same beta; of course, the value of the error changes, but the beta is the same, so you can remove them. If you're a physicist, you maybe shouldn't, because your y's have units, and if you don't take the square root, your units change, and maybe that's not so great. So remember to sometimes also think of units here. Okay, but for us it doesn't matter.
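For reference, the setup described so far can be written compactly like this (the notation is my transcription of the board work):

    % Linear least-squares setup as described above.
    y_i = f(x_i) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0,\sigma^2)\ \text{i.i.d.}
    \hat{f}(z) = \beta^{\top} z \quad \text{(offset } b \text{ removed by centering } X \text{ and } y\text{)}
    \beta^{*} = \operatorname*{arg\,min}_{\beta} \sum_{i=1}^{n} \bigl(y_i - \beta^{\top} x_i\bigr)^{2}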
We will drop the square root and the one over n and just look at this term here. Okay, how do you solve this? I'll go through it briefly, but not in full. First, observe that we can write f hat of x_i as the sum over j of beta_j times the j-th component of x_i, but this is nothing else than beta transpose x_i, the inner product between beta and x_i. So let's insert this, leave out the square root and so on, and skip the argument for brevity; what you're left with is the sum over i from one to n of (y_i minus beta transpose x_i) squared. So far so good. Now let's rewrite that in matrix notation. We have a sum from one to n, and inside we have the i-th entry of y minus the i-th entry of something else, squared, so this looks like an inner product of some quantity with itself. And indeed, if you look at it a bit longer, you will see that it is simply the inner product of y minus X beta with itself. Do the math in a quiet moment if you don't see it immediately, and you will see that these two last lines are the same. Okay, so far so good, what next? Now you multiply this out: you get y transpose y, minus two y transpose X beta, because the two middle terms are the same, that's where the two comes from, plus beta transpose X transpose X beta, since minus times minus is plus. How do we minimize this? We notice that this is essentially a quadratic function, so we take the derivative, or rather the gradient, with respect to beta and set it to zero. It's like a parabola: you set the derivative to zero, and that's where the optimal, in this case minimal, value is. How do we do that? You notice, for example, that y transpose y doesn't contain a beta, so that term goes away. Then you need to either remember or look up the rules for taking derivatives of these linear algebra expressions. If you don't know them, look up the Matrix Cookbook by Petersen and Pedersen on Google, you will find a lot of hits; it's a very nice collection of formulas, and it also has the formulas for the derivatives we need here. I've put them on the right-hand side in case you're not yet familiar with them. Let's first do the simple case: the derivative of a transpose x with respect to x, where x is the variable, is just a. If we apply this to the middle term, we get minus two, wait, what have I done here, X transpose y, yes, because of the transpose in the formula, I think that's right. The other formula is: if you have a matrix B in between, then the derivative of x transpose B x with respect to x is (B plus B transpose) x, in this notation. If we apply this rule to the last term, we get X transpose X plus the transpose of that, and the transpose of X transpose X is again X transpose X, so we just get two times X transpose X beta, and the whole gradient set to zero. And now we're almost there.
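Written out, the expansion and the gradient just computed read:

    % Expansion and gradient of the least-squares objective, as derived above.
    \sum_{i=1}^{n}\bigl(y_i - \beta^{\top}x_i\bigr)^2
      = (y - X\beta)^{\top}(y - X\beta)
      = y^{\top}y - 2\,y^{\top}X\beta + \beta^{\top}X^{\top}X\beta
    \nabla_{\beta}\!\left[\,y^{\top}y - 2\,y^{\top}X\beta + \beta^{\top}X^{\top}X\beta\,\right]
      = -2\,X^{\top}y + 2\,X^{\top}X\beta \;\stackrel{!}{=}\; 0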
Right, we divide by two, that goes away, we bring everything with beta to the left-hand side and everything without beta to the right-hand side, so X transpose X beta equals X transpose y, and now we multiply by the inverse of X transpose X and we are done. Our solution, the coefficient vector beta for linear least squares regression, is beta equals (X transpose X) inverse times X transpose y, if the inverse exists. Okay, so that was, roughly, how you derive linear least squares regression. Now that we have this, what we will do next is take a look at how you turn it into the kernel version. Of course this is a bit quick, so just enjoy the show, and later you can either look it up in a textbook and reproduce the individual lines, or just do it yourself. So how does the kernel version look? Here is the gist of it. We saw a linear algorithm: we wrote down our model, just a coefficient vector times the input vector, we said we want to minimize the least squares error, we rewrote the equations a little, and found that this is the solution. Now we want to do the same thing, but implicitly, in a kernel feature space. So let's look at this. The situation is: we have our original input space, let's call it calligraphic X. By the way, it doesn't have to be vectorial; a nice thing about kernel methods is that you can have non-vectorial input spaces, for example you can define kernels on graphs, on shapes if you want, on strings, whatever. We'll come back to that. A point x here is transformed, using some map phi, into a new, larger space, let's call it F, the kernel feature space, so to the point phi(x). This is the situation, and we know that we have this kernel function k, where k(x, z) is the same as the inner product of phi(x) and phi(z) in the transformed space. Okay, first we'll take the exact same model and try to do it in the kernel feature space. Let's write down what our model is: our predictor f hat of a variable z used to be beta transpose z, but now we want this to happen in the kernel feature space, so it becomes f hat of phi(z), with a new coefficient vector, let's call it beta prime, transposed, times phi(z). This is conceptually what we would like to do; the beta prime now lives in F, not in X anymore. This is the intuition of what we want to achieve, so let's write it down: f hat of z, as before, is explicitly a sum over the dimensions, but now over the dimensions of the kernel feature space, so a different d, let's call it d prime, of beta prime_i times the i-th component of phi(z). This is the same equation as before, just written in the kernel feature space. We can't evaluate it explicitly, because we don't have access to phi, we just have access to k. But what we can do is rewrite it; I hope I'm not too brief now, let's see. We keep our sum, but instead of beta prime_i I now want to write the sum over j from one to n of alpha_j times the i-th component of phi(x_j). So all I did was take these betas and claim that I can write them like this, with some new coefficients alpha_1 to alpha_n. Now, this is maybe not immediately obvious. So, Patrick, how long do I have here? Until twenty past? Twenty past six. Okay, all right, sorry about that.
Okay, then let's take a quick look at why you can write this; maybe it's better to do this one step thoroughly, and then the rest will go faster. Let me take a different color for this. If you look at the linear case, you'll see that the sum over i from one to d of beta_i z_i, the original linear model, can be written as the same sum with the beta_i replaced by a sum over j of alpha_j times the i-th component of x_j; this is the same thing as on the left-hand side, but in the original input space. Here I think this is plausible, because let us assume that we have more training data than dimensions. If you have d dimensions and n input data points, then you can just use your input data as, well, not an orthogonal basis, but some basis, some expansion vectors, and write your beta in terms of these. And there is a reason why you can do this in the kernel feature space as well, but I don't want to go into it right now, because that's orthogonal to what we're doing; maybe this is a little bit of motivation for why rewriting the beta prime_i in this form is possible, and if you really want to know, you can ask me later. So, we can do this. Now let's rewrite this equation a little, and then we will see the kernel form. We pull the sum over the alphas out first, and what is left in brackets is the sum over i of the i-th component of phi(x_j) times the i-th component of phi(z). All I've done is pull the sum over the alphas to the front of the expression. Now look at this: what is this here? Can someone see or say what it is? You'll have to speak up really loudly, otherwise I won't hear you. Okay, maybe it's a bit hard; I wasn't sure whether someone said the right thing, because I can't hear you well. If you look at it, the sum goes over the components: you take the i-th component of the vector phi(x_j) and of the vector phi(z), multiply them, and sum up. That's an inner product, right. So it's the inner product between phi(x_j) and phi(z). And now we're almost there, because we know that the kernel function is a function that computes exactly this inner product. We don't have a specific kernel function here, but we know such a function exists. That means I can write this as the sum over j from one to the number of training data of a coefficient alpha_j times the kernel function evaluated between the j-th training point and the new point z, and that's it. If you remember, we started out with f hat of z, our prediction for a new point z. Let's just compare to the linear one to recap what we did. This was the key point: we wanted the same type of model, an inner product of a coefficient vector with a point z, but we wanted to do it in the kernel feature space, and what we derived is that this expression, coefficient vector beta prime in the kernel feature space times phi(z), can be written as a sum over some other coefficients alpha_j times the kernel function between the training data and the new point z. So that's the form of our predictions; all, or almost all, kernel models have this form.
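Written out compactly, the rewriting just performed is:

    % From the feature-space linear model to the kernel form, as derived above.
    \hat{f}(z) = \beta'^{\top}\phi(z), \qquad
    \beta' = \sum_{j=1}^{n} \alpha_j\,\phi(x_j)
    \;\;\Longrightarrow\;\;
    \hat{f}(z) = \sum_{j=1}^{n} \alpha_j\,\langle \phi(x_j), \phi(z)\rangle
               = \sum_{j=1}^{n} \alpha_j\,k(x_j, z)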
Right, they are a sum over the training data: you have the kernel function evaluated between a training point and the point you want to predict, weighted by some coefficient alpha. Now, in the interest of time, I'll skip the rest. What we need to do next is find the alphas. The best alpha is, again, the argmin over all alphas: we minimize the error, but of this model, not of the linear one, so that's a sum over the training data of (label minus prediction) squared, but now with the new prediction. And you can solve this, I will not do it now because of time, but you can solve it in the exact same way: you rewrite it as an inner product, you take the gradient with respect to alpha, you set it to zero and solve for alpha; the sequence of steps is exactly the same. What you get is alpha equals K inverse y, where K is my shorthand for the kernel matrix, the matrix that contains all the pairwise evaluations of the kernel function on the training data. If the inverse exists, this is the solution you get for alpha. Very brief, of course, but I hope the idea came across. What that means is: if you have your training data, the x_i's and y_i's, and you have a kernel function k, you compute this matrix capital K, invert it, multiply by the y, and that gives you your coefficient vector alpha; then you can predict new points by calculating your f hat of z here, this one. So that's your kernel regression, or kernel least squares regression. Let's switch back to the slides; I have to be brief now. There are various other aspects. For example, you can add a ridge: you do the same thing as before, but your optimization problem not only minimizes the error, it also minimizes the norm of the coefficient vector, weighted by some lambda. This makes the model smoother; maybe I can even show you why in a moment. And the solution looks pretty much the same in the linear case, except that you add this lambda to the diagonal of X transpose X. If you do the kernel version, you get a similar thing; it's a choice, but the usual choice people make is to define the regularizer in terms of your alpha coefficient vector, and if you solve again, you add something to the diagonal of your kernel matrix and then take the inverse. Adding something to the diagonal of the kernel matrix, well, it's a bit of a circular explanation here, but it makes the model smoother; we will hopefully still see why. So that's a comparison of the two models: the linear case on the left-hand side and the kernel case on the right-hand side, and you see that instead of explicitly weighting each dimension, you now have this sum over the training data. So what does this equation mean, what are we doing when we predict something with a kernel model? It turns out there's a nice interpretation. Imagine you have one input dimension here, your x, these are your outputs y, you have your training data, the orange points, and you want to learn the blue function. And now imagine you use a kernel that looks like a Gaussian function.
Well, then this sum here is nothing but a basis set expansion: you have a basis function, the kernel k, and a weight, and on each training point x_i you place one of these Gaussian functions. You multiply each by its α, some negative, some positive, some large, some small, and if you add them all up, you get the dashed line, which is roughly the blue line you wanted to learn. So you can view these models as a kind of basis set expansion of the prediction, or of the model itself. Okay, I'll skip this. I'll skip this too. I also skip this, but I want to say something at least. What I wanted to show here is: we have now seen one algorithm, linear least-squares regression, transferred from the linear case to the kernel case. Kernel methods were invented in the mid 90s by Vapnik and co-workers; they did support vector machines. They didn't realize their approach was generic, they just did this one algorithm, for classification. Later, Bernhard Schölkopf, Alex Smola and Klaus-Robert Müller used the same idea to derive kernel principal component analysis, and then people saw that this is a general approach, and basically every linear algorithm that people could find was kernelized.

And I wanted to show one more example here. Remember I mentioned the centering before: you center your inputs x and your labels y. For the kernel algorithm, the labels you can still center as before, by subtracting the mean of the y. But the inputs x_i you now have to center in the kernel feature space, not in the original input space, and you can do this. I can't show you the full solution right now. Maybe I can... no, I cannot. But you follow the same ansatz: first you write down what you do in the original input space, which is just subtracting the mean, (1/n) Σ_i x_i. You can do the same in the kernel feature space, and the way you do it is: given a kernel k, you figure out how to transform that kernel so that it behaves like the original kernel but with the centering done in the kernel feature space. Maybe from this explanation it's not clear how to do it, but at least know that you can do the centering in kernel feature space; it's just another example of an algorithm and how you can kernelize it. Let's see... I think we don't have the time to do that.

Okay, we talked about algorithms, now let's have at least a look at some kernels. What do kernels look like? Let's skip this here. The simplest kernel you can use is the linear kernel; the linear kernel is like the identity, nothing happens. If you take the linear kernel, which is just the inner product in the original input space, then your transformation φ is the identity and you get the original algorithm back. That's the statement: if you take the kernel ridge regression from before, insert the linear kernel and rearrange the terms, you get the original linear ridge regression back. Exactly. What you see here is how the linear functions look, they are just lines, and let's see it for the next one. This is a very often used kernel where we see a bit more: it is called the Gaussian kernel. On the left-hand side you see what the kernel function looks like: Gaussians with a length scale σ.
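The centering trick can be stated compactly. Here is a minimal sketch, assuming the standard centering identity used for instance in kernel PCA; only the kernel matrix is needed, φ is never computed:

```python
import numpy as np

def center_kernel_matrix(K):
    """Center the data in kernel feature space without computing phi explicitly.

    If phi_c(x) = phi(x) - (1/n) * sum_i phi(x_i), the corresponding kernel
    matrix is K_c = K - 1n K - K 1n + 1n K 1n, where 1n is the n x n matrix
    whose entries are all 1/n.
    """
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    return K - one_n @ K - K @ one_n + one_n @ K @ one_n
```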
And so you see: you take the distance between the two points in the original input space, you have a length scale σ, and you take the negative exponential, k(x, z) = exp(-||x - z||² / (2σ²)). The functions you get from a stochastic process that is governed by this kernel function are these functions here, and you see that they are very smooth functions, because the Gaussian is very smooth. This is an interesting kernel, because it is possible to show that it corresponds to an infinite-dimensional kernel feature space. This kernel works well for many problems, but it is also not the best kernel for many problems. Why? Because this kernel is a kind of universal local approximator, in the following sense. You see that we have this distance in the exponent. Think about what happens if the two points x and z in the original input space are very close: the distance is zero, and the kernel takes on the value one in the limit. But if the points are very far apart, the distance becomes very large and the kernel goes to zero. Remember that the kernel corresponds to an inner product, and if an inner product is zero, the input vectors are orthogonal. So this kernel maps far-apart inputs in the original input space into orthogonal directions in the kernel feature space. That means you can fit whatever you want, but the downside is, of course, that you need a lot of data. Okay, that's maybe as much as I can say in a nutshell. This here is almost the same kernel; it is usually called the Laplacian kernel, as I called it here: basically the same thing, but it uses the one-norm. So instead of the squared distance you get a cusp, and you can model less smooth functions with this kernel. There are tons of other kernels, literally a lot of them; I will not go into the details here. This is an example of a graph kernel, where you define a kernel not on vector input data but on graphs directly as inputs; the kernel is then defined algorithmically. There are kernels on graphs, on sequences (think for example of DNA sequences), on text, on strings, on point clouds, whatever you want. Okay.

Now, what I still want to say is: do we know what these kernel functions are? The answer is yes; a nice thing about kernel learning is that there is a lot of theory. If you characterize these functions, it turns out the kernel functions are exactly the symmetric positive semi-definite functions. Let me skip the formal definition here, but you might realize, if you look at it later, that these are the same as the covariance functions, which is interesting. And indeed, if you want, you can think of the kernel as encoding how the inputs vary together, a covariance between the inputs. The point here is that there is an exact mathematical characterization of these functions. That means, if you have a new function, you can check whether it is a kernel or not, which is good. And if it is a kernel, all the theory applies to it. That's a nice thing. Because of this structure, you can use any kernel learning algorithm, so kernel ridge regression, kernel principal component analysis, kernel Fisher discriminant, kernel support vector machines, the whole long list, together with every kernel. As long as your function is a kernel function, you can use it with any of these algorithms, which is nice.
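For reference, the three kernels just mentioned might look like this in code (the exact scaling conventions, e.g. 2σ² versus σ, differ between texts, so take the constants as one possible choice):

```python
import numpy as np

def linear_kernel(x, z):
    """Inner product in the original input space; phi is the identity."""
    return np.dot(x, z)

def gaussian_kernel(x, z, sigma=1.0):
    """Squared two-norm distance in the exponent; sampled functions are very smooth."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def laplacian_kernel(x, z, sigma=1.0):
    """One-norm distance instead; has a cusp at x = z, so it allows rougher functions."""
    return np.exp(-np.sum(np.abs(x - z)) / sigma)
```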
So every kernel stands for a certain type of nonlinearity, and you can use every one of the kernelized algorithms with any of the nonlinearities for which you have a kernel. You can combine them whichever way you need. Okay, let's skip this. Oh, sorry, one thing I do want to say. A kernel function is positive semi-definite if, well, there are various characterizations, and I don't want to go through them all because of time, but one characterization is that all the eigenvalues of every kernel matrix you can form from the kernel function are non-negative. And if you remember, maybe not, we had this regularizer, which added a constant to the diagonal. That not only improves the conditioning of the matrix inversion from before, it also helps in another way. For example, if you have a function that is almost a kernel function, in the sense that the matrices you can form from it sometimes have small negative eigenvalues, well, if you add a small constant to the diagonal, you shift the eigenvalues up by that constant, and voilà, your matrix is positive semi-definite. As long as you are only slightly off from positive semi-definiteness, you can correct it this way. But maybe this is a minor point and I shouldn't have brought it up.

The important part is that you can characterize these functions, and the other important part is that you can build new kernels from old ones, like building blocks. Formally, the set of kernel functions is a closed convex cone, and that means: if you multiply a kernel by a non-negative scalar, it is again a kernel; if you add two kernels, it is a kernel; if you take a pointwise product, it is a kernel; if you take a tensor or outer product (that is what this thing here is), it is again a kernel; and so on. So you have certain operations under which the class of kernel functions is closed, and you can combine these kernels into new kernels. For example, to make this a little more concrete: say you have a linear trend in your data, but it also oscillates. Then maybe you can combine a linear kernel with an oscillatory kernel to model your data. I'll skip the representer theorem, and I'll skip reproducing kernel Hilbert spaces; basically, you can prove that if you have a symmetric positive semi-definite function, you can always construct a kernel feature space that corresponds to it, and vice versa. Unfortunately we don't have time for that.

This I still want to say: I said that kernel learning has a lot of theory, and it's true. For example, with a neural network you never know whether you have found a good optimum for your solution. For the kernel learning algorithms, finding the parameters, like the α from before, is a convex problem, and you always find the global optimum. That sounds nice, but there are also free parameters, like the length scale in that Gaussian kernel; we call them hyperparameters. What about those? It turns out that for those you don't have such a nice theory. That is a non-convex global optimization problem. Well, go figure. So you either have to find good heuristics for your free parameters, or do a systematic search like grid search if you only have a few of them, or maximize the likelihood of the model, or try to do gradient descent somehow, but you cannot get around the fact that it is a non-convex optimization problem. So kernel methods also have this problem, just in a reduced form. That is maybe good to know.
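A small sketch of both points, closure under sums, scalings and products, and the eigenvalue check, might look like this (the oscillatory kernel below, cos(ω·(x − z)), is one simple valid stationary kernel chosen for illustration; it is not a kernel from the slides):

```python
import numpy as np

def combined_kernel(x, z, period=1.0):
    """Linear kernel plus an oscillatory kernel.

    Sums (and non-negative scalings, pointwise products) of kernels are again
    kernels, so this combination is a valid kernel, e.g. for data with a linear
    trend plus oscillations. cos(w . (x - z)) is itself a valid stationary kernel.
    """
    return np.dot(x, z) + np.cos(2.0 * np.pi * np.sum(x - z) / period)

def looks_positive_semidefinite(K, tol=1e-10):
    """Eigenvalue characterization, checked on one concrete kernel matrix."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))
```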
I will not go into the details of hyperparameter optimization; if you want to do it well and robustly, it is actually not so easy. But if you can get away with a grid search, it is usually fine. Okay, anyway, if you want a mental picture, this is about it, the dirty corner of hyperparameter optimization. This is, I think, the cockpit of a Boeing 747; it probably has about a thousand knobs and dials. That is a proper image for hyperparameter optimization. But there has been some progress in AutoML and such, so maybe today it looks more like this. A little bit better, but still, you have to somehow choose good values for your free parameters.

I wanted to cover a few specialties of atomistic machine learning potentials, but I can't, so let me just mention them. What do you do if you don't have labels for everything, say you want to predict atomic energy contributions but you only have a label for the sum of certain atomic energy contributions? It turns out you can still do it; I'll come back to the literature. Then, derivatives, so important: if you do DFT, you also get the forces, and you want to learn with the forces because they contain a lot of valuable information. Let me skip one slide. Look at this: here are some functions from a Gaussian process prior, whatever that means; basically, these are your possible models. We have training data, these crosses, and only the functions that go through them are still valid. But if we have derivatives, we have so much more information, because we can not only say "go through this point", we can say "go through this point with this slope". So if you get the derivatives for free, you need a lot less data, in some sense. And you can incorporate that into kernel models, because the derivative of a kernel regression model is still a kernel regression model. Without going into the details, you can compute kernel values between energy and energy, energy and force, and force and force, and you get a block-wise kernel matrix structure. You can train with derivatives, and you can take the derivative of your model; that's the point. Let me skip that, and that as well. And no dimensionality reduction today, maybe as an outlook, or maybe not.

But I want to say one thing. We have now seen a glimpse of kernel methods, and we saw a linear regression model; and we know there are neural networks, so when do we choose which method? There are two things I want to mention. One: if you really have very few data, maybe consult a statistician. But if you have on the order of tens of thousands of data points, kernel methods are a good choice. They scale cubically in the number of training data, because you have to do this matrix inversion, and linearly in the number of training data when you predict something. And you need to either store the kernel matrix or compute it on the fly; you have to do one of these two things. That means, if you invert a matrix that is 10,000 by 10,000, you hardly have to wait, provided you have efficient code. If you want to invert a kernel matrix that is 40,000 by 40,000, okay, maybe wait a little bit, but still okay. If it is much larger, you will have to wait, or have large computers, or both.
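For the grid-search option, a minimal self-contained sketch could look like the following (the grid values, the Gaussian kernel choice, and the use of a held-out validation set with mean absolute error are all illustrative assumptions, not prescriptions from the talk):

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    """Pairwise Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def grid_search(X_tr, y_tr, X_val, y_val,
                sigmas=(0.1, 1.0, 10.0), lambdas=(1e-8, 1e-4, 1e-1)):
    """Pick the (sigma, lambda) pair with the lowest validation error."""
    best = (None, None, np.inf)
    for sigma in sigmas:
        K = gaussian_gram(X_tr, X_tr, sigma)        # train/train kernel matrix
        K_val = gaussian_gram(X_val, X_tr, sigma)   # validation/train kernel matrix
        for lam in lambdas:
            alpha = np.linalg.solve(K + lam * np.eye(len(X_tr)), y_tr)
            mae = np.mean(np.abs(K_val @ alpha - y_val))
            if mae < best[2]:
                best = (sigma, lam, mae)
    return best  # (best sigma, best lambda, validation MAE)
```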
So that seems to suggest that for the small to medium data range, kernel methods are a good choice, and if you have really big data, maybe use a neural network. But, and that is the other point I want to make, if you want to, you can fix most of the disadvantages of the different methods. For kernel learning, for example, if your kernel matrix gets too large, you can do a reduced-rank approximation of the kernel matrix (see the short sketch below), and then you can do your 100,000 points, no problem. Okay, that is maybe what I can say in the brevity of time; let's skip that.

Let's summarize. What did we see? We saw the kernel trick: implicitly running algorithms in a high-dimensional space by using kernel functions that correspond to inner products in that space. We saw how to do that for regression. We saw three standard default kernels, and we discussed, very briefly, a few further aspects. And I want to mention that you can do a lot more with kernel methods than I showed today, naturally. I mentioned structured inputs; you can also do structured output learning, predict tensors, predict graphs if you want. We mentioned the reduced-rank approximation, and there is so much more theory. There is a lot more than this. Okay, with that I want to conclude and hand over to Patrick.

Okay, thank you, Matthias. And I think the hyperparameters will come back at you tomorrow in the tutorial, so you get to fly one of those planes. Questions?

Yeah, thank you, very inspiring talk. I have one very general question, related also to the first half of the lecture. It is quite natural that in a high-dimensional space the data is very sparse; there was, for example, the example of those ultra-fast potentials and SNAP potentials. So naturally, with a smaller amount of data, the complex model struggled, while the simpler model's error came down faster, because the coverage is already quite general. So my question is a bit related to that: how can these very simple methods work so well, even though our problems are often very complex, or at least we think they are very complex? And maybe as a continuation: is there really something underlying this, a bit like "simpler is more beautiful"? Do you have any thoughts about this?

Well, can you make your question more specific, perhaps?

More like: why can these simple methods, like kernel-based methods, which are basically just linear regression in a kernel space, work so well?

Okay, yes, I understand. Sure. Let me repeat your question to see if I got it right. You are saying we have a relatively simple model, like linear regression, we just do it now in this kernel feature space, but it is still linear regression there, so a simple model, and yet we can fit complicated phenomena. Well, it depends on the number of features. The more features you have, and of course they cannot all be the same, but even if they are correlated, the more features you add, the more coefficients you add, and the more flexibility you have in your model. That is why you can fit many things with a simple linear model: because you have a lot of features.
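Picking up the reduced-rank approximation mentioned above: the talk does not say which scheme is meant, but one common variant is a Nyström / subset-of-regressors style fit, sketched below under that assumption. Here kernel_gram is any function returning pairwise kernel evaluations (for instance the Gaussian one sketched earlier), and the landmark count is an arbitrary illustrative choice:

```python
import numpy as np

def fit_reduced_rank(X, y, kernel_gram, n_landmarks=200, lam=1e-8, seed=0):
    """Reduced-rank kernel ridge fit with m landmark points (m << n).

    The model is restricted to f(z) = sum_j c_j k(z, landmark_j), so only an
    m x m system has to be solved instead of the full n x n kernel matrix:
    (K_mn K_nm + lam * K_mm) c = K_mn y.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n_landmarks, len(X)), replace=False)
    X_m = X[idx]                        # landmark points, here a random subset
    K_nm = kernel_gram(X, X_m)          # n x m
    K_mm = kernel_gram(X_m, X_m)        # m x m
    c = np.linalg.solve(K_nm.T @ K_nm + lam * K_mm, K_nm.T @ y)
    return X_m, c                       # predict via kernel_gram(Z, X_m) @ c
```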
Thanks for the talk. On the slide where you mentioned the radial basis function, you spoke about something, and I don't remember it exactly right now, but it is clear that it is trying to measure how close two vectors are in the feature space, and you mentioned this orange curve. Can you just repeat that, and then I can ask the question I have in mind?

Sure. What we see there is the example of the Gaussian kernel, as I wrote it in the first row. On the left-hand side is the shape of this function for different values of the length-scale parameter σ: you can make your Gaussians very sharp with a short length scale, or very broad with a large length scale. What we see on the right-hand side is a Gaussian process, a stochastic process, that has this kernel as its covariance function. It is like a random variable that you can draw functions from, and those functions are governed by this kernel function. I didn't explain this properly because of time, but roughly, the three functions on the right-hand side are three random functions drawn from a Gaussian process which has this kernel as its covariance function. And that means, basically, that you can use this Gaussian kernel to model smooth functions. If you have non-smooth functions, jagged functions, the Gaussian kernel is not such a great choice, because it is naturally very smooth.

Okay, I think my question was slightly different. Imagine you have these x and z and they are fairly low-dimensional; in low dimensions, closeness of two vectors makes sense. But if the vectors x and z are very, very high-dimensional, for example the many-body tensor representation of a molecule, which can very quickly go into high dimensions, then in high dimensions almost every vector is equally distant from every other vector. Do you have any idea when these models break down, or how you would know when such a model would break down?

Okay, let me see, off the top of my head, if I can say something about this. Usually the original input space where the x and z come from will maybe not be that high-dimensional, but let's assume it is. And you are right: if I remember correctly, if you have a high-dimensional vector and the components are random, then basically all the vectors lie on a sphere or something like that, and they are all roughly equidistant. But practically there will still be signal. Imagine you have two strong input feature signals that are relevant for your y, and then you have additional random components. Your learning will degrade, maybe, the more random components you add, or your signal will be weaker in some sense, but it will still be there. So I think it will still work; maybe it will degrade gradually. I'm not sure, but that is how I would think about it right now. So yes, even if you have high-dimensional input vectors, the distances between them are not all exactly the same; there will still be some signal left. And maybe many of the components will be zero, and if that is the case, it is effectively a low-dimensional vector. So I think in practice it is not that much of a problem. You could maybe think of a cosine distance or something like that, but I think it is usually not necessary.

And since we are a bit over time, it is a good time to stop now.
If you do still have questions, I think Matthias would be happy to answer them one on one, and you can also write him an email if you want. So let's thank Matthias again. Okay, we are over time. We meet tomorrow at nine, here, so try to be here at ten to nine. Also important: remember to sign out, sign on the paper, we need that. Thanks.