the links to the Google Colabs on the website. And if you need any help, reach out to these three students. Without further ado, the floor is yours, Milica. Thank you. Do I have to hold it really close like this the whole time? OK, everybody, welcome. No, it's fine. Welcome to this second tutorial. My name is Milica Todorović from the University of Turku in Finland, and I'm one of the organizers of this workshop. It's really nice to see you all here and enjoying the program. Thank you for coming in person, and thanks also to our invited speakers. But most of all, thank you to Claudio, Kevin, and Stefano, who are doing a fantastic job hosting this event. So thank you very much, again, to our local organizers. All right, let's get going with the second tutorial. This is a direct follow-up on the lecture Matthias Rupp gave yesterday. He kindly introduced the kernel methods, especially kernel ridge regression and Gaussian process regression, and now we will cover the implementation in a very simple form, probably the simplest possible. One disclaimer immediately: I know some of you are already very experienced, and this might be too basic for you. But we've tried to write down every single step, go through them, and explain them, so I hope it will also be useful to you and to the many people following us on Zoom. So let's get going; it's easier to just use the computer. We have a little over an hour. We will go through a few introductory slides, just summarizing the lecture Matthias gave yesterday. Then we will do two different notebooks: one to learn how kernel ridge regression is implemented in practice, and one about Gaussian process regression. We can take a little break if anybody needs to go to the bathroom, and then we will move on to active learning with Gaussian process regression, and how the properties of Gaussian processes can be exploited to curate data sets and to optimize materials properties. I also want to introduce Jarno, Kunal, and Manuel, who are sitting here; Kunal is over there. They are monitoring the Zoom channels and also you. So if anything goes wrong during this tutorial, if you can't execute something or something is not working for you, please raise your hand and one of them will come and sort you out. Right, let's begin with something very basic. All machine learning workflows start with a data set, and I assume that many of you already have your own data sets that you love and despair over. The next step is to encode them in some kind of representation, a structure that presents them to the machine learning method, and Anton kindly reviewed quite a few representations this morning. Then we get to the machine learning method, and that is the topic of this tutorial. There are, of course, many methods, and we will cover more tomorrow, for example neural networks, but today it's kernel methods. I will also briefly discuss how to do a quality check on these implementations and how you can improve them. The field of machine learning for materials science is very broad, and many different methods are commonly used. Unsupervised learning doesn't tend to be covered enough in these kinds of meetings, but it's used enormously often in combination with supervised learning, and of course you can apply unsupervised learning to labeled and unlabeled data sets.
By labeled, I mean that your data entry has an (x, y) structure, where you're trying to map x to y. The x is often your data encoded in the representation, and y is what we refer to as the label. If you have an (x, y) type of data set, you can do supervised learning. If you don't have the y, you don't have the label, and then you can only apply unsupervised learning to that data set. Nevertheless, unsupervised learning is often used on both labeled and unlabeled data sets to find structure in the data and learn more about your data set. I also wanted to mention active learning, because I will talk about it later. With active learning, you don't have to compute a data set beforehand; you can compute it on the fly, and the AI tells you how to compute it, directing you in the assembly of the data set. That's what I will come back to later. Within supervised learning, which is the topic of today's tutorial, depending on the data type, whether the label is a category or a continuous variable, we talk about classification or regression. So I just wanted to set the scene with this very basic introduction; what we'll talk about today is mostly supervised regression, this part of the chart. So let's start immediately with supervised regression. This is very much what Matthias was talking about yesterday, in a lot of detail. Let's say we have a straight-line fit problem: a very simple data set in (x, y), shown in red, and we want to fit a line. Let's say that the straight-line fit is our optimal model. Of course, no data point sits exactly on this line, so there's a green segment that represents the error. For example, for this point all the way up on top, the prediction from the model would be here, on the blue line, but the real data point is a little bit off, and that's what the green segment stands for. So let's pose the problem. The common way to pose it in general terms is to express y as a function of x and some parameters beta, which characterize the line fit. Then each red point y can be represented by the straight line with some parameters beta, but to get to the actual red point we also have to add the error term in green. That's the very general way to pose this problem. To go about solving it, we keep the errors on one side and put the function on the other side with a minus sign, then we square the error and sum it over all i, which are all the entries in your data set. Now we have the sum of squared errors on one side, and on the other side each data point y minus the prediction, which is a function of x and of beta, the parameters of the line. So your error now depends on the beta parameters. What we're trying to do then, and literally every supervised regression method does the same thing, is to minimize this error; that's the ultimate objective. We're looking for the beta arguments, beta star, that minimize this sum, which means they minimize this expression here. There are many different ways of obtaining beta star, and there are many different types of beta parameters in different supervised regression models. In the case of a straight line, of course, beta has two parameters, a slope and an intercept. That's the simplest thing.
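To make that concrete, here is a minimal restatement of the least-squares setup just described, in equations; the notation is mine, following the description of the slide:

```latex
% Each data point is the model prediction plus an error term
y_i = f(x_i; \beta) + \epsilon_i, \qquad i = 1, \dots, N
% The sum of squared errors depends on the parameters beta
E(\beta) = \sum_{i=1}^{N} \epsilon_i^2
        = \sum_{i=1}^{N} \bigl( y_i - f(x_i; \beta) \bigr)^2
% The optimal parameters minimize this error
\beta^{*} = \arg\min_{\beta} E(\beta)
% For a straight-line fit, beta is just a slope and an intercept
f(x; \beta) = \beta_1 x + \beta_0
```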
But when you talk about neural networks, there are a great many parameters in beta. OK, and then let's say we've done this: we've employed some algorithm, we computed beta star, we plug it back into this formula here, we compute the prediction, and that's the end. That's supervised regression in a nutshell, and Matthias explained it much better and in much more detail yesterday. OK, but let's move on to how this is actually done in practice. As you may know, you take the total data set and divide it into a training set and a test set; each one has (x, y) entries, so you have to have that y label. Then you take only the training set and train your machine learning model until it's trained, and that is the step that fits the beta parameters. When you've got your beta star, which is the optimal solution, you do a prediction step on the test set: you take the x's from the test set, plug them into the model, and obtain the y predictions. It's really important that your trained model has never seen the data in the test set. You probably know that; I'm just stressing it. Then the last part is the quality check: you compare your predicted labels with the original labels, and that determines the accuracy of your method so far. And of course you think that's the end of the story, but it never is; you go back and modify your model, modify your representation, and you do this cycle many times, probably, until you get very good results. But it's really important to implement this once and get it working. All right, let's go to kernel regression. As Matthias pointed out, the important thing about kernel regression is that, counter-intuitively, it transforms the problem from a lower dimension into a higher dimension, where it's easier to model the data. This comes directly from Matthias's paper: if you look at the data in 1D, it's inseparable, but if you project it into 2D, it becomes separable and the solution is easy to find. This tutorial paper by Matthias Rupp has probably launched hundreds of PhDs so far. In our group, we started by following this tutorial, and it was so well explained; if you're new to the field, please look it up. In practice, kernel regression models are written like this: the prediction of the label is expanded as a series of kernels multiplied by expansion coefficients, alpha. The kernel describes the covariance of your data point with every single other data point in the data set; that's what the i stands for, and the sum runs over every single data point in your data set. So you write it down and it seems reasonable: this is the property that you have, and these are all the data that you have. Then you have to make a choice for the kernel. There are many choices; I write down a few here, but you will find many more in the literature. Let's look at the Gaussian kernel in a little more detail. Here is your data point and here is every other data point in the data set, and it's really just a classic Gaussian expression: typically the representation enters on top, and you have this sigma, the width of the Gaussian, in the bottom. Once you choose and define the kernel, you can plug it back in there, but we still don't know the alphas. If we knew the alphas and we knew the kernel, we could compute our predictions very easily.
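For reference, here is a minimal sketch of that kernel expansion and of the Gaussian kernel in equations; again the notation is mine, following the description above:

```latex
% Kernel regression prediction: an expansion over all N data points
\hat{y}(x) = \sum_{i=1}^{N} \alpha_i \, k(x, x_i)
% Gaussian kernel: the representations enter in the numerator,
% the kernel width sigma in the denominator
k(x, x_i) = \exp\!\left( -\frac{\lVert x - x_i \rVert^2}{2\sigma^2} \right)
```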
One other thing we have to think about is that a hyperparameter comes in with the kernel. So now the beta-star task, finding beta star, is really finding the alphas that minimize the machine learning error, and we come back to the very generic approach to solving supervised regression. Here is our expression for the squared error, and we're looking for the alphas that minimize the entire expression: this is the reference label, and this is the predicted one. And then we must not forget that there is a regularization term here. It exists because, if you just try to minimize the error, it can happen that your model overfits: the coefficients get adjusted so that the model goes exactly through the data points, but the model becomes overly complicated and is not transferable. You will know it's overfitting if it predicts really well on your training set, but when you apply it to the test set, it's rubbish. That is the classic sign of overfitting. The regularization term helps to smooth out the model: it forces it not to go through every point, but to find a curve that is smooth and continuous, and typically this improves the predictions. The regularization term itself is just the sum of the alpha parameters squared over all data, multiplied by a gamma, which is the regularization parameter. This is another parameter of the method that we have to fit; the other one was the Gaussian width. Right, so the alphas can actually be found directly by performing a matrix operation: you add the regularization parameter to the kernel diagonal, invert this matrix, and that matrix inversion is the computational bottleneck, and then multiply by your actual labels. So this is very easy. Kernel ridge regression in general is very easy to implement computationally, but you do have to pay attention that there are these two different hyperparameters, the sigma here and the regularization gamma, that you have to fit while fitting the model. This is commonly done with grid search; I will demonstrate how, and then I will tell you how to spot the right solution. OK, so these are the two hyperparameters.
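As a minimal sketch of that closed-form kernel ridge solution, here is a plain-NumPy version with a Gaussian kernel; the function and variable names are my own, not the notebook's, and solving the linear system is used in place of an explicit matrix inverse:

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Pairwise Gaussian kernel between the rows of A and the rows of B."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def krr_fit(X_train, y_train, sigma, gamma):
    """Solve (K + gamma * I) alpha = y for the expansion coefficients alpha."""
    K = gaussian_kernel(X_train, X_train, sigma)
    # Adding gamma to the diagonal is the regularization; this linear solve
    # is the computational bottleneck mentioned above.
    return np.linalg.solve(K + gamma * np.eye(len(y_train)), y_train)

def krr_predict(X_test, X_train, alpha, sigma):
    """Prediction is the kernel expansion: sum_i alpha_i * k(x, x_i)."""
    return gaussian_kernel(X_test, X_train, sigma) @ alpha
```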
Very briefly about Gaussian process regression: this is a Bayesian regression type. When I say Bayesian, when anybody says Bayesian, it usually just means probabilistic rather than deterministic. That's a key distinction in statistics: Bayesian statistics is commonly contrasted with the frequentist point of view, which is the kind of statistics we learn in school, probabilities of coin tosses and things like that. Bayesian statistics has uncertainty built in, a bit like the quantum concept of uncertainty. So when we talk about Bayesian regression, it means fitting with many lines, not just one line, and the distribution of those lines gives some kind of width to the fit. This doesn't happen with frequentist fitting methods. Here's an example of how you fit a data set with the linear kernel, with straight lines, but many straight lines. Before you see the data, you know your final fit is going to be a straight line, but you don't know the slope or the intercept; it could be anything, right? You haven't seen the data. This is what we call the prior, and when we sample from the prior, the samples look like this. Any one of them could be a possible solution. Then you get the data, you take this prior, you fit it to the data, and this is the posterior; that's the result. Note that we have fitted many different lines that correspond to this data set, or describe it reasonably well; they don't seem too bad, and all of them could be possible solutions. So what we do is compute the average of these lines, and that's the posterior mean here. That's statistically the most likely solution, given this data and given the fact that we want to fit with straight lines. What we can also do is use the width of these possible solutions and define the posterior variance, or standard deviation, shown as the dashed line, and that shows the confidence limits of this model. We don't actually know what happens to this data before and after the data range, so the model gets uncertain at the beginning and the end. It knows very well what happens in the middle, because there's lots of data there, but you can see that the variance is a measure of confidence in the fit. So that's it in a nutshell. Bayesian regression is very commonly done with Gaussian processes, and if you want to learn all about Gaussian processes, please look at the Bible by Rasmussen and Williams; everything is in there, and this image comes directly from there. With a Gaussian process you always have this prior and posterior form. The prior is before you see any data. Let's say this is no longer a linear kernel; we can use other kernels, like the Gaussian, and that means our final solution to the fit will be some kind of wiggly line that corresponds to the data. But before we've seen any data, we only have the prior, and here you can see some samples from it. Any of these lines could be our solution, but we don't know which one yet, because we don't have any data. As soon as you have some data, you apply Bayes' rule to this Gaussian process prior with the data, and this collapses the space of all possible solutions from the prior to only those solutions that pass through the data points, like these two. You can see there are still very many possible solutions, represented by these wiggly dashed lines, but they all have to pass through these two points. So applying Bayes' theorem is a little bit like collapsing the wave function: it collapses the space of possible solutions. And again we can compute the average of all of these solutions, which gives us the full line, the posterior mean. The spread spanned by all possible functions gives the posterior variance, which is this measure of confidence in the fit. You can see that the variance vanishes where we have data, because we're really sure the function has to pass through those data points, but away from the data, here for example, and here in the middle, we really don't know what the function might look like, so the variance is quite high.
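For reference, the posterior mean and variance that these plots show are given by the standard Gaussian process regression equations; these come from the Rasmussen and Williams book rather than from the slide itself, with sigma_n denoting the noise level on the data:

```latex
% Posterior mean and variance at a test point x_*, given training data (X, y)
\mu(x_*) = k_*^{\top} \left( K + \sigma_n^2 I \right)^{-1} y
\sigma^2(x_*) = k(x_*, x_*) - k_*^{\top} \left( K + \sigma_n^2 I \right)^{-1} k_*
% where K_{ij} = k(x_i, x_j) and (k_*)_i = k(x_i, x_*)
```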
And here comes the kernel. Any two points x and x prime are connected by a covariance function, and this is encoded in a kernel. Here I'm showing again a classic Gaussian kernel, also called squared exponential or the radial basis function kernel. And, unsurprisingly, there are some parameters here too: the signal variance multiplies the Gaussian, and the length scale enters into the denominator of the exponent and describes the width of the Gaussian. These are parameters that you have to fit for the model. With kernel ridge you had to do it with a grid search, but here it's usually done in an automated way by the package, by maximizing the log marginal likelihood, which is the standard way to do it in Gaussian process regression. So this is all I wanted to say by way of introduction. Just to remind you: the model parameters in kernel ridge are usually the kernel width and the regularization parameter, and in GPR they are the length scale and the variance of the kernel. For kernel ridge we do a grid search with cross-validation, as we will see in the notebook now, and in GPR we maximize the log marginal likelihood. Right, the important thing then is to know how to evaluate the quality of your results, and typically this is done by means of a learning curve. Here are some learning curves. On the x-axis is the training set size, which you want to increase gradually; on the y-axis is the error, the mean absolute error on your data set. If you've done everything right, the error should go down; if the error is not going down, you've screwed up somewhere. What's also very interesting is to plot different data sets on the same axes, and you can see how the learning rates can be very different; the slope tells you how fast the error goes down if you keep adding data. You can see that for this green data set it's very sad: you can keep adding lots and lots of data beyond 100K and the error is not going down much, and that's probably what's going to keep happening, so there's something about this data set. But for this blue one, the error is going straight down, so if your desired level of accuracy is 0.1 here, you can maybe reach it by going to, say, 40K or so. When you're looking at these learning curves, it's also important to consider what is good enough: depending on your research problem, you have to decide which error is acceptable for this model and where you want to stop, and then the learning curves will tell you how big the data sets need to be. Another thing you can do is compare the performance of different representations. Here, for data set two, the blue one, we actually have two different representations, and you can see the dark one is clearly producing lower errors than the other. That's something you really want to do when you're starting a problem and don't know your data very well: try out a few representations, build these learning curves, and the ones that work best are always the ones that minimize the error. That's in a nutshell how you evaluate the learning quality.
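Such a learning curve can be produced with a short loop. Here is a minimal sketch, with hypothetical variable names and a kernel ridge model whose hyperparameters are arbitrary illustrative values, not the optimal ones:

```python
import matplotlib.pyplot as plt
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_absolute_error

def learning_curve(X_train, y_train, X_test, y_test, sizes):
    """Fit the same model on growing subsets of the training data
    and record the test-set MAE for each training set size."""
    errors = []
    for n in sizes:
        model = KernelRidge(kernel="laplacian", alpha=1e-2, gamma=1e-4)
        model.fit(X_train[:n], y_train[:n])
        errors.append(mean_absolute_error(y_test, model.predict(X_test)))
    return errors

sizes = [100, 200, 500, 1000]
errors = learning_curve(X_train, y_train, X_test, y_test, sizes)
plt.loglog(sizes, errors, "o-")          # error should go down as data is added
plt.xlabel("training set size")
plt.ylabel("MAE")
plt.show()
```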
Right, let's stop here, and I'm going to switch to a browser, where we will open a notebook and start coding. We're all here because we love to code. I guess you've had some practice with notebook execution in the previous tutorial, so I'm just going to sit down and, oh, I guess I have to reshare the notebook. All right, are there any questions while this is loading? Do you have any problems loading the notebooks? Just raise your hand if you do. Yep. I'm sorry, I can't hear you; is there a microphone, or can somebody pass one? One second. So I was wondering: with normal models, when we fit them, we can make many models and then do a query by committee to get some probabilities. What is the difference between that and the probability you get with the Bayesian approach? You mean you query the model to get a prediction? Yes, and then you can make some estimate about the error. About what? About the error of the model. The error of the model, yeah. So it's exactly the same with GPR; you're talking about the Bayesian approach, right? The prediction you retrieve is the posterior mean, the average of all solutions, and that is the actual prediction from the model, as we will see now in the tutorial. It's entirely comparable to the predictions you get from kernel ridge; we just have to remember that in GPR it comes as an average of many possible solutions. The one difference in GPR compared to kernel ridge is that you can also obtain the width of those predictions, which is the standard deviation on the prediction. That doesn't come out of kernel ridge, but it does come out of GPR, and that's where the Bayesian component comes in: you get this width, you can inspect it and plot it, and we will do that in the tutorial. Thank you. Good question, not at all obvious if you haven't used these methods before. Thank you, Kunal. Any other questions? One. Hello. So if you were to use an ensemble of kernel ridge regressions, you'd also have some estimate of uncertainty, right? That is true, you could do statistics with kernel ridge regression; you can cycle through a data set, shuffle, and always take different training sets. Yeah. So how would that compare to Gaussian process regression? Well, actually, that's a very nice question; you can use these tutorials to evaluate that for yourself and see how it is. Right, thanks. One thing I can say here is that the Gaussian process uncertainty obtained during fitting reflects something inherent in the data set. If you don't have data in a particular region of your data space, if your data set does not represent data equally across the entire data range, you will get uncertainty predictions such that the uncertainty is high where you're lacking data, and we will see this in the tutorial. When you do kernel ridge regression, you can shuffle the data set and get some spread, but it doesn't tell you anything about the data that's missing; you just get statistical inference from the data you have. So that is perhaps some extra information that you get from GPR that kernel ridge doesn't deliver. Kunal here is an expert on GPR as well, so feel free to add something. Not sure if I'm sharing the right screen; it doesn't look like it. It looks like you will all be able to read the tutorials much more easily than me. OK, I'm sitting down now. OK, so let's start with this problem, where we are predicting a property of molecules. We are mapping the structure of molecules to a property that we pre-computed with DFT; this work was described in the paper you see here. We're mapping the structure of small molecules to molecular orbital energy levels, in particular the highest occupied molecular orbital level, the HOMO. We're using the QM7 dataset of about 7,000 small organic molecules, and we have pre-computed the HOMOs for all of them. For the representation, we use the Coulomb matrix, which Anton helpfully introduced.
So let's get through this. First, let's execute the setup. If you're still not sure, Shift+Enter will execute your cell, but you can also just press play. OK, so now let's fetch the datasets; this executes very fast. With these wget commands, you can download the datasets that are posted on GitHub. If you're not sure where to put your datasets for Colab, that's one really useful way: open a GitHub repository, upload your data there, and then import it with wget commands like these. Thank you, Kunal, for showing me that trick. OK, so now let's load our data; the wget step has placed the dataset files where the notebook can read them. You can execute this cell: we load the pre-computed Coulomb matrices into X and the pre-computed HOMO values into Y, and that's going to be the structure-to-property mapping. And just for fun, you always want to check what you loaded, so check the length of your Y array; you should have about 7K HOMOs, and if you don't, something went wrong. These are typical checks to do. Then, whatever you do, even if it's your own dataset, it's always good to check what you loaded, not just the sizes of the arrays, but to actually look into it and take a look at what it is. I'm going to make this window a little longer so you can see. We just picked one molecule from the X array and we are showing one Coulomb matrix entry. And this is it: just a bunch of numbers, with a bunch of zeros at the end because of the padding. If you work with Coulomb matrices, this is a structure you should learn to recognize, know, and love. And of course you can plot it, so here we are plotting this entry, the Coulomb matrix of one molecule, and this is how it looks. It is sorted by the value of the matrix elements, so that the upper left contains the largest values, which represent the heavy atoms; then there's this block, which is the hydrogen entries against the heavy atoms; and then you've got basically zeros in the bottom right of the matrix. This is more or less how it always looks. OK. That was examining an element of X; now let's examine the Y data, which is just the distribution of HOMOs we've just loaded. (This plot title should actually say QM7, not QM9.) What you see on this graph is the HOMO energy on the x-axis, and on the y-axis the number of molecules with that HOMO, so you can examine the distribution of HOMO energies. The largest HOMO is about minus four and the smallest are about minus nine, and you can see that the mean is somewhere around minus 5.5; the biggest number of molecules in your dataset have that kind of HOMO. And indeed, you can compute the average and plot it, and it's minus 5.66. So these are just sanity checks that you've loaded your data correctly and there's nothing wrong with it. Your label Y is plotted here, and it's unimodally distributed, which means you're probably going to get good quality learning.
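A minimal sketch of this loading-and-checking step is below; the file names here are placeholders, not necessarily the ones the actual notebook fetches from GitHub:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical file names; the notebook downloads the real files with wget.
X = np.load("coulomb_matrices.npy")   # one (padded) Coulomb matrix per molecule
y = np.load("homo_energies.npy")      # one HOMO energy (eV) per molecule

# Sanity checks: sizes, one representation entry, and the label distribution
print(len(y), "molecules")            # should be about 7K for QM7
print(X[0])                           # one Coulomb matrix, zero-padded at the end
plt.hist(y, bins=50)
plt.xlabel("HOMO energy (eV)")
plt.ylabel("number of molecules")
plt.show()
print("mean HOMO:", y.mean())         # around -5.66 eV in this data set
```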
OK, now let's divide the data into training and test sets. We do this manually here, and I'm going to show you how, but you can also use the train/test split functions that are included in many packages. If you're doing it manually, the first thing you want to do is shuffle the data; if you shuffle your X, it also shuffles the Y, and this is how you do it. Once you shuffle the data, you've erased any order in which your X's and Y's could have been stored, and often there is some kind of order, like molecule size or something. So you definitely want to shuffle the data. Even if you're not sure, shuffle; it never harms, and if you don't, you could accidentally capture some structure that will screw up your learning. So just shuffle. Once you've shuffled, you've erased any possible order, and then you can simply go in order from the first entry, the zeroth entry. Here we're going to take a training set of 1000 and a test set of 1000; not a very exciting or large training set, but let's do it like this, it's going to be fast. For the training set, we take the entries from the first to the thousandth, and everything else is discarded; then the test set is picked after the training set, and that's how we ensure that the training set and the test set are completely separate. OK, so we execute this, and now we have a training set and a test set. Right, now that you've selected them, you want to do another quality check: we plot the labels of the training set and the labels of the test set, and this is how that graph looks. You can see that they're more or less on top of each other, and that's a good sign; it means you're not accidentally putting certain data predominantly in the training set or predominantly in the test set. This is random picking, so you want these two distributions to more or less overlap, and to also overlap with the distribution plot from before; it should look like this. To make the comparison more quantitative, that's why we computed the mean HOMO value, minus 5.66. If you scroll down, you can check that the mean HOMO in the training set is minus 5.66 and in the test set minus 5.65, so this is really good enough; you're doing a good job. OK, so this is the end of the sanity checks: we've loaded our data and checked everything. Highly recommended.
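A minimal sketch of the manual shuffle-and-split just described, assuming X and y are NumPy arrays as loaded earlier; the variable names are mine:

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fix the seed if you want a reproducible split
perm = rng.permutation(len(y))        # shuffle X and y together with one permutation
X, y = X[perm], y[perm]

n_train, n_test = 1000, 1000
X_train, y_train = X[:n_train], y[:n_train]                                # first 1000 entries
X_test, y_test = X[n_train:n_train + n_test], y[n_train:n_train + n_test]  # next 1000, disjoint

# Quality check: the label distributions of train and test should overlap
print("train mean HOMO:", y_train.mean(), "test mean HOMO:", y_test.mean())
```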
Now, in this training section, I wrote down a little bit of the theory that was in the slides, so you can skip over that; it contains a lot of detail about how we're going to do the training. Do read through the bottom part, where we describe how we fit the hyperparameters with a grid search and cross-validation; we're going to implement this now in practice. The grid search requires us to set up a grid for the hyperparameters of our model, and this is what we're doing next. We've got the alpha hyperparameter and the gamma hyperparameter, and if I scroll up, the alphas are the regularization coefficients and the gammas enter into the kernel, right here. Note that the notation has now changed from the slides; this happens a lot when you read papers about kernel ridge, where the sigma of the Gaussian and the regularization parameter are denoted by different symbols. So please don't get confused, just double-check what is what. What we're going to do is create a grid with the logspace function, where for both alpha and gamma we have entries going from ten to the minus five to ten to the minus two. Note that this is logarithmic, so these numbers actually go into the exponent of ten, and that would be a four-by-four grid; let's make it smaller and change the minus five to minus four, so it becomes a three-by-three grid. That's going to be faster to compute, and you can come back and change this yourselves. Then we look at the cross-validation number. If we set it to five, your total training set is divided into five segments; we put four of them together, fit alpha and gamma on those four, and use the fifth as the test set, and that's done once. Then we do the same thing but hold out a different segment and take the remaining four as the training set, do the alpha-gamma fit, and check the error. We cycle through this five times, and that's what cross-validation does: each time we do a grid search, find the optimal values for alpha and gamma, and at the end we average over those values, because you never know exactly how the training set was selected, so the best thing is to average. Five is actually a pretty decent, relatively high value for cross-validation; we're going to set it to three here just to make the computation faster, but you can vary this later. Then we have to choose a kernel, and here I'm choosing a Laplacian kernel, which is very similar in build to the Gaussian; if you scroll up you will see the formulas for the Gaussian and the Laplacian, so you can compare them. The choice of kernel is a variable in the method as well. If you read Matthias's paper, he will tell you which kernel is best for the Coulomb matrix representation, but you can convince yourself of that as well by changing the kernel here from Laplacian to Gaussian and back and seeing which produces the smaller error. Whatever produces the smaller error is the optimal choice for your dataset, and this will vary with the representation, so if you later switch away from the Coulomb matrix to MBTR or SOAP or something different, do check the kernel choice again. Then we have to define some metric to check our results. We are choosing the mean absolute error, and it appears as negative because scikit-learn's convention is that scores are maximized, so error metrics come out with a minus sign and we always have to flip it back. All right, and this part is the key: this is where you define your grid search. Into the grid search go the kernel you've just decided on; your alpha and gamma, which will be chosen from this logspace grid; your CV, the number of cross-validation folds; your scoring function, which is the MAE; and we've switched on verbosity so we can check what the method does. Every time you start with a new method, I really recommend switching on verbosity so you can see what's happening under the hood. OK, having defined this, you want to train the model, and in scikit-learn this is done with the fit function; actually, in most packages it will just be called fit, and you put in your training X's and Y's. So let's execute this; this is the part that takes the most time.
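A minimal sketch of this grid-search setup with scikit-learn is below; the parameter names alpha and gamma are the KernelRidge arguments, matching the description above, but the exact variable names in the notebook may differ:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

# 3x3 grid: exponents from -4 to -2 for both hyperparameters
param_grid = {
    "alpha": np.logspace(-4, -2, 3),   # regularization strength
    "gamma": np.logspace(-4, -2, 3),   # enters the (Laplacian) kernel
}

grid = GridSearchCV(
    KernelRidge(kernel="laplacian"),
    param_grid=param_grid,
    cv=3,                               # 3-fold cross-validation
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes scores, hence the minus sign
    verbose=3,                          # print progress for every fold and grid point
)
grid.fit(X_train, y_train)              # the time-consuming step
```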
You start seeing the printouts already, so check them out. You see the cross-validation structure in each line: fold one of three, candidate one of nine; there are nine grid points and we're cross-validating three times, that's how we set up the model, and what you see next are the choices from the grid. Our grid goes from ten to the minus four to ten to the minus two, so the first value that's tried is ten to the minus four for both, and that's tried three times. Using the Laplacian kernel, your score, which is the MAE with this minus sign you can flip later, is 0.29. That's your MAE, in the units of your dataset label, which is a HOMO energy, so in our case electron volts, and it took 0.4 seconds to do this step. We do this step three times for the alpha-gamma combination ten to the minus four, ten to the minus four, because we're doing three cross-validations, and we get three different errors; this is expected, because the data is split differently each time. Then you average these, and that's the average value for this particular alpha-gamma pair. In the next line we start the second point out of nine: alpha is still ten to the minus four, but gamma is now ten to the minus three, so we're just going through the grid, the second out of nine possible combinations of alpha and gamma. Again, the splits differ, you get different scores, you average them, and that's the score for that alpha-gamma combination. So that's how you read this output. If things are not coming out right and these scores look too different, that's a sign something's wrong, so always check your output. I'm going to scroll all the way to the end now; the very last one is nine out of nine. Note that alpha and gamma are now the maximum possible, both ten to the minus two; that's the last combination we check. And here's the summary: we've done three cross-validations with kernel ridge and a Laplacian kernel, and these were all the alpha-gamma combinations we tried. One thing we then like to do is inspect the cross-validation results. In this output there should be nine entries, one for each combination of alpha and gamma, and in the first column are the mean absolute errors they produce. Now we're very close to our final solution for alpha and gamma: we check which error was the smallest, and whichever it was, that's the best alpha-gamma pair. One thing we like to do a lot is plot these as a heat map, and you can tell here that setting gamma to ten to the minus two was not a good idea: the errors are much bigger, over one electron volt, remember the units, versus about 0.3 for the others. What this shows is that alpha didn't really matter in this case, in this very small subset of possible parameters, and to some extent both of the other gamma values, ten to the minus three and ten to the minus four, work well. So that's how you find your final result and set your alpha and gamma. Having chosen the best parameters, and here we actually checked numerically, it turns out that an alpha of ten to the minus two and a gamma of ten to the minus four produced the smallest MAE, and that MAE was 0.279. The heat map values are rounded, so I wouldn't read this off the graph; a graph is really something qualitative, a sanity check. You always want to take the quantitatively lowest value, and that was this value from the list. Yeah, here it is; this is the best line.
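After the fit, the best combination and the full grid of errors can be read out roughly like this; a short sketch whose plotting details, and the assumption about the ordering of the grid results, may differ from the notebook:

```python
import matplotlib.pyplot as plt

print(grid.best_params_)     # the winning alpha/gamma combination
print(-grid.best_score_)     # flip the sign back to get the MAE in eV

# Reshape the mean test scores into the 3x3 (alpha, gamma) grid for a heat map
mae = -grid.cv_results_["mean_test_score"].reshape(3, 3)
plt.imshow(mae, origin="lower")
plt.xlabel("gamma index")
plt.ylabel("alpha index")
plt.colorbar(label="MAE (eV)")
plt.show()
```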
And now we pick these values of alpha and gamma. We have selected the hyperparameters, we put them into the trained model, and we predict. So let's do a prediction. Something that you always want to check is a scatter plot of the reference values on the x-axis against the predicted values on the y-axis, and you want this scatter of points to be as close as possible to the diagonal; the diagonal, of course, is x equal y. If your machine learning model is doing a very good job, all the data should be really, really close to the diagonal. That's not the case here, but don't feel bad if you see something like this; it could be much worse, and you'll probably see much worse in your fitting career. Your job now is to go back and try more cross-validation, obviously larger training set sizes, maybe a wider range for the hyperparameter search, perhaps a different kernel. You can vary all of these choices in your model, keep inspecting this, and it should get better. The quality of this scatter plot is always evaluated with the R-squared score; that's why we compute it. If everything sits right on the line, it should be one; this one is 0.6. R-squared should always be between 0 and 1. I've seen negative R-squared values, when the predictions are essentially random; don't get scared, some packages just report that. But it should be between 0 and 1, and the closer it gets to 1, the better your fit is. So always look at the plot for qualitative interpretation, but compute your metrics for quantitative comparison.
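A minimal sketch of this prediction and quality-check step, using the fitted grid-search object from the earlier sketch:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, r2_score

y_pred = grid.predict(X_test)    # uses the best alpha/gamma found by the grid search

print("test MAE:", mean_absolute_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))          # 1.0 would be a perfect fit

# Scatter plot: reference on x, prediction on y, diagonal = perfect prediction
plt.scatter(y_test, y_pred, s=5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], "k--")
plt.xlabel("DFT HOMO (eV)")
plt.ylabel("predicted HOMO (eV)")
plt.show()
```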
OK, I think this is the end of this notebook. We wrote down a number of exercises that ask you to go back up, change some choices, run the whole thing again, and check whether your errors go down. What you're looking to beat is this value of 0.264, which is the error of the prediction on the test set; that's your current best result. Always write down your current best result, and as you go up and change the choices, it should get better and better. I don't think we have much time for this right now, so if you're interested in pursuing it further, please do it at home; you have the links to these notebooks, so you can just play around with them. I have to say this kernel ridge tutorial is one of the simplest ways of implementing machine learning, and we have passed it on to students who have never coded before. So if you're just beginning, it's really useful to plug your data directly into this kind of code; it's pretty self-explanatory, so feel free to use it. Any questions before we move on to GPR to do exactly the same thing? Yes. One second, there's a mic coming, so we don't have to repeat the question for the recording. Yeah. So, just a quick, more technical question: you're using this Coulomb matrix, with zero padding, and you said that in Matthias's paper he also defines a different kernel. I looked at the notebook, and what was done is that you basically just flatten the matrix into a vector and then apply the kernel function. In practice, how would you do it, and how do you avoid wrong calculations, given that your input actually represents a matrix but you're using a distance function for vectors? The way to make it exactly right is to flatten everything in exactly the same way, and to prepare every Coulomb matrix for every molecule in exactly the same way: always sort, always do everything identically. Then you typically take the first row to start the vector, concatenate the second row, then the third, then the fourth, and so on; you unroll the matrix into a vector in exactly the same way for every single molecule. That way all the information is encoded in comparable locations for two different molecules, because you're going to be subtracting one descriptor from another, so you want the information to be in exactly the same location. Does that answer your question? Yeah, OK. And another question, if you happen to know: let's say your input representation really is a matrix; are there also kernel functions that respect the geometry and use distances between matrices instead of just flattening them? Honestly, you don't get any additional information from preserving the matrix form when subtracting. It is almost always easiest to flatten them in exactly the same way and then take the Euclidean distance between the two vectors. I don't remember seeing any case where preserving the matrix form gave additional information that wasn't already encoded when you flatten it out. Does anybody know? No, I think it's a very common thing to just go for the vectors. They're easier to handle, but they can be huge depending on your representation; for example, MBTR can have 50,000 entries, a Coulomb matrix maybe a thousand. OK, thanks. That's actually the limiting factor.
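To illustrate the consistent flattening being described, here is a tiny sketch; the sorting step shown, by row norm, is one common convention for the Coulomb matrix and may not be exactly what the notebook uses:

```python
import numpy as np

def coulomb_to_vector(cm):
    """Sort a Coulomb matrix by row norm (one common convention) and
    unroll it row by row into a 1D vector, identically for every molecule."""
    order = np.argsort(np.linalg.norm(cm, axis=1))[::-1]   # largest rows first
    cm_sorted = cm[order][:, order]                         # permute rows and columns together
    return cm_sorted.flatten()                              # row-major unrolling

# Two molecules prepared the same way can then be compared with a vector distance:
# d = np.linalg.norm(coulomb_to_vector(cm_a) - coulomb_to_vector(cm_b))
```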
One thing I have to say, just for posterity: please don't read your entire data set into memory at once if you are not planning to use all of it. We've run into a lot of memory problems trying to read in the representations for the whole data set when the training set you actually want is only a thousand. Not related to your question, but I just thought I'd mention it here. OK, good questions. Any other question? Yep. OK. So in this case you showed us that the mean absolute error is more or less the same over a range of parameters, and then we extract one combination. Given this kind of situation, is there some other metric that we could optimize? Let's say the mean absolute error does not change much; maybe we could make the model more general, or something like that. I would say that you don't want to do anything different. It's not a bad thing that the mean absolute error is about the same for different subsets; it just means that your data set contains a lot of similar data, so whichever way you take a training set, you more or less end up with a similar thing. That's all it means; the fact that the cross-validation errors are so close together is a reflection of your data set. If your data set were very different, super inhomogeneous and very diverse, you might see that the mean absolute errors differ, but that doesn't mean it's bad either; it just means the data set is inhomogeneous. As I said, you monitor this in case one of those mean absolute errors goes really bad, something really massive, which may indicate an algorithm problem, but you don't have to worry about whether they're very similar or not. You always just take the average and proceed. Does that answer your question? All right, good. All good questions; feel free to ask, there's no such thing as a stupid question, and this is the forum where you should ask these things. Go ahead. Hi, thank you very much for the nice talk. I haven't looked at the data, but could it benefit from some scaling of the raw data? Normalization is one way to scale data, I guess, and that's commonly built into most algorithms. Always try it with and without normalization, because it's difficult to say a priori which data set will work well with normalization and which won't. So you can always try some scaling, but that goes into the realm of super fine-tuning: after you've done all of this and everything is coming out reasonably well, if you want to push the MAE down a little bit more, you could try some scaling. Just beware: I've had project students in classes get stuck trying super-elaborate scaling that they invent out of thin air, scaling different properties with different things. You can also get stuck if you do too much scaling. I recommend trying normalization or not; and if there's some physical motivation why you want to scale some data, you can try it, but beware of over-engineering it, because it can take a lot of time and in the end you're squeezing out the third significant figure of accuracy or something. This one. This is very good; I thought this was maybe too basic, but these are all really good questions. All right, thank you very much for the tutorial. As far as I understood, we chose a range for the parameter gamma and a range for the parameter alpha, and then for each combination of these two parameters we compute the weights. Yeah. OK, now with QM7 we can do it in ten minutes, but maybe with another data set, like QM9, computing all the weights every time could be expensive. So my point is: is there a way to travel this grid, let's say, jumping to the most meaningful or most helpful area? For example, here for gamma 0.01 I get high errors everywhere, so I would like to exclude that. Is that possible? That's an excellent question. After we did this in our own research, we had the same idea, and we did it with active learning and Bayesian optimization, which is part three of this tutorial. It turned out it works exactly like that: it just goes to the most relevant regions where the MAE is already low, and it's much more efficient than doing this over and over, especially with larger training set sizes. We published a paper on it maybe one or two years ago, so if you look me up you will find what we discovered; we also tested random sampling and other approaches, not just grid search. And this is an important problem, because sometimes you're not only converging the kernel hyperparameters; you might also have parameters of your descriptor or representation, and then this becomes a higher-dimensional problem.
Ideally, you want to optimize all parameters at the same time, rather than first these ones and then those ones; otherwise you don't really know how they interact. So that's an excellent point, and there are more advanced ways of doing this. Thank you. For the record, I should also say: even if you're using QM9, always start the hyperparameter optimization with a very small training set size, like one or two thousand. Even if your dataset is two million, you can always take one or two thousand and just try. We've also done studies of how the optimal hyperparameters vary as a function of training set size, because we were a bit nervous about this in the beginning: are the hyperparameters you find with one or two K going to perform equally well when you've got 50K in the training set? There is a little bit of drift, but not that much, and that was also in the paper we published, so you can read up on it. Because we found there's not much difference, we're now fairly confident that if you probe the hyperparameter space even with small training sets, the hyperparameters you find will be generally valid for larger prediction problems. OK. At least with kernel ridge. OK, thank you. Good stuff. OK, maybe let's continue to the next notebook. Let's do exactly the same problem with GPR: exactly the same data, exactly the same problem, and let's see how the MAEs compare and how the GPR implementation differs from kernel ridge. Let's start here. Please load the GPR tutorial notebook. Let's first execute the loading of the packages, and then we load exactly the same data into X and Y. I'm going to skip the sanity checks where we plot everything. Then we divide again in the same way into test and training sets: again 1000 in training, 1000 in test, and we shuffle them. A random seed comes in here: every time you shuffle, you might end up with a different order, so be careful; if you want to reproduce exactly the same results, fix your random seed. Then you'll end up with exactly the same molecules in the training set and exactly the same molecules in the test set. This is a very good trick when you're still figuring out how to work with a data set, so that everything is exactly comparable. OK, we've loaded the data and built our test and training sets; I hope you all managed to get to this stage. Now we have to build a model. In kernel ridge regression, the choice of kernel was just a keyword in a function, and to some extent it is the same in Gaussian process regression: you have to first choose the kernel. What we will do here is use a Gaussian kernel, very similar to the previous case; you could also have switched from Gaussian to Laplacian there. The Gaussian typically has an exponent where the width of the Gaussian comes in through the denominator, and then there is a number multiplying the exponential. Typically the Gaussian kernel is defined with that prefactor multiplying the whole exponential, but in scikit-learn the Gaussian kernel is just the exponential part. So we are going to multiply that kernel by a constant kernel, and the constant kernel will be the factor multiplying the Gaussian. It's very simple, but it looks a bit weird because we're doing some kernel gymnastics: we actually have a composite kernel that's going to be a constant times an RBF.
The constant is the prefactor and the RBF is the exponential part; that's all it is. OK. And what about the hyperparameters? I've linked, by the way, all the relevant scikit-learn documentation; this is implemented in many, many other packages under mostly the same names. Please double-check, every time you use a package, how exactly they define the kernel. You'd think a Gaussian is a Gaussian and there could never be any difference, but apparently there are differences between packages, so please check the documentation carefully; it's your friend. OK. So we've defined this composite kernel: a constant multiplying an RBF. As I said, RBF stands for radial basis function, which is a common name for the Gaussian in the GPR literature, also called squared exponential. All of this is in the text, so please read it at your leisure. Now we look at the hyperparameters. The constant kernel has only one hyperparameter, the constant, and that's literally the value that multiplies the Gaussian. The exponential part, the RBF, also has just one parameter: the width of the Gaussian. And that's it, those are the two hyperparameters; there's no regularization term here. And remember, we don't have to do a grid search for these: they are fitted directly by the model fitting function, by optimizing the log marginal likelihood. So thankfully you don't have to worry about that. However, there is something you do need to worry about. This automated fit that the function performs, and this is the case in many GPR packages, comes down to a bounded hyperparameter search. For each hyperparameter, you need to define an initial value that kicks off the search, and upper and lower bounds beyond which the algorithm will not search: the lowest possible value is the lower bound, the highest possible value of the hyperparameter is the upper bound, and there's some initial value that you give it, from which it starts searching. The initial values are pretty straightforward: you put them in numerically, and it's OK if you don't really know what they should be, you can just try many different ones. The bounds are typically defined via a scaling factor, like a hundred, a thousand, ten to the something: the lower bound is your initial value divided by that factor, and the upper bound is the initial value multiplied by that factor. Does that sound reasonable? So basically, if your factor is a hundred, your search spans four orders of magnitude, two down and two up from the initial location. If your factor is 10,000, your search spans eight orders of magnitude. That's huge, but sometimes, when you really don't know where your hyperparameters should be, you might want to make the factor really big, run a really wide search, see where it ends up, and then next time focus more on that area of hyperparameter space. That's very common when you're starting with GPR on unknown data sets. I'm writing all of this down in the notebooks under "pro tip", by the way, because it's maybe not obvious that you want to do this, but as you get more experience, that's what you end up doing. OK, so let's construct the kernel, and in this kernel we have a constant kernel multiplying an RBF; it's pretty simple to construct these composite kernels.
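A minimal sketch of that composite kernel construction in scikit-learn; the initial values and the bound factor of 100 follow the description above, but your own starting guesses may well differ:

```python
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

bound_factor = 100     # bounds span two orders of magnitude down and up
const0 = 4.0           # initial guess for the prefactor (illustrative value)
length0 = 100.0        # initial guess for the RBF length scale

kernel = ConstantKernel(
    constant_value=const0,
    constant_value_bounds=(const0 / bound_factor, const0 * bound_factor),
) * RBF(
    length_scale=length0,
    length_scale_bounds=(length0 / bound_factor, length0 * bound_factor),
)
```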
Any product of two kernels is also a kernel; any sum of two kernels is also a kernel. Kernel development is a big area of research in GPR, but it's very, very common to simply multiply kernels together. In the constant kernel, we have to specify the constant, whose initial value we set here; then come the constant value bounds, which are the initial value divided by this bound factor, set here to 100 (you can change that), and the upper bound is the constant multiplied by 100. That's how you define the lower and upper bounds. For the RBF, the length scale is the key hyperparameter; we've given it a starting value of 100. This is somewhat arbitrary again; you can try this many times, varying the initial values. Once again, the lower bound is the initial value divided by the bound factor and the upper bound is the initial value multiplied by it. Very common; let's execute this. OK, that was fast, and now we've got our kernel set up. Now we are going to define the GPR model. There is some text here you can read, but I'm going to talk directly about this data structure. The GPR model in scikit-learn is called GaussianProcessRegressor, and it has certain choices; these are more or less the same in any other GPR package, under very similar names. The key thing to insert is the kernel we've just designed, in the cell above. The next big thing is to add the alpha, which is the noise on the data. This is a difficult concept in computational science, because we compute all our data with high precision: everything comes out of the code to something like ten to the minus ten accuracy, and you think, oh, there's no noise in my data, everything's exact. It turns out that's not true: even if you think there isn't, in practice there are sources of noise in your data. So how to set this alpha is always an interesting question. You can go to very small alpha values; in the beginning, when I started working with GPR, I thought there's no noise, alpha is ten to the minus twelve or something. Then I realized that this alpha is added to the diagonal of the kernel matrix, and if your matrix is close to singular and you don't add anything to the diagonal, you get a lot of matrix inversion problems and your algorithms become unstable. So even if you think you have no noise, please choose something like ten to the minus eight for alpha, something that's still small but not ridiculously small. That helps the algorithm stability a lot and actually makes it faster, because when the algorithm is grinding, it falls back to other numerical methods to try to compute the result, and then your computation can suddenly go from, say, two minutes to ten minutes, and you know something's wrong. So be careful with this alpha. The pro tip for alpha is to do a grid search, because it's very difficult to know what the inherent noise in your data is, and that is what alpha is really trying to describe. So what we do now, every time we start with a new dataset, is run this whole process with alpha ten to the minus eight, ten to the minus seven, ten to the minus six, and check the MAE. Usually there's a sweet spot. If you make alpha too large, like 0.1 or so, it smooths out all the data and GPR just fits a straight line through everything; that's clearly wrong, and your MAE ends up sky high.
So you can tell when this is clearly wrong, right? If your alpha is very, very small, you can be in the overfitting regime, where the model bends over backwards to fit through every single point, and that's also not ideal. That's why it's a nice thing to check this alpha, and I stress it because it's a hidden parameter; it's not something we have a feeling for in our data. Obviously, if I work with experimental data, I more or less know what alpha is, because the experimentalist tells me the resolution on the data points. But for computational data this is a little bit more tricky, so please check it. So this is it, this is the most important part, the kernel and the alpha. There are some other, secondary things. For example, somebody asked about scaling of the data: normalization is something you always want to check, whether true or false works better. Whatever produces lower error works better, rule of thumb. And then the next thing, there is this thing called restarts. This goes back to how the optimal hyperparameters are found, by a local search within the bounds, right? You've defined some starting position and you've defined some bounds, and internally the function uses a local minimizer from the initial position, trying to find the minimum of the negative log marginal likelihood, or whatever the target is, in the space of all hyperparameters. Like with all local minimizers, you can sometimes get stuck in a local minimum, or something happens to your minimization and you don't get an optimal value. That's what these restarts are there to handle. If you switch them on, as in not zero (I think it's zero by default, so please be careful), the algorithm will try twice more, starting from some random part of the hyperparameter space between your lower and upper bound, and do two more searches. It might find a better solution in one of these additional searches, so it's always good to check by doing a couple of restarts. You don't want to do ten restarts, because you'll find that just a few is probably enough to get a better solution. And this is particularly important when you're starting on a new data set and you don't know how to specify the initial values for the hyperparameters. Instead of trying many different initial values, just switch on the restarts and see what optimal values come out, and the next time around, take those optimal values and make them the initial values. Then your search will be much, much faster. So those are some of the tips. Try the restarts, but please note that the more restarts you include, the longer your fitting takes, because it runs the whole bounded search again and again. If you put in ten restarts, you multiply your fitting time several times over. That's why I say just try a couple and don't overdo it, because you won't find any new information beyond a couple. I don't remember if we executed this; let's execute it again. By the way, I hope somebody's looking at the chat questions, because, yeah, you are, are you seeing like 60 questions or something? And now the model training, that's with this GPR fit function. Once again, like with the kernel ridge, all you have to feed in is your training X and your training Y, and just do GPR fit on the GPR model we've just defined.
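For reference, a minimal sketch of what such a model definition and fit might look like in scikit-learn, using the kernel constructed above. The specific values for alpha, normalize_y, and the number of restarts are illustrative choices in line with the discussion, not necessarily what the notebook uses, and X_train, y_train are assumed to exist already.

```python
from sklearn.gaussian_process import GaussianProcessRegressor

gpr = GaussianProcessRegressor(
    kernel=kernel,            # the ConstantKernel * RBF composite from above
    alpha=1e-8,               # small diagonal jitter standing in for the data noise
    normalize_y=True,         # try both True and False, keep whichever gives lower error
    n_restarts_optimizer=2,   # a couple of extra random restarts of the bounded search
)

# Fitting optimizes the kernel hyperparameters by maximizing the log marginal likelihood.
gpr.fit(X_train, y_train)
```

A simple way to do the alpha grid search mentioned above is to loop over a few values (say 1e-8, 1e-7, 1e-6), refit, and compare the resulting test errors.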
And after executing this, which takes about 40 seconds, so this is the longest part of this tutorial, now we have to wait. What the fitting process does is run these repeated local searches in your bounded hyperparameter space, and what you get afterwards is the optimal model with the optimal hyperparameters it found. So the first thing to do after fitting is to print out those optimal hyperparameters and check them against the initial values you put in, right? One really important check is whether your optimal values ended up on a bound. Sometimes, if you don't know where the bounds should be, you define a lower bound and an upper bound, and after training the GPR you find one of your hyperparameters sitting on the upper bound. That means the gradient wanted to go that way, but it couldn't go beyond the bound, so it stopped there. That's an indication that you need to make your bounds larger, to give more space for the hyperparameter search. That's a very, very common thing in the beginning, when you don't really know where your hyperparameters are going to end up. Okay, so this took about a minute to execute; I think we're all done computing. And here is how you extract those hyperparameters. What you're seeing here is that the constant hyperparameter ended up at 1.06 and the length scale of the RBF ended up at 23.1. That's how you read them. If we go back to see what we said, we put the length scale at 100 and it ended up at 23, not too bad of a guess. We had the constant at about four, I think after some trying here, and it ended up at one, more or less. Okay, so this is now the optimal choice, the model is trained, and those hyperparameters are set to the optimal values. Now you're ready to predict. So let's do the prediction. As I said, scikit-learn automatically takes the best hyperparameter combination and uses it for prediction. If you're using a different package, check that this is happening, or you might have to put those in manually. And what you do is a GPR predict on the test set, so all you have to feed in is X from the test set, and it predicts the label, the Y. What comes out are two things, and this is related to the question of what the difference between kernel ridge and GPR is. Y mean is the equivalent of the kernel ridge prediction, that's the posterior mean. But you can also get other information from the model to see the uncertainty, and that's what the Y std is, the standard deviation on Y. This doesn't come out by default; you have to switch it on if you want the uncertainty as well. But I advise you to always keep it on, because once you've fitted the model, this is essentially free to recover, so always extract this information, it's very interesting to analyze. When we evaluate the MAE, all we use is Y mean, that's it. So here we just computed the mean absolute error by checking how close the Y means predicted from GPR are to the labels, and we print out this value. Let's execute. Okay, 0.3 is the error. Now we can actually compare it to the kernel ridge MAE, which, if I scroll down, does anybody remember what the kernel ridge error was? Speak up, 0.264. So there we have it, 0.26 or 0.27. So this is a little bit worse at the moment, at 0.3, but not too far off, right? They've got similar data sets, similar training set sizes, maybe some hyperparameter choice was not optimal, but it's in the ballpark.
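A rough sketch of those post-fit steps in scikit-learn, assuming the gpr model from above and test arrays X_test, y_test; the exact variable names in the notebook may differ.

```python
from sklearn.metrics import mean_absolute_error

# Inspect the optimized hyperparameters and compare them to your initial values
# and bounds (a value sitting exactly on a bound suggests the bounds are too tight).
print(gpr.kernel_)

# Posterior mean and standard deviation on the test set; return_std=True is what
# switches on the uncertainty output.
y_mean, y_std = gpr.predict(X_test, return_std=True)

# The error is evaluated on the posterior mean only.
mae = mean_absolute_error(y_test, y_mean)
print(f"GPR test MAE: {mae:.3f}")
```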
So this is kind of good for a first try, right? Now you can tinker with things to make it better. Okay, so that is how you compute the MAE from the posterior mean. Now let's look at the posterior variance, the measure of uncertainty, and a very, very common thing is to plot it first. This is a histogram again, where on the X axis there's the posterior variance, or the level of confidence in the model, and on the Y axis is the number of molecules. What this is telling us is that over a hundred molecules have an uncertainty of about 0.45, something like that, and indeed that's the average of the distribution. Then there are some molecules, maybe over 20, that have a very large uncertainty, more than 0.6, and other molecules, quite a few of them, that have reasonably small model uncertainties. Yeah. So Claudio is just telling me that we have five minutes to go, which just means we won't have enough time to cover everything in depth. Oh, it's fine. Kernel ridge and GPR, honestly, if we have to stop somewhere, this is a good place, because it's really important that you understand the basics of these two really well. Afterwards you can do all kinds of gymnastics with them, but it's going to go badly if you don't understand the fundamentals really well. So it's good that we covered this. Okay. And this is how you examine the uncertainty of the model. Why is this model so uncertain? Well, the noise could be large, actually. You will find that if you increase the noise, these uncertainties just get bigger, because the noise enters into the model and gets reflected in the uncertainty. And if your noise is very small, then your uncertainty is, I wouldn't say necessarily small, but the smallest it can get as far as the noise contribution goes. The biggest reason why this model is so uncertain is that we have so few training data, right? And one thing we can tell is that if you look at the right-hand side of the graph, where the uncertainty is the largest, you can inspect which data points these actually are, which molecules. That's something very interesting to look at: where is the model uncertain? It could be that you just have very few data of that type. So if you inspect that end of the scale, you might find that they're all large molecules, and for some reason there are not many of them in this training set. If you were to add more large molecules, this large uncertainty would shrink and eventually disappear. So you can actually suppress this peak in different locations by adding data of a similar type; the model will learn better on that type of data, and the uncertainty for that region of the data set will go down. That's how you can know what you need to selectively add into your data set to improve the model, and this is why the GPR uncertainty is a very, very useful output from a machine learning model: it allows you to systematically improve your data sets. That's something we're now thinking about for all kinds of data sets that we're using. But more importantly, it can be used to do active learning, and this is the next tutorial I wanted to introduce you to. But first, this is the end of our GPR, so now we can take some questions.
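A small sketch of that uncertainty analysis, assuming the y_std array from the prediction step; the bin count and the "20 most uncertain" cut-off are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Histogram of the predictive uncertainty over the test molecules.
plt.hist(y_std, bins=30)
plt.xlabel("GPR posterior standard deviation")
plt.ylabel("number of molecules")
plt.show()

# Indices of the most uncertain predictions: inspecting these molecules shows
# which kind of data (e.g. large molecules) is under-represented in the training set.
most_uncertain = np.argsort(y_std)[-20:]
```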
Note that there are some exercises here asking you to vary the noise, try different hyperparameter searches, try different data set sizes, and check how it affects the uncertainty. You can do that in your own time. Any questions? Well, thank you, but I'm not necessarily finished. So let's see. More applause is always good. Any questions about GPR? Okay, this seemed very clear, or not clear at all, one of the two. Okay, maybe you can ask later. Yeah, that's fine. Thanks, Kunal. So let's do something very quick in five minutes. I have prepared these two notebooks, an introduction to active learning with Bayesian optimization, and you can actually run through them on your own in your free time if you're interested in active learning. This is now becoming very popular, and it can be applied to many different data sets, including experimental data. The notebook starts with an introduction to GPR, but you've now covered that in a lot of detail. We do so much active learning now that we've encoded it into our own package called BOSS, for Bayesian Optimization Structure Search. So this notebook loads that package, and then we set up a very simple Bayesian optimization problem. Actually, let's run through this very quickly, because it's not that long, it's a very simple notebook. We're going to set up a one-dimensional function that looks like this, just a one-dimensional non-periodic function, and we're going to start sampling data. Remember, with active learning you don't have to have a data set pre-computed, so we're going to start computing a data set to try and reproduce this function. That's how it works. Well, actually the longest part of this tutorial is just installing the package, so let's see if this goes fast. Okay, I think I've already executed it, so I'll just show the results. Here you set up a Bayesian optimization run, where you tell it some kernel and how many iterations, how many data points to evaluate, then you run it, and then you execute some post-processing routines that analyze the Bayesian optimization run and give you a set of figures. So you can see where the data was acquired, but most importantly, you can see how the model improves. We're aiming to reproduce the black line here, and the blue line is the current model. With only three points the model is not very good. With five points it's starting to get closer to the true function, and the shaded area is the posterior variance. So this is nothing more than GPR, just with subsequently more and more data. After a few more points it's pretty close to the true function, and that's how Bayesian optimization works. It uses this measure of uncertainty to decide where to take the next data point to improve the model the most, and so you get very, very good fits with relatively few data points. And if you think about this model, it could be any molecular or crystalline property. It could be energies, it could be HOMOs, it could be any kind of functional property. You just keep on sampling structures and taking data points of that property, and you're building a GPR model for the landscape of that property, while making as few computations as possible. That is the essence of active learning. It's very efficient because it samples only where the uncertainty is large and where data is missing. Well, it also samples near the optimal locations.
So you can read all about it, how to do active learning and how to check the quality of active learning runs, for example by checking the convergence of your model and things like that. And then there's another notebook which describes how to do conformational structure search, that is, atomistic structure search with Bayesian optimization. And I just want to say, here I show some examples, and this example is actually implemented for a conformer structure search, which you can run just by executing this notebook. Here you can see we took many more data points. Well, not that many more, 40, we're not killing ourselves here. And here is how the model gets better and better with more data points: this is the model with, I think, 10 data points, with 20, and with 40. You can see that after a while you keep taking data but nothing changes anymore. This is a two-dimensional case, and the tutorial is set up so that in the exercise you can actually do a four-dimensional case: if you look at this molecular conformer, it has four degrees of freedom defined by these torsional angles, so if you do the 4D case you've done everything for this structure search. Those are some nice exercises if you're interested in Bayesian optimization. And something that I think I won't be able to show right now, but I'll still try, are these slides, which I hope you will also be able to get access to at some point, which describe these principles of active learning and Bayesian optimization. Basically the idea is that you fit a model, you evaluate the acquisition function, do the calculation, add it to the data set, and then you repeat, rinse and repeat until convergence. So much of the theory is exactly the same, but now you have this acquisition function that tells you how to use the GPR model to decide where to sample the next data point. There are many acquisition functions, this is also an area of active development, but we use ones that tend to balance exploration with exploitation, which means looking near the minimum or optimal solutions balanced with searching areas of space where you didn't have data before. And then this describes how the first tutorial works, and this gives you these landscapes where you can map structure to property very easily and extract the optimal solutions. This is a little bit about the code, and here I'm showing a movie of tutorial number two for active learning, where you are varying four degrees of freedom and collecting data, and at the same time you're building a model for the energetics and checking what the global minimum is. The movie on the right shows the global minimum, and the bar at the bottom shows that as you keep sampling more structures, the global minimum gets lower and lower and lower until you reach the right solution. At that point you can continue sampling if you want and nothing will ever change, right? You've found the global minimum, and that's how active learning works. You can plug in any source of data. This tutorial was done with an Amber force field, so it's very fast, but we've used a lot of DFT and even quantum chemistry, which, as Manuel there knows, is always giving him headaches. Lots of very lengthy acquisitions, but you can do many different things.
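That fit / acquire / compute / add loop is what the BOSS package automates. As a rough illustration only, and definitely not the BOSS API, here is what the cycle could look like with plain scikit-learn GPR on a made-up 1D toy function, using a lower-confidence-bound acquisition to balance exploitation (low predicted mean) with exploration (high predicted uncertainty).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

def target(x):
    # Stand-in for an expensive calculation (force field, DFT, quantum chemistry, ...)
    return np.sin(3.0 * x) + 0.5 * x

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(3, 1))            # a few initial samples
y = target(X).ravel()
grid = np.linspace(0.0, 5.0, 500).reshape(-1, 1)  # candidate points for the acquisition

for _ in range(15):                                # active-learning iterations
    gpr = GaussianProcessRegressor(ConstantKernel() * RBF(), alpha=1e-8)
    gpr.fit(X, y)                                  # 1. fit the model
    mean, std = gpr.predict(grid, return_std=True)
    acq = mean - 2.0 * std                         # 2. lower confidence bound acquisition
    x_next = grid[np.argmin(acq)].reshape(1, -1)
    X = np.vstack([X, x_next])                     # 3. do the calculation and
    y = np.append(y, target(x_next).ravel())       #    add it to the data set, then repeat
```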
This was a conformer search problem, but we've also looked at a lot of adsorption problems, where again you sample a lot of adsorption scenarios, you're learning about the energetics, and you're finding the global minimum of the adsorption energy. Okay, that's the end. Save your applause. Kernel-based methods, I hope you understand, are very effective tools in machine learning for materials science. They're very commonly used to map structure to property with atomistic descriptors or representations. They're very easy to use and are implemented in very many codes and packages that can run on both CPUs and GPUs, and GPUs tend to be hundreds of times faster than CPUs, so if you can, just use those. We find that they're generally good for very different data sizes. We used to think you need, you know, 10,000s or 100,000s of data points, but then we started doing kernel ridge with like 3,000, 5,000, up to 10,000, and it generally does a good job, right? It depends a lot on your data set and how diverse it is, but it's not necessarily always a big-data method, and it's very easy to use. But please be careful about the hyperparameters. These methods are so easy, but the hyperparameters are the one thing you need to be careful about, and hopefully now you've learned how to do hyperparameter fitting for both KRR and GPR. Okay, thank you for your attention, and sorry we've run a little bit over. [Applause] Okay, thank you very much, just a few minutes of overtime. Amazing tutorial, as always. The resources will be online, the slides and the recording, we will upload them probably next week. Now we have our lunch. If you are vegetarian or gluten-free, please tell the staff. There should be vegetarian options for those of you who signed up on the sheet, and we will see you again at two o'clock sharp, please.