Hello everyone, and to those of you who are following along the series, welcome back. This is the fourth session in the hands-on data science and machine learning training series, and the second session on machine learning. Specifically, we will build on the ideas we developed last time and continue talking about classification and regression. Many of you are following along the series, so you might have seen this slide before, but let's briefly recap where we are and where we are going with respect to the broad goals of this workshop. The big idea is to familiarize ourselves with the central concepts of machine learning, that is, models that learn from data. In the previous session, we saw how to build models, simple ones like linear regression as well as complex ones like neural networks, to perform regression tasks, that is, to predict continuous outputs. In this session, we will look at classification tasks for the first half, using the same kind of neural network to classify objects into different categories, and we will also look at another kind of regression model. In the next session, we will look at dimensionality reduction. This might be of interest to people who work in high-dimensional spaces and want to use machine learning techniques to visualize their data in two or three dimensions, or to distill a vast data set into a more manageable and interpretable space. The last session will look at machine learning for design of experiments: using ML algorithms to point toward the next experiment that gives you the highest probability of success. You define your own success metric, and the algorithm guides you toward the experiments you need to perform so that you reach success faster than with random trials. As always, we will continue to use nanoHUB as our cyberinfrastructure to test out these models and play with them, and in this session we will also use Citrination, the database platform from Citrine Informatics, to get some data and train models. That is the outline for today: we will recap neural networks (for those of you who are new, I'll go over the basics again, so don't worry if you missed the previous session), use them for a classification task, and then switch to random forests and see how those models learn from data. Let's get started. Let's refresh our minds about what neural networks are. We discussed last time that neural networks are abstractions of the real network of neurons in your brain; the picture here on the right is an artificial neural network. The key abstraction is that each neuron, represented by a circle here, is an entity that takes inputs, performs some computation, and passes the computed output along to the next neuron. That is the central job of one neuron. When building artificial neural networks, you typically arrange neurons in layers. Last time we looked at the example of using a neural network to predict Young's modulus. We will look at a different example today, but let's stick with that one for the purpose of illustration.
Let's say we have two inputs to the neural network, the melting temperature and the resistivity, denoted Tm and rho. That constitutes our input layer, the inputs to the model. The output layer is the quantity you are interested in predicting; in the previous session it was the Young's modulus, and today it will be a different quantity. In between, you can have as many hidden layers as you want, with as many neurons in each layer as you want. In this case, I have one hidden layer with two neurons. The first thing we need to understand about how neural networks work is how to take some inputs and compute the outputs. So let's say we wanted to compute the value of neuron A1. Some of you may recall terms like weights and biases. The central idea is that the value of neuron A1 starts as a weighted combination of the inputs: the weight W1 times the melting temperature Tm, plus W2 times the resistivity rho, plus a bias term B1. If we were to stop here, a neural network would only be capable of predicting linear combinations of its inputs. You can imagine that if the underlying function is exponential or quadratic, something beyond linear, a linear combination of the inputs might not be enough to capture it. To capture that nonlinearity, we apply something called an activation function. We do the same for neuron A2: look at the inputs, perform the computation, and pass it along to the next neuron. The computation is again weights times inputs, so W3 times Tm plus W4 times the resistivity plus the bias B2, with an activation function applied to that as well. You can do the same for the Young's modulus output, whose inputs are A1 and A2, and the result is the prediction of your model. Again, today we'll look at a slightly different example, but the illustration holds. So this is how you define the structure of a neural network. The key terms to keep in mind as you foray into this field are weights, biases, and activations; they will come up frequently as you train more and more of your own neural networks. Given the structure of a neural network, let's briefly look at the activation functions, the nonlinearities, so that we have a concrete idea of what they are. The simplest activation function is the one that does nothing: it takes the input and returns the same value as output, a linear activation function. The second is the hyperbolic tangent, or tanh, which squashes the inputs so that the outputs always lie between minus one and one. The one we used in the previous example was the rectified linear unit, or ReLU. This activation function suppresses negative inputs: f(x) is zero if x is less than zero, and if x is positive it returns the input unchanged. This is the one we used last time, and I think it's the one we'll use today as well, so it's good to be familiar with it. Now that we've defined all the components of the network's structure, we need a scheme to actually learn from the data. To do that, we need a metric of how good or bad our model is.
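To make that forward pass concrete, here is a minimal sketch in Python. The weight and bias values (w1, b1, and so on) are made up for illustration and are not taken from the notebook.

```python
# Minimal sketch of one forward pass through the small network described above.
import numpy as np

def relu(x):
    # rectified linear unit: 0 for negative inputs, identity otherwise
    return np.maximum(0.0, x)

# two inputs: melting temperature Tm and resistivity rho (arbitrary example values)
Tm, rho = 1728.0, 6.9e-8

# hidden neuron A1 = activation(w1*Tm + w2*rho + b1)
w1, w2, b1 = 0.01, 1.0e6, -5.0      # illustrative weights and bias
a1 = relu(w1 * Tm + w2 * rho + b1)

# hidden neuron A2 = activation(w3*Tm + w4*rho + b2)
w3, w4, b2 = -0.002, 2.0e6, 3.0
a2 = relu(w3 * Tm + w4 * rho + b2)

# output neuron (linear activation here) takes A1 and A2 as its inputs
w5, w6, b3 = 4.0, 7.5, 10.0
y_pred = w5 * a1 + w6 * a2 + b3
print(a1, a2, y_pred)
```

Training then amounts to adjusting the w and b values until y_pred gets close to the ground truth, which is where the loss metric discussed next comes in.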
If the Young's modulus is the quantity we're interested in predicting and the ground truth is, say, 100 GPa, then a model that predicts 0 GPa is pretty bad, while a model that predicts 95 GPa is pretty good. How do you encode this mathematically? One way is a mean squared error function; we saw the mean squared error and the mean absolute error last time. Given this metric for how good or bad the model is, you then need a way to march the weights from their initial random values to values that give decent predictions. One way to do that is to update the weights repeatedly using gradients. This is called backpropagation, and there are many schemes for doing it; the scheme is specified in the optimizer argument in the code, and we'll look at that again in a second. Those are the fundamentals of a neural network. Today we will look at a classification task where the question is very simple. Last time, we looked at predicting the Young's modulus of a material given the heat of formation, lattice constants, and other atomic properties. For the classification task, we'll ask: can we predict the crystal structure of various elements in the periodic table from the same atomic information, heats of formation, melting points, atomic masses, and so on? Let's think about that a little more. What is the input to the model, and what is the output? The inputs are the same as last time: all the data I've highlighted here in this table, and more. The output we want is a crystal structure. But crystal structures are given names like face-centered cubic, body-centered cubic, and hexagonal close-packed, that is, FCC, BCC, HCP. These are strings, and you can imagine that working with strings is difficult. One way to encode these strings mathematically is one-hot encoding. What is one-hot encoding? Let's say we have nickel, whose crystal structure is FCC. In our example, every material belongs to one of three crystal structure types: FCC, BCC, or HCP, face-centered cubic, body-centered cubic, or hexagonal close-packed. Given only three possibilities, we can represent FCC as a set of three numbers: one, zero, zero. Similarly, we define BCC to be zero, one, zero, and HCP to be zero, zero, one. Why are we doing this? Why represent text as a set of three numbers? First, because working with text is difficult, especially when you define things like loss functions that measure how good or bad your model is; you want a mathematical representation of everything. Second, as we'll see in a moment, this kind of encoding allows the network outputs to be interpreted as probabilities, which connects to the underlying concepts of maximum likelihood estimation and the outputs of the network representing a probability distribution. That is the reason for one-hot encoding. The key idea is that we represent each crystal structure as a set of three numbers, as shown here. So let's actually jump into the code at this point. To do this classification task, I'll again ask everyone to go to the link here; this is where we jump into the hands-on demonstration.
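As a minimal sketch of that encoding, here is one way to write it down; the notebook may construct it differently (for example with a Keras utility), so treat the names here as illustrative.

```python
# One-hot encoding of the three crystal structure labels described above.
import numpy as np

classes = ["FCC", "BCC", "HCP"]
one_hot = {label: np.eye(len(classes))[i] for i, label in enumerate(classes)}

print(one_hot["FCC"])  # [1. 0. 0.]
print(one_hot["BCC"])  # [0. 1. 0.]
print(one_hot["HCP"])  # [0. 0. 1.]
```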
If you fall behind, you can refer to the handouts and work at your own pace, or you can follow along live as I run and edit the code. Let's go to nanohub.org/tools/mscml and hit Launch Tool. This is again a Jupyter notebook tool; I'm sure most of you are getting increasingly familiar with Jupyter notebooks as you progress through the sessions. Even if this is the first session you're attending, that's okay; we'll spend a little time on what you need to know to get started today. When you hit Launch Tool, you will see a landing page containing four links, and I'll ask everyone to click on the fourth link, neural network classification to predict crystal structures. When you do that, a Jupyter notebook like this will pop up. The key thing to know is that Jupyter notebooks consist of cells, and cells can contain code or other content. You run any cell by hitting Shift+Enter or by hitting the Run button here. The first cell, for example, contains text; it's a markdown cell. We're not interested in that, we're interested in the code, so let's keep hitting Shift+Enter until we reach this cell over here. The first thing we do is import libraries, just like last time. We will use pymatgen and mendeleev to get the atomic data: atomic numbers, volumes, basically all the inputs to our neural network. We will train the neural networks using Keras, which is an API built on top of TensorFlow and comes from Google. You might have heard of other libraries like PyTorch, which originated at Facebook; the choice is yours, but we're going to use Keras. Let's run this cell. The star means the cell is running, and once it has completed you should see a number here, like one. What we've done in this cell is import the libraries and make a laundry list of all the properties we want to query pymatgen and mendeleev for. The query itself happens in this section here. This is about getting data from databases, something previous sessions have covered; all you need to know is that this cell gets the data and arranges it in a pandas DataFrame. If you're interested in how that happens, feel free to go over the code later at your own pace. What you get from this cell is a DataFrame. As a brief reminder, pandas DataFrames are objects that give you the look and feel of Excel inside Jupyter notebooks while giving you all the flexibility of Python objects, like slicing, indexing, and filtering with complex queries. Let's look at the first row here; this will be the input to our model, so let's take a second to understand it. We have many rows, each of which constitutes one training example. We have atomic numbers (I don't remember offhand which element 27 is), atomic volumes, boiling points, atomic radii, and resistivities. So we have our data neatly arranged. If you've been following the previous sessions, you might remember that the next step is to divide the data into training and testing sets. The reason is that you want your model not just to predict decent values for examples in the training set, but also to generalize well to examples outside it. That is the whole concept of dividing your data into training and testing sets.
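As a minimal sketch of that split, together with the standard score normalization discussed next, here is one way to do it; the small DataFrame and column names below are made up stand-ins for the queried data, and the notebook itself does the split by slicing rather than with the scikit-learn helper shown here.

```python
# Train/test split plus standard score (z-score) normalization on a toy data frame.
import pandas as pd
from sklearn.model_selection import train_test_split

# toy stand-in for the queried atomic-property data (column names are illustrative)
df = pd.DataFrame({
    "boiling_point":     [3186.0, 2435.0, 3560.0, 1180.0],
    "resistivity":       [6.9e-8, 5.9e-8, 4.7e-8, 4.4e-7],
    "crystal_structure": ["FCC", "FCC", "HCP", "BCC"],
})

X = df.drop(columns=["crystal_structure"]).to_numpy(dtype=float)
y = df["crystal_structure"].to_numpy()

# hold out part of the rows as a blind test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# subtract the training mean and divide by the training standard deviation,
# reusing the same statistics for the test set
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train_norm = (X_train - mean) / std
X_test_norm = (X_test - mean) / std
```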
The whole job of the test set is to evaluate how well your model performs when it sees something new for the first time. That is what we will do next, but just like in the previous session, we will also normalize the data. To understand why, let's look at the data once again. Quantities like boiling points, atomic masses, and resistivities strongly depend on units. You want the model not to struggle with those units, to be somewhat agnostic of them, and more importantly to treat quantities of different scales on the same footing. Something like a resistivity on the order of 10 to the minus 8 or minus 7 is very small compared to something like a boiling point, and you want the model to treat them on the same footing and not ignore the resistivities just because they're numerically tiny. The way we'll do that is by normalizing. There are many ways to do this; one is to normalize your data so that the maximum value is plus one and the minimum value is minus one. We will use standard score normalization instead, which you might have seen in your statistics classes: you take the data, subtract the mean, and divide by the standard deviation. If your data is reasonably well distributed, this gives values roughly between minus two and two, as you can see from this print statement. I hope that's comfortable. Just as a reminder, this is where the division into training and testing sets happens. It uses an operation called slicing, which is common to lists and arrays, so if you are interested in how that division is done, look into that. There are also library functions that automatically and randomly split data into training and testing sets, which might be of interest too. But let's jump into the more interesting part: let's create the model. We have our data, we've divided it into training and testing sets, and we're ready to use a neural network to do the classification. We follow the same outline as last time, the same one we've been using for all models, so let's briefly revisit it. The first step is to define a model object. For linear regression, that meant using the LinearRegression class; for neural networks, it means using the Sequential class, which comes from Keras. This is where we define the model object. The next thing we do is add layers to the model, because this is a neural network. If you are doing linear regression, you don't: there is no concept of layers there, so you're ready to go once you've declared the model object. Here, we add layers with the model.add command. We will again use dense layers. A dense layer is designed so that, behind the scenes, Keras connects every neuron you specify in this layer to each and every neuron (or input) in the previous layer automatically. If we want a dense layer with 16 neurons, that number goes in here; if you wanted a layer with 100 neurons, you would swap that number out. The next thing we specify is the activation function for this layer. We will continue to use ReLU, the rectified linear unit; you can use tanh or any other activation function you like. There isn't really a rule of thumb, although for deep learning architectures these days ReLU tends to be preferred.
Those are the minimum requirements for specifying a layer, but there are two additional arguments we should look at. One is the input shape. If this is the first hidden layer you are specifying, Keras needs to know how many inputs your neural network has, because by definition the first hidden layer is connected to the inputs of the model, and Keras needs that number to make the dense connections properly. You do that with the input shape argument, which specifies how many inputs the network has. In this case I think we have 16, so that would be 16, but we've written it in a general way so that it works for any number of inputs. The other argument is the kernel initializer, which lets us initialize the weights. We said the weights go from random values to values that give decent predictions through the training process. So how do you specify those initial random weights? You can do it explicitly with the kernel initializer argument. You don't have to; the default is random. But if you want to control the randomness with a seed, or initialize the weights in a specific way, this is the place to do it. That is the one hidden layer, and if you wanted to add 100 hidden layers, you would just call model.add over and over again, which you can do in a loop. For this demonstration I'll use one hidden layer with 16 neurons. The next model.add call you see here is the output layer. How do I know this is the output layer? Two reasons. One, there is a comment here that says output layer. Two, and this is perhaps the more serious reason, it is a dense layer with three neurons, and we said we would one-hot encode our crystal structures into three values; three outputs means this must be the output layer. This is also where we allow the network outputs to be interpreted as probabilities, so let's spend a second on that. How do we ensure the network outputs are probabilities? It boils down to using the correct activation function, which in this case is the softmax activation function. Say you wanted to do a regression, like the Young's modulus, or solubility, or housing prices, or any other quantity of interest. A typical regression metric would be the mean squared error or the mean absolute error, because you're interested in how far you are from the ground truth. For a classification, metrics like these don't make much sense, especially if you want to interpret the outputs as probabilities. The job of the softmax activation is to take the raw model outputs and convert them to probabilities. How does it do that? Let's work through an example. Say the raw output of the model is three numbers, like 10, 20, and 50. The softmax activation looks at these three numbers and converts them into a set of three probabilities, maybe something like 0.2, 0.3, and 0.5. How does it convert them? There's a simple equation; I don't want to get into the math, I'll just outline it briefly and you can read more on your own time. It exponentiates each of the inputs and divides by the sum of all the exponentials.
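Here is a minimal NumPy sketch of that conversion. The raw outputs are made up, and the point is only that the results are positive and sum to one; the exact values will of course depend on the inputs.

```python
# Softmax: exponentiate each raw output and divide by the sum of the exponentials.
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtracting the max is a standard numerical-stability trick
    return e / e.sum()

raw = np.array([10.0, 20.0, 50.0])   # illustrative raw network outputs
print(softmax(raw))                   # three positive numbers that sum to 1
```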
Because you're normalizing by the sum, you might quickly see that such a scheme ensures the converted outputs are always positive and lie between 0 and 1. But the math isn't the important part; all you need to know is that softmax converts raw outputs into probabilities. Let's work through another example just to get comfortable: maybe the raw outputs are 1, 50, and 30; then the probabilities might come out to something like 0.3, 0.6, and 0.1. That is the job of the softmax activation function, and with it your outputs are now probabilities. Next come the loss function and the metric used to train the model: how do we measure how good or bad the model is, and how do we train it? These are specified in the model.compile command. We need two things here: the loss function, and a metric for how well the model is doing. Because this is a classification task, we can use a very intuitive metric called accuracy. Accuracy, as you would imagine, tells us what fraction of the inputs we are guessing correctly; 90% accuracy means you're getting 90% of them right. The loss function is called categorical cross-entropy, which is again an equation, but let's not get into the math here; all we need to know is that it is a loss function that approaches zero as we make more and more predictions correctly. Let's run this cell. You see a small table pop up; that's because of the model.summary command. It tells you how many layers you have and how many neurons each layer has, which is useful for debugging. Okay, so let's train the model. We've declared the model object and added the layers. The next step in our outline, if I can go back here, is to use model.fit to train the model, and then model.predict or model.evaluate to evaluate it. So let's go to step three, using model.fit to train the model; the metric we will track is again the accuracy. Let's remind ourselves of a key concept here, the validation set. Remember that when training neural networks we have a training set and a testing set. The testing set is never seen during training; it's a blind evaluation of the network. Once you feel you're done, you evaluate on the testing set, and that is your reported performance. While training the model, a key question you might ask yourself is: when do I stop training? That is where the validation set comes in. A validation set is a small subset of the training set that you use to periodically check how well or poorly the model is doing. Typically, you plot the training loss and the validation loss; in this case our metric is accuracy. An accuracy of one means everything is correct, and we've essentially achieved that because we only have around 50 data points. In a real-world example you might see accuracies of 90 or 95 percent, but here our training accuracy is very high and the validation accuracy is pretty good too. To get the concrete numbers we can use the model.evaluate function here, which gives us the accuracy on whichever set you pass to it, training or testing. We see that the training set accuracy is 100 percent and the testing set accuracy is about 71 percent, which is pretty good.
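Putting the pieces together, here is a hedged sketch of the define/compile/train/evaluate steps in Keras. The layer sizes, epoch count, and variable names (x_train, y_train_onehot, and so on) are illustrative placeholders rather than the notebook's exact code.

```python
# Minimal sketch of the classification model described above, using Keras.
from tensorflow import keras
from tensorflow.keras import layers

n_features = 16   # number of input columns after normalization (illustrative)

model = keras.Sequential()
# hidden layer: 16 neurons, ReLU activation, densely connected to the inputs;
# a kernel_initializer argument could also be passed here to control the random starting weights
model.add(layers.Dense(16, activation="relu", input_shape=(n_features,)))
# output layer: 3 neurons (one per crystal structure); softmax turns them into probabilities
model.add(layers.Dense(3, activation="softmax"))

# categorical cross-entropy loss for one-hot labels; track accuracy while training
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()

# train, holding out part of the training data as a validation set
# history = model.fit(x_train, y_train_onehot, epochs=200, validation_split=0.2)

# blind evaluation on the held-out test set
# test_loss, test_accuracy = model.evaluate(x_test, y_test_onehot)
```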
To make predictions with this model, if you're satisfied with the training, we can use the model.predict command; that is the job of this cell over here. We use model.predict to make predictions, but there's one extra step I want to highlight that is different from the regression tasks. We said we would encode crystal structures like FCC into three numbers, something like one, zero, zero. So when we actually use this network, how do we get back the crystal structure from those outputs? Let's work through an example. Say the outputs of our model are 0.6, 0.3, and 0.1; they have to add up to one. The way we convert that is to look at where the maximum probability occurs. The maximum is clearly 0.6, and it occurs at index zero (remember that most programming languages start counting from zero). Because index zero represented the FCC structure, the one, zero, zero encoding, we say the maximum is at index zero, which means the element is most likely FCC. Let's work through another example just so we're comfortable. Say the probabilities are now 0.3, 0.6, and 0.1. The maximum is still 0.6, but it's now at index one, the second position, so the prediction would be BCC, because we encoded BCC as zero, one, zero. That is the extra post-processing step you need in order to convert the outputs into an actual crystal structure prediction, and that is the job of this for loop here: it does that set of operations and arranges the results in a DataFrame. I'm going to run this cell again because I forget whether I ran it, and we get a DataFrame called plot_df, so let's look at that. It says, for instance, that the first row is atomic number 27, the true crystal structure is HCP, and the predicted crystal structure is HCP. That's good, and it lists the same information for all the examples. Rather than staring at the table, though, we can make some nice plots, again using Plotly. I'd encourage you to look at some of the previous recordings if you're unsure what Plotly does; just take that cell to be a block of code that gives you a plot like this. Let's look at one example, say nickel here, which we know is FCC. The network thinks nickel is 76 percent FCC; that's good, it's fairly sure nickel is FCC. It also thinks there's a 16 percent chance it's BCC, but as long as it strongly favors FCC, we take that as our prediction. Now look at something like iridium over here: the network thinks it is 82 percent HCP, which is not true, iridium is actually FCC, so the network gets this one wrong. That's okay; that's why our testing set accuracy was 71 percent. You can look at the other lists for the BCC and HCP elements too, but I hope that is all comfortable. If you're new to this, I'd encourage you to stick to our outline here: define the model object, add layers, use model.fit to train the model, and use model.predict to make predictions.
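Going back to that post-processing step, here is a minimal sketch of the argmax conversion; the probabilities below are made up for illustration.

```python
# Turn each row of predicted probabilities back into a crystal structure label
# by taking the index of the largest probability.
import numpy as np

classes = ["FCC", "BCC", "HCP"]          # same order used for the one-hot encoding
probs = np.array([
    [0.6, 0.3, 0.1],                      # max at index 0 -> FCC
    [0.3, 0.6, 0.1],                      # max at index 1 -> BCC
])

predicted_labels = [classes[i] for i in np.argmax(probs, axis=1)]
print(predicted_labels)                   # ['FCC', 'BCC']
```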
Okay, so I hope that is comfortable. Let's switch gears and look at another regression technique, random forests. If I can jump back to my PowerPoint here: we've seen how the classification task performs, so let's look at random forests and decision trees for a moment. One reason to do this is that you might be working in a field where data is not readily available; neural networks can be very data hungry, and you might need thousands of data points to get something going. In some fields, like the financial sector and some areas of engineering, decision trees and random forests have proven quite successful, so it's worth understanding them; maybe this is even the first model you train before jumping into neural networks. Let's look at an example. Say we want to predict the ionic conductivity as a function of the heat of formation, and the plot looks something like this, with groups of points in orange, purple, green, and yellow. The job of a decision tree is to decide how to split the data into different branches (a tree has branches, right?) such that each branch makes a prediction with the least error. Let's understand that. Say the decision tree made a split that says: if the heat of formation is less than 300, I'm going to predict a conductivity of 10 to the minus 4. Where does that number come from? If the heat of formation is less than 300, there are only two data points, the orange ones, so I can take their average and call that my prediction. If it is more than 300 but less than 340, I take the average of the purple points and call that my prediction, and so on for the other groups. So the job of a decision tree is to figure out where to split the data such that the predictions for the orange, purple, green, and yellow groups have the least error. If you were to train a decision tree on this data set, you would ideally end up with a tree like this. And you can imagine that if I split not just on one feature like the heat of formation, but also on an additional feature like the lattice constant, there would be more splits: heat of formation less than 300, here's my prediction; maybe an additional split involving the lattice constant, heat of formation less than 300 and lattice constant less than four, and that's another branch in the tree. You can keep adding branches, and this brings us to an interesting point: if we keep adding branches like this, we can very quickly overfit the data. All I need to do for the orange set here is make one more split, and I will predict it perfectly, because there is only one data point left in that branch, so my prediction is exactly that data point's value. It's one hundred percent correct, and wildly overfitting, which means the model is very unlikely to generalize well outside the training set. How do we fix that? We can use random forests. Forests are collections of trees, and random forests are collections of decision trees. We've seen, using the heat of formation as an example, how a decision tree learns where to split the data. So how do we use random forests? We create an ensemble of decision trees, maybe two thousand or three thousand trees, and we still let each tree overfit to its own data, but we restrict the data available to each tree.
I've highlighted here the key terms you might want to look up and understand in detail: bagging and bootstrapping. These are terms you will come across regularly. The point is that for each tree in our forest, we choose a subset of the training data; if you have a hundred examples, maybe tree number one sees only a random subset of them, and if we have 15 features, 15 inputs to the model, maybe tree number two only sees five of those features. Doing that means that even though we let each tree overfit to its own data, when we collect the trees together, most trees will make poor predictions for a given data point, because it lies outside their training subset, but some trees will have had that data point in their subset, so when you average all the predictions, the predicted conductivity in this case will be a decent value. I hope that broadly makes sense. To see this in practice, I'd invite you to join me in clicking on another link that takes you to another tool. I actually have it loaded here already, and the reason is that this tool uses an API key. I'll ask everyone to go to nanohub.org/tools/citrinetools and click the Launch Tool button; I have already launched the tool, and I'm going to wait a second so that everyone can catch up. What you should see is a landing page like this. On that page there is an empty text box, which is where you put in your Citrination API key. I have already done that, because I do not want to do it live and share my API key with everyone. Once you enter your key and hit Enter, you should see a success message. Once you've done that, click on the second link here, machine learning guided design of ceramic oxides for batteries. I'm going to load up that notebook and pause again for a second so that people can catch up, enter their API keys, and follow along; if you fall behind a little, work at your own pace, it's not a problem. Okay, let's continue. You'll again see some markdown text; hit Shift+Enter to keep running the cells. The first thing we'll do is load the data we need, so let's put some context to this problem. We said we'd use random forests to do what? We are interested in predicting the conductivities of some novel battery materials called garnets. We'll predict the ionic conductivity, and to do that we need a real data set; unfortunately pymatgen and mendeleev don't quite cut it here, so we'll use an additional data repository, Citrination in this case, with the help of these libraries over here. Matminer is a library that will let us convert raw data into features; we'll see that in a second. Let's hit Shift+Enter on this cell; it looks like my cell has completed execution. The first thing we do is query Citrination for the database of ionic conductivities, and it's loading the data here. A brief side note: this data set, I believe, was uploaded by one of our group members, who did the hard work of going over hundreds of papers and collecting it manually. If you are in that kind of situation and have no problem putting your data up openly, I would consider doing so, so that your collaborators can train models and aid your process. Let's look at the raw data. We have a chemical formula, lithium, lanthanum, zirconium, and a bunch of other elements, and we have the conductivities, with many examples.
The first thing we will do is filter some of the data. Because this is a real-world data set, you can imagine there are problems like duplicates, or measurements not taken at consistent temperatures. So trust me that the next few cells just clean the data: they remove duplicate entries and get rid of measurements at temperatures that are too high or too low. I'd encourage everyone to keep hitting Shift+Enter until you come to Section 2, where we obtain the features from matminer. To understand what's happening here, let's run the cell and then look at the resulting data frame. The cell gives us a DataFrame called x_df, so let me add a new code cell here and display it so we can visually understand what's going on. You'll see a bunch of numbers, so let's compare this to what we had before: previously we had a chemical formula and a conductivity. Just as with the crystal structure case, working with strings is difficult; you want to convert this composition into a set of numbers that the model can use to learn and train itself. How do we convert it? There are many options. One is to do it manually, which is possible but not ideal, especially when libraries like matminer's MultipleFeaturizer exist: it converts the composition into a list of features ready to feed into your model for training. What are those features? Let's look at one example. The first example here has a feature whose value is 5.0; this is the number of elements present in the compound, so if a compound has five elements it's five, and if a compound has six elements, like this one, it's six. The next few numbers encode the compositions themselves: given a five-element system, different compounds can have different compositions, and that is encoded in these numbers here. The numbers after that are weighted averages of atomic properties. If you remember the previous tutorials, we looked at properties like atomic mass and atomic volume, which was easy because we had only one element to work with. If you have five elements, what do you do? One way is to take a weighted average depending on the composition, and that is exactly what these numbers are. Again, this is possible to do manually, but if a library exists that does it automatically, I'd encourage everyone to use the library. So those are the seemingly arbitrary numbers we have generated, and we've generated a lot of them, because the library gives us a lot of features. If you're interested in the question of whether you really need all these features, I'd encourage you to stay on for the next session, I think that's on Friday, on unsupervised learning, where you'll see how to reduce a high-dimensional feature space to a lower-dimensional one. For now, let's just assume we have a feature set ready for training. As always, we divide the data into training and testing sets, and the first section here actually trains a neural network, so feel free to run it if you want; I won't do that in the interest of time, since this model can take a while to run, but feel free to try it. Let's just keep going.
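Going back to the featurization step above, here is a hedged sketch of how a composition string can be turned into numeric features with matminer. The specific featurizers, preset, and formulas below are assumptions for illustration; the notebook's exact featurizer set may differ.

```python
# Composition featurization sketch: formula string -> pymatgen Composition -> numeric features
# such as element counts and composition-weighted averages of elemental properties.
import pandas as pd
from matminer.featurizers.conversions import StrToComposition
from matminer.featurizers.composition import ElementProperty, Stoichiometry
from matminer.featurizers.base import MultipleFeaturizer

df = pd.DataFrame({"formula": ["Li7La3Zr2O12", "Li6.5La3Zr1.5Ta0.5O12"]})  # hypothetical garnets

# parse formula strings into pymatgen Composition objects (adds a "composition" column)
df = StrToComposition().featurize_dataframe(df, "formula")

# combine featurizers: stoichiometry descriptors plus Magpie elemental-property averages
featurizer = MultipleFeaturizer([Stoichiometry(), ElementProperty.from_preset("magpie")])
x_df = featurizer.featurize_dataframe(df, col_id="composition")
print(x_df.shape)
```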
Let's jump straight to Section 4.2, which is the random forest section; I'll pause for a second so that everyone can find it in the notebook. This is where we will train our random forest. One thing I will ask everyone to do, just because of the peculiar way this code is written: we need one extra line of code so that the plotting works correctly. If you hover over the cell above, you will see a line that I've highlighted here, a variable called layout0. Please copy that and paste it into the cell we're working in. This is only because we've skipped a bunch of cells; if we ran through every cell in order, this would work just fine, but since I'm skipping cells, I want everyone to be able to run this cell without errors. I have already pasted it in here. The first thing we do is get the data and divide it into training and testing sets, like always, and then we follow our standard outline. Declare the model object; that happens here, and it is now an instance of the RandomForestRegressor class. There are two arguments: n_estimators, which is 2000, the number of trees in the forest, and random_state, which helps us reproduce results. Having declared the model object, note that this is a random forest, so there's no concept of layers or weights or biases; it's a slightly different kind of model, where you keep splitting the data into branches until you get a decent prediction, so declaring the model object is all you need. The next thing you do is train the model using the model.fit command, as always, then use model.predict to make predictions, and then there are a few lines of code for the plotting. Let's hit Shift+Enter on that. You'll notice there is no counter here like there was for the neural networks, tracking epochs and how the loss evolves, because again there is no concept of weights and biases being updated; the model is trained slightly differently. Those are details for another time, and I can address them if there are questions. The plot we get at the end of the day shows the predicted conductivity from the random forest on the y-axis and the experimental conductivity from the data set on the x-axis, and you'll see that the model does fairly well. There are a few points it doesn't predict so well, but by and large, for both the training and testing sets, marked with the crosses, the random forest does decently. That's really the basic outline you need to train a random forest. An obvious question, since we scrolled past the neural network section, is how this compares to a neural network and why you should use a random forest. It depends on your data set, how many data points you have, and how the data is spread out. For this data set, I'd encourage everyone to run the neural network on your own time and compare the errors, to see which model does better and which does worse. Perhaps more importantly, you can use these as starter codes for your own examples and see which model works better. I believe that is all I had to say. I hope this session on classification and random forest regression was useful and that it fits well into the larger scheme of things. As always, the tools are available, and you can run these starter codes whenever you feel like it. With that, I think I should stop talking now and leave some time for questions and answers.
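For reference, here is a minimal sketch of the random forest workflow just described, using synthetic stand-in data rather than the featurized garnet conductivity set; variable names and data are made up for illustration.

```python
# Random forest regression: declare the model with the number of trees, fit, predict, score.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                                     # stand-in features
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)      # stand-in target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# n_estimators = number of trees in the forest; random_state makes the run reproducible
model = RandomForestRegressor(n_estimators=2000, random_state=0)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("test R^2:", r2_score(y_test, y_pred))
```

Varying n_estimators and re-checking the score on a held-out validation set is the usual way to tune the number of trees, which is also touched on in the Q&A that follows.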
All right, thank you very much, Saketh, for this presentation. I'd like to invite everyone to unmute yourselves and give Saketh a round of applause for finishing up. With that, I'd like to invite everyone to stay in the chat room as we go into the Q&A session. We've had a very good discussion throughout the presentation, and a couple of questions were answered, but I'd like to ask Saketh the questions that weren't, those that came up near the end. Okay, so the first question: could you comment on the concept of standard error, that is, running the model several times and averaging the results and the uncertainties? Yes, so let me do this in the context of the neural network. I think in this notebook we have specified seeds for the weights, which means that if you specify all the seeds, then running the same model many times should give the same results. But there are ways to run an ensemble of networks: say you don't specify the seed, then running the model several times gives slightly different predictions. And yes, you can use that in the most standard way; say you train 10 neural networks and report that, on average, this is my prediction and these are my uncertainties. That is certainly a valid approach. In fact, when adding uncertainties to neural networks, one common approach is exactly that, with an additional tweak: adding something called a dropout layer. With dropout you randomly drop, say, 20 or 25 percent of your neurons and then train each model, so that in addition to the seed you have extra randomness; layer one will still have eight neurons, say, but you don't know which connections are active. That way you can run many neural network instances, collect the predictions, get an average and a standard deviation, and report that as your uncertainty. That is certainly a common way to do it. I hope that answers the question. Seems complete to me. So, next question: how can you choose the number of estimators? What's a good metric for deciding that? This is for the random forest, so let me pull that up here. Right, here's where you do that: 2000 is the number of estimators, the number of trees. The brief answer is that you have to look at validation sets again; you don't want your model to overfit. For people in general, this is commonly known as hyperparameter tuning. Most models have hyperparameters; for neural networks there are things we glossed over, like the learning rate. Those fall under hyperparameter tuning, and the way you do it is, again, you have your training set and your testing set, and you use a validation set to see which of these models overfit. Maybe you try 500 trees and 2000 trees, evaluate the error on the validation set for each, and find that the validation error with 2000 trees is higher, which means 2000 trees probably overfit the data, in which case you should probably use 500. So the way to do it is by looking at the validation set. It's not obvious here where the validation set is specified, because of the way this library works; you would have to define the validation set manually, and you can import functions from scikit-learn, like r2_score.
What you would do is, after your training is done, say after model.fit, call the r2_score function on your predictions and the ground truth for the validation set, and use that number to decide how many trees you need. The reason for doing it this way is that with the neural network we could conveniently stop partway through and say, stop here, stop when the validation error reaches a certain level. You can't quite do that here, because there is no concept of weights and biases sequentially evolving; once each tree is trained it is set in stone, and there is no going back and updating the splits. That's why you have to do this outside the training loop: train, look at the error on the validation set, and then change the hyperparameters. I hope that answers the question. Sounds good. I just want to interject quickly and say that yes, these recordings will be put online as soon as they are processed. It takes a little while, because we need to post-process and make sure background noise is removed, but they will be online. Are there any other questions? If you would like, you can unmute yourselves and ask them now; we encourage you to do so. I have one, and thanks again for the nice presentation. I'd like to go back to a question I posed that was answered, but I wonder if it's worth discussing further, and that is the concept of normalization. If we think of your example of conductivity, we know some of the physics of it, that it follows, say, Arrhenius behavior. I understand the normalization is trying to get everything between zero and one, but the distribution is not going to be normal if it follows Arrhenius behavior. Is there any advantage in using that known physics in the way you normalize the data before you run your network? That's a very nice question, and it is in fact something that comes up routinely if you're working with models that involve some physics. Most of these normalization schemes were developed with the background assumption that you might be working with, say, housing data sets, things for which you have thousands of data points but no strong underlying trend, so more likely than not, if you collect enough data points, by the law of large numbers the distribution is going to look roughly normal. But you're absolutely right that for many physics-motivated problems the data does not have a normal distribution; things like conductivities have their own characteristic distribution. Now, even though there is an Arrhenius dependence, there is still some hope: if you collect conductivities for a wide variety of materials across a wide range of compositions and temperatures, the data won't be strictly Arrhenius, because different materials have different activation energies, so the data will be spread away from any one particular distribution. It still won't be normal in most cases, but having said that, if you don't have enough data points and your data is far from a normal distribution, maybe it doesn't make much sense to use the standard score; that is where other normalization schemes come in, and you can simply scale the data to between zero and one, or minus one and one.
In the end, the main physical motivation for normalization is that the various inputs are not treated unequally just because one of them happens to be measured in units that make its values very small; that's the central idea. Even if your data is somewhat skewed in its distribution, that idea still applies, and if it is highly skewed, there are techniques to alleviate that: you can get more data, either by actually doing experiments or synthetically, by using a model like this to generate more data, or you can downsample, and if you have a noisy data point you can remove it, but that is getting into another tangent. The question is very valid, and standard normalization schemes like the one I outlined are not ideal for data with strong built-in trends, but the central idea is again to get rid of problems with units and the like. As long as that is achieved, your model is still going to try to learn from a skewed data set. There are potential problems if you throw a data point into the testing set that does not belong to the distribution at all; say you train models to predict the conductivity of battery materials and then your testing set is graphene, it's probably not going to get that right. But the hope is that your training set has enough representation of the scenarios in which you'll want to use the model that the normalization scheme won't strongly affect the results. I know that doesn't quite give you a recipe for how to normalize your data, but I don't think there is one; maybe someone else can jump in if they have better ideas. Thank you. Another question that came up: when should biases be used, and, as an example, why should you use a bias? By default, if you don't specify anything here, the model will have biases. The way to not use biases is to specify an argument called use_bias, which you can set to true or false. The only real reason not to use biases is if you know, for a very special model, that the output depends only on the inputs, with no need for a bias term, which is a very niche case that comes up rarely. I have so far trained only two or three models where I knew, for example, that the output is only a linear combination of the inputs with no bias required, but that's generally not the case. The use of the bias is basically this: sometimes during training the weights can get stuck in a local minimum, and a weighted combination of the inputs will then hover around a given range of values. The job of the bias term is to push that weighted combination up or down, so that as the bias gets updated, the model can get itself out of the local minimum and drive itself toward weights and biases that are more reasonable for your problem. It basically boils down to biases being extra degrees of freedom. They are not strictly necessary, but it's hard to see a case in which an extra degree of freedom would hurt, and that's why biases are commonly used. The most common scenario in which the bias helps is, again, that with the inputs in a certain range, the weighted combination is always between, say, minus 0.4 and minus 0.5 because you're in the middle of a local minimum and the weights are getting stuck at certain values. That is where a bias term could get updated from 10 to 20, change the value of that neuron, and push it away from the local minimum. That is where extra degrees of freedom are useful, and that could be thought of as one use of the bias. But generally, the case for
the bias is simply that it's an extra degree of freedom: I'm going to use it, why not? Sounds good. Are there any other questions at this time? If anything comes up later, you are more than welcome to email any one of us, especially Saketh, with any questions, and one of us will attempt to answer it. I have another one; it's me again. Going back to the way we would normalize the data, is there any correlation with how that would affect the choice of activation function? If I'm thinking of using the hyperbolic tangent, I could think of that as similar to, say, Avrami behavior, and therefore the weights would be a little easier for the network to fit to the data. Or am I mixing two separate things? No, absolutely, that is a valid line of thought; it's just that it only really works if you're dealing with a network with one layer, for example. The moment you have a network with even two or three layers, that logic might be useful, but it falls apart fairly quickly. Say your data happens to be distributed like a hyperbolic tangent; of course using a hyperbolic tangent is going to be nice, but once you pass it along to the second layer, there's really no saying what's happening in between, unless you have only one or two neurons and you exactly control the weight values. So it may work or it may not. While the idea has merit, in practice, if you're training networks with many neurons or many layers, that kind of correlation, "my data is distributed this way, so let me use an activation function shaped like that," falls apart by the time you reach the third layer, because the network is doing whatever it wants in the middle. Say your data is distributed like a hyperbolic tangent and for some reason the network decides that all the weights in the third layer should be minus 100; then it's not going to work out as you expect. So it may work, but there's no guarantee, and I would say it's not a recommended strategy, because there is no single recommended strategy here. The choice of activation function is largely motivated by getting your network to train properly. The choice of hyperbolic tangent versus ReLU comes less from how my data is distributed and more from needing the network to make correct predictions: maybe something like the hyperbolic tangent, which squashes everything between minus one and one, is not ideal, and I need something more flexible, and that's where ReLU or something like it comes in. The choice of activation function stems from that line of thought rather than from how the data is distributed. You're certainly welcome to try it, but I think what will happen is that once you add more and more layers, the initial trend you expect to see will die out in the later layers, so it might not work as simply as you think. That's my brief answer. Thank you. All right, last minute: are there any other questions? If not, then I'd like to invite everyone again to thank Saketh for the Q&A session and the overall presentation by unmuting yourselves and applauding.