Thank you all for being here for the fourth session in the hands-on data science and machine learning training series. This is the second session in which we talk about machine learning. The last session looked at linear regression and neural-network regression; in this session, we will look at classification, and at using random forests for regression tasks.

If you have been following this series diligently, you have seen this slide many times, so I won't repeat everything; let's just recap where we are and where we are going. The goal of the workshop is to get familiar with machine learning models, that is, models that learn from data. In the previous session, we looked at neural networks and linear regression models that perform regression tasks. Today, we will look at classification and at another regression technique called random forests, but the focus is going to be on classification. In the next session, we will talk about dimensionality reduction, a form of unsupervised learning; if you are working with high-dimensional data and want to reduce it to a few dimensions so that you can visualize, understand, and plot it, that session should be useful to you. The last session will cover machine learning in the context of design of experiments, where the goal of the ML algorithm is to propose experiments such that your likelihood of success is high, with that likelihood defined by you. As always, we will continue to use the nanoHUB cyberinfrastructure as the place where we implement all of these algorithms, try out new things, and play around. For the random forest part, we will also use Citrination to get some data for our example.

That's the plan for today. We will briefly recap neural networks, build on our knowledge from last time where we used neural networks for regression, extend that to classification, and then move on to random forests. If you are new, this recap will serve as your neural networks 101, a crash course in how these networks work. The key idea of an artificial neural network is to mimic the network of neurons present in your brain. A neuron, shown here as a circle, is abstracted as a unit that takes in some inputs, performs a computation, and produces an output that it passes along to the next neuron. In this example, we have a neural net that predicts the Young's modulus; this is the example we saw last time. We can draw the network by arranging the neurons into layers. The first layer you come across is the input layer: say the inputs to our model are melting temperature and resistivity, and call them Tm and rho for shorthand. The output layer is the quantity we want to predict, in this case Young's modulus; today we'll look at a different example, so that will change. In between, you can have as many hidden layers as you want, each with as many neurons as you want; here we have one hidden layer with two neurons, a1 and a2. So how does this neural network work? To understand that, we need to understand how the neurons perform their computations. Say we want to compute the value of each neuron. Again, a neuron looks at its inputs, performs a computation, and passes the output along to the next neuron. So what is this computation that the neuron is performing?
Well, it looks at each of its inputs and the weights associated with those inputs and multiplies them, to start with. So you can write a1 = w1 * Tm + w2 * rho + b1, where w1 is the weight on the melting temperature, w2 is the weight on the resistivity (the second input), and b1 is a bias term. We had discussed how stopping here would mean that all your network can do is form a linear combination of its inputs; if you want a highly non-linear function, that wouldn't quite cut it. This is where an additional activation function comes in: its role is to introduce non-linearities into your network. You can do the same thing for a2 — again inputs times weights, add a bias term called b2 here, and apply an activation function, which can be the same or different. And you do the same thing for the output layer — last time that was the Young's modulus — and that is your model prediction. This is the basic structure of a neural network.

We had looked at some of these activation functions; I believe last time we used the ReLU activation function. If you're new to this: activation functions, again, introduce non-linearities. You can have a linear activation function, which is the simplest one because it does nothing — it takes the input and spits out the same output. You can have the hyperbolic tangent, which compresses the input between minus one and one. You can have sigmoids, and the one we used last time, ReLU, or rectified linear unit, whose job is to suppress the input if it is less than zero and to pass it through unchanged if it is positive.

Given the structure of the network, we had also discussed how to train these networks, and that involves specifying two key pieces. One is a metric for how good or bad your network is. Say we are predicting a modulus and the ground-truth value is 100 GPa. If your model predicts 0 GPa, you know it is far off; if it predicts 95 GPa, you know your predictions are decent. You need to encode this information in something called an objective function, or loss function. One choice is mean squared error; there are many others — I believe in the last session we used mean absolute error, and today we'll look at a completely new one. This is how you specify how the model is performing. In addition, given this metric, we need a way for the model to march the weights from randomly initialized values towards values that give decent predictions. The way to do this is backpropagation, and you specify its details in an argument called the optimizer, as we'll see when we get into the code. As we saw last time, the key job of backpropagation is to update the weights and biases iteratively so that, over time, they reach the values they need to make a decent prediction.
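To make the neuron computation concrete, here is a minimal numeric sketch of what a single neuron with a ReLU activation does; the input values, weights, and bias below are made up for illustration and are not taken from any trained model.

```python
import numpy as np

def relu(x):
    # ReLU: return 0 for negative inputs, the input itself otherwise
    return np.maximum(0.0, x)

# Illustrative (made-up) inputs: melting temperature Tm and resistivity rho, already normalized
Tm, rho = 0.4, -0.7

# Illustrative (made-up) weights and bias for neuron a1
w1, w2, b1 = 0.8, -0.3, 0.1

a1 = relu(w1 * Tm + w2 * rho + b1)
print(a1)  # 0.8*0.4 + (-0.3)*(-0.7) + 0.1 = 0.63
```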
So let's jump into classification. Last time we looked at how to predict the Young's modulus of a material given a set of atomic data; the data was a table with lattice constants, melting points, atomic radii, and so on. For classification, we will ask a similar question: can we predict the crystal structure of various elements using the same information — information from the periodic table like atomic mass, or other simple properties like melting points?

To do that, let's think about how we would structure this problem. The inputs to our model are the same — the same atomic information as last time: heat of formation, lattice constants, and so on. The output should now be a crystal structure. For those of you who are familiar, you might have heard of nickel having an FCC, face-centered cubic, crystal structure; iron is BCC, body-centered cubic; magnesium is HCP, hexagonal close-packed; and so on. But you can imagine that we can't quite work with these labels, these strings, because they are a convention that someone came up with. How do we encode them numerically so that our model can handle them and we can define concrete metrics? One way is something called one-hot encoding, and if you're working with a classification task, you're almost always going to use one-hot encoding. What does that mean? In our problem, we have three possible crystal structures: FCC, BCC, or HCP. We will represent each crystal structure as a set of three numbers: FCC becomes 1, 0, 0; BCC becomes 0, 1, 0; and HCP becomes 0, 0, 1. The idea is that each crystal structure label gets transformed into a set of three numbers, and the three types are distinguished by the index where the one is located. If you have a classification problem with 10 labels — maybe you're classifying things into 10 different categories — your one-hot vector would be 10 digits long: 1, 0, 0, 0, 0, 0, and so on. Why are we doing this? One, so that we can numerically encode our labels. And two, as we'll see when we get to the code, so that we can interpret the outputs of our network, in terms of these three numbers, as probabilities. This is useful because it ties in very well with the statistical theory underpinning these neural networks, which is maximum likelihood estimation. If you're familiar with that, you will be able to interpret the outputs as probabilities.
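As a minimal sketch of what one-hot encoding looks like in code (the FCC/BCC/HCP-to-index mapping is just the convention chosen above; any consistent assignment works):

```python
import numpy as np

labels = ["FCC", "BCC", "HCP", "FCC"]        # example crystal-structure labels
classes = {"FCC": 0, "BCC": 1, "HCP": 2}     # label -> index of the "hot" entry

one_hot = np.zeros((len(labels), len(classes)))
for row, label in enumerate(labels):
    one_hot[row, classes[label]] = 1.0

print(one_hot)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]
```

Keras also provides tensorflow.keras.utils.to_categorical for the same conversion when the labels are already integers.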
So let's try and solve this problem. We said our goal is to perform a classification task: we will predict the crystal structure of various elements given this atomic information. To do that, I will ask everyone to go to this tool and launch it. Again, this is the point where I move into the hands-on demonstration, so if you feel you're falling behind, please look at the handouts and work at your own pace, or just follow along with me as I run this code live. Go to nanohub.org/tools/mscml and click on Launch Tool. This is a set of Jupyter Notebook tools. I hope most of you are familiar by now with what Jupyter Notebooks are; if this is the first session you're attending, that's okay — I'll spend a little time explaining them so you can get comfortable. There we go. As the tool loads, you will see a landing page that contains a lot of text. Click on the fourth link, "neural network classification to predict crystal structures". This launches a Jupyter Notebook where we'll actually do the classification. For those of you who are new, Jupyter Notebooks offer a very convenient way to mix code and non-code text in cells. A cell can contain either code or plain text, like the first cell here. All you need to know for this session is how to run the notebook, and to do that, you run each cell by hitting Shift-Enter or pressing the Run button here.

Let's do that. The first thing we do is import libraries. This is very similar to what we've done before: we again get the data from pymatgen and Mendeleev, as we've been doing all along. This is where we get information like the atomic number and atomic volume — all the information that will be passed into the input of our neural network. To create and train our neural networks, we will use the Keras library. Keras is an API built on top of TensorFlow, and it comes from Google; you might have heard of other libraries like PyTorch, which comes from Facebook, I think. So let's go ahead and run this cell. All we're doing right now is importing the libraries we need and specifying the list of queries we will perform against pymatgen and Mendeleev. If you see the star here, that means the cell is running; my cell has completed execution, so you see a number one.

This is the cell where we actually perform the queries and arrange the data in a data frame. This was covered in the previous sessions, but the key point is that data frames give us the look and feel of Excel inside a Python notebook while keeping all the power a Python object typically has, like slicing and indexing. For example, the first row is atomic number 27 — I don't remember offhand which element that is, but there we go — and it has boiling points, electronegativities, atomic radii, all sorts of properties.

Now that we have our data, just like we did for linear regression and neural-network regression, the first thing to do is divide it into training and testing sets. The reason, again, is that these machine-learned models have no prescribed functional form; they learn only from your data. So you want to split your data, train on one piece, and then evaluate how well the model does on data it hasn't seen before. This is very important, because if you don't do it, you run the risk of overfitting: the model will look at all your data, fit it too closely, and then when it sees a new point and tries to make a prediction, there is no guarantee the prediction will be any good. So we divide our data into training and testing sets. And just like before, we normalize our data. That's because the data come in different units; some values are naturally very small — around 10^-7 here — as opposed to boiling points and melting points, which are much larger. To eliminate the effect of units, we normalize. You can do this in several ways. One way is to map your maximum value to plus one and your minimum value to minus one and normalize that way. We will instead do standard-score normalization, which you might have seen in your statistics classes: take the data, subtract the mean, and divide by the standard deviation. This cell over here does that; let's run it. This is the block of code where we actually do the normalization — subtract the mean and divide by the standard deviation — and you'll see that the data ends up roughly between minus one and one, which is the kind of range we are going for here.
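A minimal sketch of the split-then-normalize step described above, assuming the queried data sits in a pandas DataFrame `df` with the label in a column named `structure`; the column name and the 80/20 split are illustrative, not necessarily what the notebook uses.

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["structure"])   # inputs: atomic number, volume, boiling point, ...
y = df["structure"]                  # labels: FCC / BCC / HCP

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standard-score normalization: subtract the mean and divide by the standard deviation.
# Use the training-set statistics for the test set too, so no test information leaks in.
mean, std = X_train.mean(), X_train.std()
X_train_norm = (X_train - mean) / std
X_test_norm = (X_test - mean) / std
```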
So let's go ahead and create the model. For those of you who attended the last session, this will look very similar; if you didn't, that's fine. The key outline, which we always follow, has three steps. The first step is to declare a model object. In this case, the model object is created as an instance of the Sequential class that we imported from the Keras API. Why do we do this? So that this model object can easily be used to add layers and neurons as we like, without having to make the connections ourselves by hand. So this is where we define the model object.

The next step is to define the model. Given a model object, we need to start adding layers, because this is a neural network: every neural network consists of neurons arranged in layers. The first thing we need to do is specify a layer, and to do that we use the model.add command. This adds a layer to our model, and we will add a Dense layer. A Dense layer is defined such that all the neurons you specify in it are automatically connected to all the neurons in the previous layer; by specifying Dense, Keras takes care of all the connections behind the scenes for you. Let's say I have 16 neurons in this layer — the number 16 there is the number of neurons, and if you wanted 200 neurons, you would change that number right there. The next thing you specify is the activation function, the non-linearity, using the keyword activation. We are going to use ReLU, but you can use any other activation if you want. And that's all you need to define one layer. If you wanted to define a network of 15 layers for your exciting research, you would repeat this command over and over, or write it in a loop, depending on what you're comfortable with.

There are a few other things we need to know before we go on — three, in this case. First, when we specify the first hidden layer, which is this one here, we need to tell Keras how many inputs our model has so that it can make the connections correctly. The way to do that is the input_shape keyword: it tells Keras how many inputs the model has, so that when we say this first hidden layer has 16 neurons, Keras can connect everything correctly. In this case, I think we have 16 inputs to the model, so the value of input_shape would be 16; we have written it so that it also works for any other dataset. We don't need this for any additional hidden layer. As you see here, there's a commented-out line, which is the second hidden layer; you can uncomment it just by deleting the hash if you want to train a network with two hidden layers, but I'm going to stick with one. We don't need to specify an input shape there, because it is the second hidden layer, which by definition is connected to the first layer we specified here; Keras already knows that layer has 16 neurons, so it has everything it needs to make the connections. The next thing is the specification of weights and biases. We said we would randomly initialize them and then march them towards the values they should take. The way to do that is the kernel_initializer keyword.
By default, the weights are randomly initialized, but if you want to control this for reproducibility with seeds, or you want to specify a particular initialization scheme, this is the place to do it. The last thing is the output layer, and I know this is the output layer for two reasons: one, there's a comment here that says output layer, and two, it is a Dense layer with three neurons. We said we would one-hot encode our labels such that the FCC crystal structure, for example, is represented as 1, 0, 0. That means that for each set of inputs our model receives, it is going to predict a set of three numbers — that is the output layer. There's one more thing here: an activation you might not have seen before, called the softmax activation function.

Let's understand this a little better. The whole reason for doing one-hot encoding and producing a set of three numbers is so that we can interpret the outputs as probabilities. But suppose your activation function were linear; your model outputs could very well be 10, 20, and 100, and those are certainly not probabilities. How do you make sure you can interpret these numbers as probabilities? You use the softmax activation function, and I'll explain it through an example rather than write out the equation. The goal of softmax is to take this set of three numbers and convert it into a set of probabilities; after conversion, it might look like 0.2, 0.3, and 0.5. How does it do that? There is an equation for it, and stated loosely, the point is to take the exponential of each output and divide by the sum of the exponentials of all the outputs. So if your outputs were 10, 20, and 100, the first probability would be e^10 divided by (e^10 + e^20 + e^100). The math isn't essential here; all you need to know is that the role of softmax is to take the raw outputs, which could be anything depending on your weights and biases, and convert them into a set of three numbers that can be interpreted as probabilities.

Again, we need to specify a loss function. Previously we used mean squared error and mean absolute error; this time we will use categorical cross-entropy. It's another equation, and we won't focus on the math, but the job of this loss function is to go to zero as our predictions become correct. We will measure those correct predictions with this loss function and with a metric called accuracy, which should make intuitive sense: if you're classifying, you want to know what percentage of examples you are classifying correctly, and that is exactly what accuracy measures. The details of how the weights and biases are updated are specified in the optimizer argument; we are using the RMSprop optimizer. So let's run this cell. The model.summary command gives you a nice window into what the model actually looks like, so it's useful as a debugging tool. Now let's actually train the model, using the model.fit command. This is, again, part of our outline: declare a model object, define the model, and use model.fit to train it. So let's run this cell. You can see the counter going; it's going to take a little while.
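While training runs, here is a minimal consolidated sketch of the cells described above, assuming the tensorflow.keras import path. The layer sizes, the glorot_uniform initializer, the epoch count, and the variable names X_train_norm and y_train_onehot (the normalized inputs and one-hot labels) are placeholders for whatever the notebook actually uses; the small softmax function at the top just illustrates numerically what the output activation does.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# What softmax does to three raw outputs: exponentiate and normalize so they sum to one.
def softmax(z):
    e = np.exp(z - np.max(z))               # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))   # ~[0.09, 0.24, 0.67] -- interpretable as probabilities

n_inputs = X_train_norm.shape[1]            # number of input features

model = Sequential()                                        # 1. declare the model object
model.add(Dense(16, activation="relu",                      # 2. first hidden layer: 16 neurons
                input_shape=(n_inputs,),
                kernel_initializer="glorot_uniform"))       #    how the weights are initialized
# model.add(Dense(16, activation="relu"))                   #    optional second hidden layer
model.add(Dense(3, activation="softmax"))                   #    output layer: one neuron per class

model.compile(loss="categorical_crossentropy",              # loss for one-hot classification
              optimizer="rmsprop",                          # RMSprop handles the weight updates
              metrics=["accuracy"])                         # fraction classified correctly
model.summary()

history = model.fit(X_train_norm, y_train_onehot,           # 3. train the model
                    epochs=200, verbose=0)
```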
We know that once we train the model, we can evaluate it using the model.evaluate command, and the model.predict command can be used to make predictions. That is exactly what we did last time, and we will do the same here, except that we are now doing classification instead of regression. And because I have run this cell before — there you go, it's done.

The plot we are making here is the accuracy, which, again, is the fraction of guesses you are getting correct; an accuracy of one means you are getting everything correct. So that's great — our model is doing amazingly well. Let's print out these numbers using model.evaluate. The training accuracy is one, which means every training example is classified correctly. And the test accuracy is what it is — that's okay.

You can use model.predict to make a prediction, but there's one step I want to focus on here. The prediction from the model is a set of three numbers, like 0.2, 0.3, and 0.5. How do I convert this into a label like FCC, BCC, or HCP? This is where you again use the interpretation of these numbers as probabilities. Treating them as probabilities, you look at the three numbers and say: the model thinks this one, 0.5, is the most likely, so it thinks the material is 50% likely to have that crystal structure. So you look at the maximum. The maximum is 0.5, which sits at index 2 — computers start counting from 0, so that's 0, 1, 2 — and index 2, based on our convention of 0, 0, 1, is HCP, which means the neural network predicts an HCP crystal structure. That is how you decode this set of three numbers into a crystal structure: look at the maximum value, which is the structure the network thinks is most likely, look at the index of that location, and read off the label. Another quick example: if the model's prediction was 0.3, 0.4, 0.3, you do the same thing — the maximum is 0.4, at index 1, which is BCC, because BCC was 0, 1, 0. This is where the one-hot encoding ties back in.

Let's run this cell. What we've done here, just to get familiarized, is display the data frame we organized these results into, and you can see that for atomic number 27, the true crystal structure is HCP and the predicted crystal structure is also HCP. That's good to know. We can also plot this data — the plotting is in Plotly; you can look at the previous sessions for details. So let's focus on some of the results. Again, the three numbers can be interpreted as probabilities. Take nickel: the network thinks nickel is nearly 81% likely to be FCC. That's good — we know nickel is FCC, so the network is doing a decent job there. For something like iridium, it's confused: maybe 58% FCC, 40% HCP; it's not quite sure, but that's the output of the network. This is one way to plot your results, but I hope the classification task is clear. It's the same outline as always: declare a model object, define the model, use model.fit to train it, and use model.predict or model.evaluate to make predictions and evaluate them.
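A minimal sketch of the decoding step just described — picking the index of the largest probability and mapping it back to a label. The class_names list assumes the FCC/BCC/HCP ordering chosen earlier, and X_test_norm is a placeholder name for the normalized test inputs.

```python
import numpy as np

class_names = ["FCC", "BCC", "HCP"]          # same order as the one-hot encoding

probs = np.array([0.2, 0.3, 0.5])            # example softmax output for one element
predicted_label = class_names[np.argmax(probs)]
print(predicted_label)                        # "HCP": index 2 has the largest probability

# For a whole batch of predictions from the trained model:
# pred_labels = [class_names[i] for i in np.argmax(model.predict(X_test_norm), axis=1)]
```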
I hope that's clear, and if so, we can jump to the next part of our tutorial, which is random forests. One reason for looking at them is that neural networks can be very data hungry, and in some fields of science — and even in finance, which has seen a lot of use of random forests — a random forest can be the better model, or can offer more interpretability and more insight. Random forests are also easier to train and don't require as much computing power, so they can be a good starting point before you move on to neural networks.

So let's look at an example. A forest consists of trees, and a random forest consists of decision trees. To understand decision trees, consider this example: say I want to predict the ionic conductivity based on the heat of formation, and my data looks like the plot on the bottom left. If the heat of formation is less than 300, I have these orange points; if it's less than 340, these purple points; then green, yellow, and so on. What is the job of a decision tree? It is to learn where to make these splits so that it can correctly predict the value for all the data points. Again, it's just like any other optimization: you're trying to minimize the error. So the decision tree will learn, say, to split on the heat of formation being less than or greater than 300. If it's less than 300, I have two orange points here; I take their average, and that is my prediction. If it's more than 300 but less than 340, I take the average of the purple points and predict 6 times 10^-5, and so on. The job of the decision tree is to learn where to make these splits; it will try a split at, say, 500, compute the errors it makes, and decide whether that's a good split or not. But once it has made a split, it stops there and moves on to the next level. There is no notion of weights and biases here; there are splits you make along the tree, forming branches, and once you make a branch, you keep going further and further down it.

If I had more than one variable to split on — maybe my conductivity depends on the heat of formation and the lattice constant — you'd imagine there would be more splits: if the heat of formation is less than 300 and the lattice constant is less than four, it's one value; if not, it's another value. So there can be many more splits, depending on how many features your conductivity depends on. That is a basic decision tree. It's very intuitive, and that's why people consider it interpretable. But a decision tree can very quickly overfit. How? Say I have what looks like seven data points here and I have made three splits, and for heat of formation less than 300 I predict the average of these two points. There's nothing stopping me from making one extra split: if it's less than 300 and also less than, say, 290, predict this point; if it's more than 290 but less than 300, predict the other orange point. With that split, the tree is by definition going to get every data point exactly right, because it has split so much that down each branch of the tree there is only one point left to predict. It's very easy to overfit a decision tree.
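Here is a minimal sketch of that behavior with scikit-learn on made-up (heat of formation, conductivity) numbers — not the notebook's data: a depth-limited tree makes a few splits and averages within each leaf, while an unrestricted tree keeps splitting until it memorizes every training point.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data: heat of formation (x) vs. ionic conductivity (y)
X = np.array([[250.], [290.], [320.], [335.], [360.], [410.], [450.]])
y = np.array([1e-4, 1.2e-4, 6e-5, 6.5e-5, 2e-5, 1e-5, 8e-6])

shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)   # a few splits; each leaf predicts the average of its points
deep = DecisionTreeRegressor().fit(X, y)                 # unrestricted depth: one training point per leaf

# The deep tree reproduces the training data exactly (zero training error = overfitting risk)
print(shallow.predict([[290.]]), deep.predict([[290.]]))
```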
So how do you counteract this, so that you're not overfitting? This is where random forests come in. A forest, again, is a collection of trees. Let's say our conductivity depends on a wide variety of features, but we saw how, if it depended only on the heat of formation, we could make a few splits and get reasonably correct predictions. While training the random forest, we arrange a bunch of these trees — maybe 500, maybe 1,000 — and show each tree a subset of the training data, or a subset of the features. Maybe tree one sees the first 100 examples and the first 15 features, tree two sees the next 10 features and the next 100 examples, and so on. That way, each tree can overfit as much as it wants; it may make a poor prediction for something it hasn't seen before, but some other tree will have had that point in its training set and will make a good prediction. So when we take the average of the predictions of all the trees, we get a decent value for the conductivity. Again, there's a lot of detail I'm glossing over, so you might want to look up the terms bagging and bootstrapping, which will help you understand this in more detail. But the point is to divide the training data and the features among the trees randomly.

So let's look at how this works in practice. For that, I will go to another tool. You can go to nanohub.org/tools/citrinetools and hit the Launch Tool button. I've already done that, and the reason is that once you launch the tool, you will see a landing page that looks like this. Let's wait a second so that everyone can catch up. Okay. On this landing page you will see a box where you need to put in your Citrination API key. I have already done that, because I do not want to share my API key with everyone — and I would encourage you not to share yours either. Once you put in your Citrination key and hit Enter, you will see a success message like this. Once you see that, click on the second link, machine-learning-guided design of ceramic oxides for batteries. I'll wait a moment while everyone gets there; you should see a notebook like this. Okay. I hope everyone is here — if not, you'll catch up in a second or two; it's not very complicated.

The goal of this notebook, in this session, is to demonstrate random forests. We'll do that by motivating the problem of predicting ionic conductivities for a particular class of battery materials called garnets. Again, we have some markdown cells; hit Shift-Enter through those. We have some libraries. This time we're not getting our data from pymatgen and Mendeleev; we're getting it from Citrination, so this is the library we import to get the data. Another library we will use is matminer, which allows us to turn raw compositions of materials into numeric values that we can use to train a model — we use matminer to make the features, the inputs for our model, and you'll see that in a second. Let's go ahead and run that. We will use scikit-learn to build the random forest, and that comes in below. I think my cell has run. The next thing we'll do is get the data from Citrine; some of you may have seen this in the second session, where we looked at how to get data from Materials Project, Citrine, and other databases. There we go — the data looks like this.
We have a composition for each material — something like lithium 7.5, lanthanum, zirconium, oxygen, and so on, with their proportions — and the ionic conductivity for each of them. But again, we can't work with these strings directly; we can't just feed them into a neural network or a random forest, which wouldn't know what to do with them. We have to convert them into numerical features. To do that, we first clean the data a little. This dataset happens to contain a mix of measurements, including duplicates and values measured at high and low temperatures, so just trust me that this cell does the cleaning for us, and jump along to section 2. Keep hitting Shift-Enter on each cell until you reach section 2, where we obtain the features from matminer. If this looks like a lot of code, don't worry; session 6 will cover it in more detail. All we need to understand is that this command, called MultipleFeaturizer, takes in the raw composition and spits out a whole bunch of numbers we can use for any machine learning model.

What are those numbers? Let's look at an example. Let's run this cell: we have generated a data frame called x_df. I'm going to add a code cell here with the plus symbol and just display the data frame. So we've gone from compositions and ionic conductivities to a data frame with a lot of numbers. Let's look at a few of them quickly. The first number, 5.0, is the number of elements present in that material; some materials have five elements, some have six. The next five numbers are weighted norms of the composition — they tell you, in an encoded way, how much of element one is present, how much of element two, and so on. The next few numbers are weighted averages of various atomic properties. If we had iron as the input, there would be only one atomic mass and one atomic number; but if you have five elements, what do you do with the five atomic masses? You take a weighted average based on the composition, and that is what these subsequent columns contain. There are many other numbers, but we don't need to go into the details; we've used a featurizer library to do this task for us.
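For reference, here is a minimal sketch of composition featurization with matminer; the specific featurizers chosen here (Stoichiometry and the Magpie ElementProperty preset) are an assumption — the notebook's MultipleFeaturizer may combine a different set — and the example composition is only illustrative.

```python
from pymatgen.core.composition import Composition
from matminer.featurizers.base import MultipleFeaturizer
from matminer.featurizers.composition import Stoichiometry, ElementProperty

featurizer = MultipleFeaturizer([
    Stoichiometry(),                        # number of elements, p-norms of the composition
    ElementProperty.from_preset("magpie"),  # weighted statistics of elemental properties
])

comp = Composition("Li7La3Zr2O12")          # an example garnet-like composition
features = featurizer.featurize(comp)       # a long list of numbers for the ML model
names = featurizer.feature_labels()         # the corresponding column names
print(len(features), names[:3])
```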
We will again divide the data into training and testing sets, but the first thing the notebook does here is train a neural network, which we've already done many times, so let's skip straight to the random forest part, section 4.2. By all means, if you feel like running the neural network, please do, but I will focus on the random forest in this session. Because of the way this code is structured, I will ask everyone to copy and paste one line of code — the line I have highlighted, called layout0. It just helps with the plotting; the model will work fine without it, but we need it to plot the data correctly. Copy that layout0 line — I've already done it, but I can repeat it — and paste it into the first cell you see under the random forest section. With that, we should be good to go. If you did run through the neural network piece, you don't need to do any of this; I'm just trying to save some time. What we do next is divide our data into training and testing sets. We did this before using the slicing operations here — you might recognize these as slices. If you're not familiar with slicing, please look it up in your own time; it's a fairly simple concept, but very useful.

Having divided our data into training and testing sets, we can again follow our wonderful outline. We declare the model object: last time it was an instance of the Sequential class, and now it's an instance of the RandomForestRegressor class, which comes from scikit-learn. There are two inputs we have passed here. The first, 2000, is the number of trees — again, a forest consists of trees — and n_estimators is how you specify how many trees you want. The second is the random state, which is used for reproducibility, so that you can run this again and get the same results. That's all you need to declare the model object; there are many other inputs I've swept under the rug, and we can discuss them if you have questions, but that is the minimum you need to define a random forest. The next step is to train the model, again using the model.fit command, which receives the training data: the train values and the train labels. The train labels are the conductivities, and the train values are the big list of numbers we generated with the featurizer library. After declaring the model object and training the model, we again use model.predict to make predictions, and then we make a lovely plot. We'll wait for that to run.
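A minimal sketch of those random-forest steps with scikit-learn; the variable names stand in for the featurized training and testing arrays in the notebook, and the random_state value is just an example.

```python
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=2000,   # 2000 decision trees in the forest
                              random_state=0)      # fixed seed for reproducibility

model.fit(X_train, y_train)          # train values = featurized compositions, labels = conductivities
y_pred = model.predict(X_test)       # predictions to compare against the experimental values
```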
You might have noticed there are no printouts for epochs and no weights and biases; that's because random forests work fundamentally differently, and that is a detail for another time — we can address it in the questions if it comes up. You might also have noticed that the random forest took less time to train. The plot we end up with looks like this: the predicted conductivity from the random forest on the y-axis, and the experimental conductivity from the Citrination database on the x-axis. The black line is the y = x line, and you can see that our predictions are pretty good. There is some deviation towards the high-conductivity region, but for the most part, for both training and testing data, the predictions are pretty good.

And that is it — this is a starting point for random forests. If you want better models or a more detailed understanding of how they work, I encourage you to ask questions or look at the documentation, and to play around with some of these numbers: use 500 trees instead of 2,000, for example. Feel free to try it out; we have all the libraries installed and all the code here. I think that is all I had to say. We've seen this plot, and we've seen that the random forest makes decent predictions for training and testing data. I hope this tutorial was useful to you, both on its own and in the larger context of getting familiar with machine-learned models. With that, I will stop and invite any questions or comments. Thanks.

Thank you, Siketh, for that enlightening presentation. At this time, I invite everyone to unmute yourselves and thank Siketh by applauding, if possible. With that, we will now start the Q&A session. If you have any questions, you can either speak up by unmuting yourselves or post them in the chat, and we'll address them as they come up. Does anyone have any questions at this time? Were there any questions in the chat? There weren't really any questions — none that were too difficult to solve, anyway.

So, one question that came up: is the definition of face-centered cubic, FCC, as 1, 0, 0 arbitrary? Yes, it is arbitrary — you can make BCC 1, 0, 0 if you want; that is completely up to you, as long as you are consistent. This is just for the purpose of one-hot encoding: you need to make sure that in each encoded vector, whether you write it as a row or a column, exactly one entry is flipped from zero to one, and the position of that one indicates whether the structure is FCC, BCC, or HCP. That's the purpose of having one index of the vector be non-zero. Are there any other questions?

Okay, two questions. When I read about machine learning, there's a term called descriptors, which we have not discussed here. Can you address that, in this context or otherwise? My understanding is that descriptors are the same as features — the input features for your model. If I go back to the neural network task, I would say the descriptors for my model are the input features: the atomic number, volume, boiling point, and so on, which we got by querying pymatgen and the Mendeleev database. For the random forest case, we used the matminer library, which gave us this huge list of features that you can also call descriptors — as you can see, the function is called MultipleFeaturizer, and that is what generates the features, or descriptors, for us. Yeah, I think that answers the question; descriptors and features are more or less interchangeable. In this problem setup the descriptors and features are fairly straightforward — things like the inputs for the ionic conductivity — but when you map them to more machine-learnable features or descriptors, you can start losing interpretability, though we're able to recover it at the end. So they're more or less interchangeable.

Another question: does the random forest method have a higher accuracy than the Keras model? Maybe you can clarify what you mean by that. In this case, I shouldn't compare the random forest and the neural network, because we used them for two different tasks: the random forest made a prediction for conductivity, a regression, and the neural network did a classification. So it's not quite fair to compare them, because the metrics aren't the same. But this notebook does have a neural network just above, which I glossed over, so if you want to test how the random forest compares to a neural network, feel free to run the neural network cells and make your own comparison — there the metrics will be the same, because it's the same problem statement, and you would measure, say, mean absolute error or mean squared error. Feel free to do that and see what you get, and feel free to change something and see if you can make one of them better or worse; I'm sure that if you add three extra layers, the neural network will start to do better. Have a go at it.

Thanks for that answer. Another question: in the regression part, do the random forests provide estimates of the accuracy of the prediction, similar to Gaussian processes? Sorry, you'll have to repeat the question once more. In the regression part, do the random forests provide estimates of the accuracy of the prediction, kind of like Gaussian processes? Yes, they do. If I go back here, you will notice that there is no command called model.evaluate to do exactly that — to evaluate how good your model was, give me an error, how good or bad is this model?
The reason is that the scikit-learn implementation doesn't provide a built-in model.evaluate. The way you do it — and this isn't the exact syntax, but I'll point you in the right direction — is that you can import from scikit-learn things like the R2 score and other metrics such as mean squared error and mean absolute error, and then use model.predict to make predictions. Say we call model.predict and store the predictions in a variable; you can then call the r2_score function on your predictions and pass it the ground-truth data, and that gives you the score. The errors you might want, mean absolute error or mean squared error, are obtained the same way. So you have to do it outside of a model.evaluate, because model.evaluate doesn't exist here, to my knowledge — but you can certainly get an error.
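A minimal sketch of that evaluation pattern (not the notebook's exact code; variable names are placeholders for the trained random forest and the held-out test data):

```python
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

y_pred = model.predict(X_test)                  # predictions from the trained random forest

print("R2 :", r2_score(y_test, y_pred))         # coefficient of determination
print("MAE:", mean_absolute_error(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
```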
Sounds good. Okay, another question, about the probabilities — which I think refers to the softmax: why are we using the exponential instead of just a simple fraction? Because the exponentials themselves grow quickly once the values get large — exponentials increase exponentially. So why do that instead of plain fractions? I'm going to answer this somewhat loosely, because a precise answer means going into more math than I can cover on a call like this, but the point of using the exponentials is that the categorical cross-entropy loss function we use has logarithmic terms in it. When you use the exponential, you compute a probability, and yes, the exponentials can blow up, but when you then apply the cross-entropy loss, which contains a logarithm, you again get a loss that is manageable, so the model can train on it. I don't see a fundamental problem with using fractions, because the point is just to represent the outputs as probabilities, but I think the motivation for the exponentials is that the cross-entropy function, with its logarithm, can handle them properly. If anyone has more insight, please jump in, but this is my understanding of the situation. Yeah, it seems like it is intentional because of the cross-entropy that we use afterwards.

Okay, next question: for backpropagation, are we storing all the training steps, and how can you calculate the partial derivatives — essentially, how does that actually work inside the neural network? You can look at the evolution of the weights over time, and I know in libraries like PyTorch you can look at the gradients too; I'm sure you can do it in Keras as well, I just don't remember the command off the top of my head. What you would do — and I'm pointing here at the neural network piece, where you define the model — is use the keyword called callbacks in the model.fit command. This is where you can add an extra callback that monitors the weights over time; you can define it as a class, say one that prints the weights, with functions that print the weights for you. This callback is evaluated at the end of every epoch, so if you want to see how the weights evolve over time, this is where you would code it. As far as gradients go, I believe there is a way to do it in Keras, but I haven't done it recently, so I don't remember the syntax off the top of my head; there is definitely a way to do it in PyTorch, where you can look at the gradients and see whether they're exploding. Sometimes that's a useful debugging tool, but if you have millions of weights and millions of gradients, it becomes hard to spot the odd one out.

Sounds good. The most recent question: the features we were talking about earlier, the descriptors — do they have to be independent? Can this approach actually tease out dependencies, or independencies, between them at the end? That is an excellent question, and it ties into many other things, so let me answer it at an overview level; other sessions will do it more justice. As you'd imagine, if one feature is, say, x and another feature is 2x, the network doesn't really gain anything by having both x and 2x as inputs, because they're really the same thing, just scaled differently. So you'd imagine there is a minimal set of features you can get away with. If you're working with a library that spits out a hundred features, how do you know which of them are important? You can use techniques — let me just write them down here — called Pearson and Spearman correlation coefficients. Essentially, they measure the correlation between each feature and your output, and the correlations between features. The crude way to do this is to simply plot feature 1 versus feature 2: if you get a straight line, or something very strongly correlated, you probably don't need both. The more formal way is to use Pearson correlation coefficients or similar measures to rule out which features you need and which you don't — a short sketch of this follows below. Another way, which another session will get into, is to use unsupervised learning: again, you have a hundred features and maybe not all of them are necessary, so how do you reduce them to a minimal set of, say, 10 or 12? You can use techniques like PCA or non-negative matrix factorization; there are many ways to reduce the feature set to something more manageable, and that is a task for unsupervised learning. I won't get into the details, because the next session will take care of that, but that is the direction to look in if you're interested. Yeah, that will be covered in the next session.
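As a minimal sketch of that correlation-based screening with pandas — X_df is a placeholder for the featurized data frame and y for the target values, and the 0.95 threshold is only illustrative:

```python
import pandas as pd

corr_between_features = X_df.corr(method="pearson")    # feature-feature correlation matrix
# method="spearman" gives rank correlations instead

# Correlation of each feature with the target (e.g. ionic conductivity)
corr_with_target = X_df.corrwith(pd.Series(y, index=X_df.index))

# Flag feature pairs that are almost perfectly correlated; one of each pair is likely redundant
redundant_pairs = corr_between_features.abs() > 0.95
```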
Another question: for multiple descriptors, does the order of splitting — the order in which the decision tree splits the data — actually affect the output? Does order matter? Order in what sense? The order in which you specify which is output 1 and which is output 2? That will definitely not matter. As in: say you have Young's modulus, then the conductivity, then something else; first the decision tree splits on Young's modulus or the melting temperature, and once it has done that, it splits further. If, instead of looking at Young's modulus first, I look at the conductivity first, does that affect the end result? To be honest, I haven't trained a random forest with multiple outputs myself, so I can't comment from personal experience — maybe someone else can jump in if they know more — but to my understanding, the order of the splitting should not matter, because you have an ensemble of trees: one tree might split on conductivity, another on Young's modulus. And if each tree sees a subset of the features — tree 1 sees the first 5 features, tree 2 sees the next 5 — then which features tree 1 sees versus tree 2 doesn't matter, because each tree does its own thing and you take an average at the end. As far as multiple outputs go, I believe the same holds, but I haven't done it myself, so I wouldn't vouch for how it works in practice. Yeah, I'll stick my neck out and say it's very similar to the law of large numbers: it depends on the number of decision trees you have. With a small number of trees you'll probably see a big impact, but as you go to a larger and larger number of trees, you stop seeing individual trees and start seeing the forest, so at the end of the day it won't really matter if you have a large number of trees. That's just my opinion. Yep, go ahead.

You talked about having a large number of trees and then taking the average — how do you decide how much weight to assign to each, and is there a recipe or standard practice for doing that? First of all, you have many decision trees, not many forests — you still have only one forest, but many decision trees. One way to combine them is to take the average; you can define other schemes, but typically people just take the average. It depends on how correlated your trees are, and the hope is that if you have defined 2,000 trees, each tree is fairly uncorrelated with the others, so a simple average is just fine. That is the general practice. If you're doing a classification task, it's also common to use a voting procedure: for classification there's no average to take — each tree predicts cat or dog or whatever — so you do a voting process, look at which output was predicted by the most trees, and take that as your prediction, the maximum count among the outputs. That's for the classification case. Thanks.

Next question: do you have any suggestions for how to treat noise in the inputs or the outputs? That's a good question. There are ways to handle noisy data — a few techniques. One is up-sampling or down-sampling. The simplest thing you can do, if you have one noisy measurement in a set, is to just throw away that noisy point, because you know it's noisy; that is, in some sense, called down-sampling. You can also up-sample: if you know the data point is valid and there's simply noise in the measurement, you can add more data around that noisy measurement, either by doing the actual experiment, by running an actual simulation, or by making predictions from your model and using them as additional synthetic data for training. There are other techniques that people use for noisy datasets — there are many different approaches — but
I'll just stop there. I think the most common approaches are down-sampling the data, that is, removing noisy points, and up-sampling, that is, using some sort of surrogate model to generate more data around the noisy point — and if doing the actual experiment is too time-consuming, you can use your model to generate synthetic data around that noisy data point. Those are the two common procedures; if anyone else wants to jump in with more on this, that's fine by me. So, are there any other questions? Well, if not, then please join me in thanking Siketh one more time for this Q&A session and for the presentation overall.