www.nanohub.org Online simulation and more for nanotechnology. Welcome, everybody, to the sixth workshop in our hands-on data science and machine learning seminars. Today I'd like to introduce Professor Alejandro Strachan, who will take over the first part of the lecture to introduce us to sequential learning and the idea of design of experiments. Then we'll follow up with Juan Carlos, who will walk us through a hands-on example working through each of these problems. Thanks, Zach. I hope you can hear me okay. So Juan Carlos and I will tag-team to deliver this last installment, at least the last of the first series of hands-on sessions on data science and machine learning, and we're going to talk about sequential learning and the design of experiments. The entire workshop has been dealing with data science and machine learning, which is learning from data, and we know that the first thing you need to learn from data is data. Over the last decade or so there has been a big effort in cyberinfrastructure development. We're all using nanoHUB, which delivers online simulations, and we're getting data from a bunch of repositories that have been designed specifically to be queryable: data is easy to find and easy to mine. Zach led a couple of sessions on APIs and interacting with these repositories. Once we have data, we learned from the lectures that Saaketh Desai gave on machine learning that we can develop predictive models out of that data, and we can do classification, which is also very important in engineering and science applications. And Michael Sakano showed us how we can use data science tools for dimensionality reduction. What we're going to do today is talk about the design of experiments. Often in science or engineering applications, we're doing experiments or running physics-based simulations with a design goal in hand.
Maybe we're trying to discover a material with new properties, or we're trying to optimize a process to minimize the time to produce something. In all of these cases, what we have is a multi-objective optimization, and often the number of knobs that I can turn is very, very large. So a brute-force exploration of the space to find optimal parameters is very inefficient, and traditionally we rely on intuition to figure out how to move these knobs. Today we're going to discuss a technique in which we put together what we've learned in machine learning to try to solve this problem. I'm going to give you a little overview of sequential learning for design of experiments, and then Juan Carlos is going to walk you through, hands-on, just like every other session of our tutorial, step by step, how to actually get this done. And after we're done, if you have your own problem in your own research that you're working on, we'll be happy to discuss how these tools can be tuned to fit your needs. So let's talk a little bit about the overall idea of sequential learning for design of experiments. The goal is: can we find this optimal material or optimal process with the fewest iterations, the fewest experiments, or the fewest simulations? Given that this is machine learning and data science, I need to start with whatever I know, so I always start with some existing data. And what I'm going to do is use the tools that we learned, things like random forests, or I could use neural networks, to develop predictive models for this data. By the way, this existing data may come from different sources, and then you have to aggregate information from those sources, and you have to do this with quantified uncertainties.
But once you have that model, you can ask yourself: out of all the next experiments that I could run, which one do I expect to give me the most bang for the buck? That is, for the next experiment, which is the one from which I'll learn the most? We're going to see that there's this idea of an information acquisition function: depending on your goal, there are different ways to select the next experiment that is expected to provide the most information. Once you decide what the next experiment is, you go and query that information source. That could be an experiment, or it could be a simulation in nanoHUB, for example. Whatever it is, you go and actually do the experiment that you decided was the most effective one. Then you ask yourself: am I done? Have I met my criterion? If you're done, awesome, you go and have a drink. If you're not done, you add your new data to your existing data and you start all over. But now you're going to start with more data, so your models are going to be better and the guess of the next experiment will be more accurate. As you iterate, you get more data, you get better models with more predictive power, and you're moving faster towards your design goal. We're going to show you with an example that you can dramatically reduce the number of experiments you need to conduct to achieve your goal by using this technique. We're going to do this with an example from our own research; it's an example that Juan Carlos has actually been working on, and the data and all the tools are available in nanoHUB in a tool that you're going to run in a minute. Our goal for this particular project is to find solid-state lithium-ion conductors that we could use in batteries to replace the current liquid or gel-type materials.
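The iterate-model-query loop described above can be sketched in a few lines of Python. This is a toy sketch with made-up descriptors and conductivities standing in for the real oxide data, not the workshop's actual notebook code; the "maximum expected improvement" rule (reveal the candidate with the highest predicted value) is used as the acquisition function.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy stand-in for 100 hidden materials: 3 made-up descriptors per material.
X = rng.uniform(0.0, 1.0, size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(0.0, 0.05, size=100)
best = int(np.argmax(y))

# "Reveal" 10 points at random, making sure the top performer is excluded.
revealed = [i for i in rng.choice(100, size=11, replace=False) if i != best][:10]

for _ in range(50):                       # at most 50 simulated experiments
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[revealed], y[revealed])   # retrain on everything revealed so far
    hidden = [i for i in range(100) if i not in revealed]
    preds = model.predict(X[hidden])
    nxt = hidden[int(np.argmax(preds))]   # MEI: highest predicted value wins
    revealed.append(nxt)                  # "run" that experiment
    if nxt == best:                       # stopping criterion: top performer found
        break
```

Each pass through the loop adds one data point, so the model in the next pass is trained on strictly more data, which is exactly why the guesses improve as the iteration proceeds.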
We're going to explore a set of oxides called garnets that show promising lithium-ion conductivities. So our design goal is to find the oxide that maximizes the lithium-ion conductivity. We talked about the need for data and the repositories and data resources that are available. When we started this project with Juan Carlos, that data was not available for these oxides, so Juan Carlos and other members of our team spent literally months combing through the literature and getting the data into a repository. This data, from every single experiment that has ever been conducted on one of these oxides for lithium-ion conductivity, is now available in Citrination. When we get to the hands-on part of the lecture, we're going to download this data and do a little exercise. Now, to demonstrate sequential learning without actually running more experiments, because obviously we won't be able to do that in a demo in nanoHUB within an hour, we're going to imagine the following. After combing through the literature, we found 100 oxides for which we know the lithium-ion conductivity, and we're going to demonstrate sequential learning this way: I'm going to hide all 100 data points, and I'm going to start by selecting 10 data points at random. In the little graph that I'm showing on the left, what you see is every single one of those 100 data points for lithium-ion conductivity, and you can see that there's a maximum, right above there at 17.5; that's the best known conductor. So our problem is going to be the following: we're going to start by revealing only 10 data points chosen at random, making sure that the top performer is not there, and I'm going to start my sequential learning there. I'm going to develop a model for these 10 oxides.
And I'm going to use this model to predict what I would expect the ionic conductivity to be for the 90 oxides that have not been revealed. For those 90, I'm going to predict the lithium-ion conductivity with uncertainties, and based on that prediction of the model, I'm going to decide which one to reveal next. That would be equivalent to going to the lab and synthesizing and measuring that specific composition; instead of going to the lab, what I'm going to do is just reveal that specific data point. Now, let's say that my predictions look like what you see in the picture there, with uncertainties, and each one of these dots is a possible next experiment. Given this information, which one of those should we go after? This is where the information acquisition function comes into play. Should we go for the maximum expected improvement, the one where the mean prediction is the highest? Should we go with the maximum likelihood of improvement, where the mean plus the uncertainty tells me which one could potentially be the winner? Or should I go to the one with the maximum uncertainty, because that's the place where I know the least? What we're going to do in the hands-on session is compare all of these different information acquisition functions and learn which one gets me to the highest performer, the one at the top, fastest. Once I decide which information source to reveal next, I'm going to color whatever data point I selected, turning it from gray to black, and we're going to iterate and show you how fast you can get to the highest performance with the fewest experiments. So with that, I'm going to hand it over to Juan Carlos, who's going to walk us through the tool. We're going to get our hands dirty, as we always do.
We're going to go step by step through the tool and show you all the steps, from getting the data to doing the sequential learning. So, Juan Carlos, take it away. Hi, everyone. As we move into the hands-on activities, I'll ask everyone to follow me to this link. You have it here in the presentation and in your handout too, and I'm going to go there as well. Once you're in the tool, you're going to end up on the landing page. The landing page includes some information about the tool, particularly how we're conducting this test of sequential learning and how to run these tools. The first thing that we're going to ask from you is to enter your Citrination API key in this box: just paste it and press Enter. Once you've done that, you should hopefully get a success message for your API key, and then we can head down to the second notebook of the tool. In this notebook, we will be exploring how we can use mined data, the data that we created, to inform experiments in our ongoing effort to optimize materials. These materials, the ceramic garnet oxides that we mentioned before, have an application in building battery systems that replace liquid components with solid ones. What we'll be trying to do is find out how many experiments we have to run before we can reach the maximum of our known database. So let's start. The first thing in any Python script: you need to import your libraries. Remember, to run any of the cells, it's Shift+Enter. You know a cell is done when this indicator changes from a star to a number. Then, as we saw in part two of this series on querying repositories, we can access data from Citrination, in this particular example using an identifier for the database, which is this number. So we can run this too.
You can see that we're using a function from that library that's going to give us a pandas DataFrame with the data. With the DataFrame, we can see that we have some properties for each of these compositions, the composition being the first column. In this data set that I curated, I also included some preparation steps as metadata, and references pointing to the papers where I took the data from. But the idea is that from this information, we're going to get the formula and the ionic conductivity that we need. This is a function that we'll need a couple of cells below to get descriptors for the compositions, so let's just run that. If you recall from part one of this series, we can use Boolean filters with pandas to filter our data. If not, it's pretty simple: we have a condition here, so for this column of the DataFrame, if a row matches this value, we keep it; if not, we drop it. We're going to use these Boolean filters to reduce the measurements to only those made at room temperature, because that's the property we are after: room-temperature ionic conductivity. So we can run these two cells. Now I'm going to go to a function that we have not seen before; it's also from pandas, called duplicated. This method returns the rows with values that are repeated in a particular subset of the columns. Let me explain a little bit more: we have a subset here, and we're feeding it the name of the first column, so this duplicated method is going to return the rows that have the same composition, basically. What we're trying to do is remove the duplicates so we have one measurement per material. Machine learning algorithms struggle when you have the same inputs with different outputs, so we want to present the model with single input-output pairs.
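The two pandas idioms described here, Boolean filtering and `duplicated`, can be shown in miniature. The column names and values below are hypothetical stand-ins for the curated dataset, not its real schema:

```python
import pandas as pd

# Hypothetical columns; the real curated dataset's columns differ.
df = pd.DataFrame({
    "formula": ["Li7La3Zr2O12", "Li7La3Zr2O12", "Li5La3Ta2O12", "Li5La3Nb2O12"],
    "temperature_C": [25, 27, 500, 25],
    "conductivity": [1e-4, 3e-4, 1e-5, 2e-6],
})

# Boolean filter: keep only room-temperature measurements (18-30 C).
room_temp = df[(df["temperature_C"] >= 18) & (df["temperature_C"] <= 30)]

# duplicated(subset=..., keep=False) flags every row whose formula repeats.
dupes = room_temp[room_temp.duplicated(subset=["formula"], keep=False)]
```

Here `room_temp` keeps three of the four rows, and `dupes` contains the two repeated Li7La3Zr2O12 measurements, which is exactly the set of rows the notebook then collapses to a single value per material.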
The first line keeps the indexes of the rows in the DataFrame that are repeated, and this one keeps the entire rows. Since we have around 80 repeated values, what we're going to do is take all of the measurements for the same material and keep their median. Here, in these two rows, we're creating empty containers. This is going to be a dictionary of lists, and we're going to take advantage of a property of dictionaries: we cannot have the same key twice. So if we use the materials as the keys, as strings, we can collect all of the measurements in a list assigned to that key. Again, dictionaries are key-value pairs: if we have a key, a string with the name of a material, we can set the value to be a list of all of the measurements that we find in the DataFrame, and that's what we do in these couple of for loops. Here we're replacing the measurement: we have a list of measurements for the repeated materials and we're replacing it with the median, then plugging it back into the main DataFrame that we're working on, data. Then we're dropping all of the rows that are duplicated so we don't have to worry about them anymore, resetting the indexes, and dropping this index column that gets generated here. So we get a new DataFrame with unique values that all pass the filters we applied before, and we can start working. You can see that the cell already finished running, so we can move on to this section on obtaining features and descriptors from matminer. Machine learning models would not work just by plugging in the stoichiometric formula to get an output as complex as ionic conductivity. So the way we're doing this is by turning the chemical formula for the material into descriptors.
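The dictionary-of-lists idea for collapsing repeated measurements to a median can be sketched as follows. The labels and values are made up for illustration; the notebook does the same thing against the real DataFrame:

```python
import pandas as pd

data = pd.DataFrame({
    "formula": ["A", "A", "A", "B"],          # hypothetical material labels
    "conductivity": [1.0, 2.0, 10.0, 5.0],
})

# Collect every measurement per material: one key per material, a list of values.
per_material = {}
for f, c in zip(data["formula"], data["conductivity"]):
    per_material.setdefault(f, []).append(c)

# Keep the median of each list, so each material appears exactly once.
deduped = pd.DataFrame({
    "formula": list(per_material),
    "conductivity": [pd.Series(v).median() for v in per_material.values()],
})
```

Material "A" has three measurements (1, 2, 10), so it ends up with the median, 2.0. The same collapse can also be done in one line with `data.groupby("formula", as_index=False)["conductivity"].median()`.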
We do this using a library called matminer, in which we're not only putting in the stoichiometry of the formula but also information about the elements that compose the material and their properties: some properties on valence, and some elemental fractions at the end. The aim is to describe each of these materials uniquely, to try to match it with its corresponding output. Another thing that we're doing is adding the measurement temperature. Most of the literature, when reporting ionic conductivity, reports the value and the temperature at which it was measured. Temperature has a strong relationship with ionic conductivity; it's an Arrhenius relationship, if you're a materials scientist. If you measure the same material at a higher temperature, you get higher values of ionic conductivity, so we want to keep track of that too, and we're going to put it in the descriptors that we'll be using. Here we're separating descriptors as X and labels, the conductivity, as y. Another important thing to note: when we created these descriptors, we just put them all in, as in, we have a formula, let me get all of these descriptors. But in machine learning, you want the differences between the materials to actually help your model. So here we're dropping from the DataFrame all of the columns that have a standard deviation of zero, keeping only the ones that are different from zero. This is because if every single material in your database has the same value for some property, then that descriptor is not giving anything back to your model; it's not going to learn anything from a descriptor where all of the materials have the same value, basically. So we can generate these descriptors. After all of our wrangling of the data, the filtering, removing duplicates, and getting descriptors, we end up with 100 unique points with 108 descriptors for each of these points.
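Dropping the zero-variance descriptor columns is a one-liner in pandas. The descriptor names below are invented for illustration, not matminer's actual output columns:

```python
import pandas as pd

X = pd.DataFrame({
    "mean_valence": [1.0, 2.0, 3.0],   # varies across materials: keep
    "n_oxygen": [12.0, 12.0, 12.0],    # constant everywhere: carries no information
    "li_fraction": [0.4, 0.5, 0.4],    # varies across materials: keep
})

# Keep only descriptor columns whose standard deviation is nonzero.
X_reduced = X.loc[:, X.std() > 0]
```

The constant `n_oxygen` column is removed, because a feature that is identical for every sample cannot help the model distinguish one material from another.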
Then we can move on to processing and organizing the data. For the sake of time, I'm going to skip some of these cells. In this particular notebook, let's run that one so you can see the split. First we're separating descriptors and labels, so inputs and outputs, and then we're separating training and testing. You can see that we have a 10% split, where we're keeping 90% of the descriptors, 90 out of 100, for training and 10 for testing. But we're going to, for the sake of time, move past this comparison of regression models. You saw in part four of this series how to create a neural network, so here's a neural network that we created to compare, here's the training of that neural network (we're not going to run it right now, again for the sake of time), and here's a plot of the model. I need to stop here a little bit, because we're skipping cells, and Jupyter notebooks are designed to run from top to bottom; but we can also just skip some cells. I'm going to copy this cell, so I need you all to copy this cell too. If you were here in the first session of the first part, when I was explaining how to plot with Plotly, you remember that we need a trace and a layout. Since we're skipping those cells and we need a Plotly plot, you can see here that we have a trace but we don't have a layout, so we're going to plug it in here, and then I can go back and explain what all of this is. We're keeping random forests because we can get a sample-wise uncertainty for each of the points. So again, we're processing the data, separating it into training and testing, and importing a regressor, a tool to make a regression with random forests, from this library. What we're going to do is make predictions: here we're training with values, and here we're testing. And here's what we're going to see, hopefully, if I copied that correctly.
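The 90/10 split described here is the standard scikit-learn idiom. This is a minimal sketch with placeholder arrays rather than the notebook's actual descriptors:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)   # 100 stand-in descriptor rows
y = np.arange(100, dtype=float)      # 100 stand-in labels

# 10% test split: 90 points to train on, 10 held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=0)
```

Fixing `random_state` makes the split reproducible, which matters in a tutorial where everyone should see the same plot.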
Okay, this is a really big plot at this zoom level, but you can see that we capture the values of the points really well. Here, green are the training points and red are the testing points, and you can see the error bar as the uncertainty. You can see that the uncertainty is lower where we have a lot of points; there are a lot of bad conductors here. But as the points get more sparse, their uncertainties increase, because the model is not quite capturing the really high ionic conductivities when it has so few points there. Sequential learning works by choosing an initial set. Sequential learning, or active learning, means creating these kinds of models but with few entries, which we then query with experiments, to see how few experiments we actually need to run. So for this section, the sequential learning, we're going to start by creating a new model, a new random forest, and we're going to start with an initial set of 10. This represents the 10 black points that you saw in the graph at the beginning; the 10 revealed black points are these 10. To keep track of which points are revealed and which aren't, we're going to create this variable here, train. It's going to be a list of zeros, but these zeros are going to be Boolean, so it's going to be a list of False values. To pick this initial set, we're going to make a random choice out of all of them: from the length of the data, which is 100 total points, we're going to choose 10 randomly and change them to True. The model is only going to train with the points that are marked True. Since we know the maximum value that we're trying to get in this particular example, we will assert (assert is a Python statement) that the maximum value is not in the initial set. Okay, we can run this, and we can see that we picked 10 training entries. The following three cells:
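The "sample-wise uncertainty" that makes random forests attractive here comes from the spread of the individual trees' predictions. This is a sketch of that idea on synthetic data, not the notebook's exact code:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(80, 2))
y = X[:, 0] ** 2 + rng.normal(0.0, 0.02, size=80)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

X_new = rng.uniform(-1.0, 1.0, size=(5, 2))
# Each fitted tree is in forest.estimators_; stack their predictions.
tree_preds = np.stack([tree.predict(X_new) for tree in forest.estimators_])
mean = tree_preds.mean(axis=0)   # the forest's usual prediction
std = tree_preds.std(axis=0)     # per-point error bar
```

Points far from the training data tend to make the trees disagree, so `std` grows exactly where the plot's error bars grow: in the sparse, high-conductivity region.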
I'm just going to show you how the model predicts with this initial set. This model.fit is training using the descriptors of the entries that are True and the labels of the entries that are True, and then we're going to make predictions from that. Here we move into information acquisition functions. We have predictions for everything and we have uncertainties for everything, and these acquisition functions are going to choose different points depending on these formulas, these utility functions, for how they choose the next point. MEI, maximum expected improvement, chooses the index of the value that has the highest prediction. MLI, maximum likelihood of improvement, is the same thing but adding the uncertainty; MU is the maximum uncertainty; and UCB is the upper confidence bound, the mean of the prediction plus a constant times the uncertainty. The idea is that they're really easy to implement in Python, and you can see here how each of the acquisition functions would choose the next point. This function, np.argmax, returns the index of the maximum of a list of values. So if we have a list and the maximum is element 10, np.argmax returns 10, the index, instead of the maximum value itself. We can run this and then print what each one chooses. For each of the different acquisition functions, we can see that the predicted conductivity of the chosen material, in this case number 54 based on this acquisition function, is this value plus or minus this uncertainty. You can see that, for example, the last one, upper confidence bound, chooses the one where, if you add those two values up, you get the maximum, and maximum uncertainty is keeping track of where the uncertainty is highest. Now we're going to go into a code cell that is really big, but I summarized the repeating parts.
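All four acquisition functions really are one `np.argmax` over a different score. This is a compact sketch of that idea; the function name and the `kappa` constant are illustrative choices, not the notebook's exact code:

```python
import numpy as np

def pick_next(pred, unc, strategy, kappa=1.0):
    """Return the index of the next candidate to reveal.

    MEI: highest predicted value; MLI: prediction plus uncertainty;
    MU: largest uncertainty; UCB: prediction plus kappa * uncertainty
    (kappa is an assumed trade-off constant).
    """
    scores = {
        "MEI": pred,
        "MLI": pred + unc,
        "MU": unc,
        "UCB": pred + kappa * unc,
    }[strategy]
    return int(np.argmax(scores))   # argmax returns the index, not the value

pred = np.array([2.0, 5.0, 4.0])
unc = np.array([0.1, 0.2, 3.0])
```

With these numbers, MEI picks index 1 (highest mean, 5.0) while MU and MLI both pick index 2, whose large error bar (4.0 ± 3.0) makes it the point we know the least about.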
I summarized the repeating snippets of code that you will see in this cell, and they should be in your handout. We're running a comparison of the different acquisition functions, and the acquisition function is the only thing that changes in the code. To iterate on that, we're going to start by setting the initial values. Sequential learning works by querying different values until a certain condition is satisfied. It could be: give me the maximum value in the 50 experiments that I'm going to give you. This is important when your experiments are really expensive or time-consuming. You can also stop the sequential learning process once you find a material that is good enough; for example, if I only want a conductivity higher than 1 times 10 to the minus 3 Siemens per centimeter and I find a material that is better than that, I can just stop the sequential learning. Right now we're going to do it by number of experiments: we're going to give this algorithm 50 attempts to find the maximum, and this 50 is the number of experiments here. This list, all_inds, gives me the indexes of the entire data set, so 100 indexes. We're going to initialize each of these variables here; this random_train, and likewise each acquisition function's _train list, contains the training indexes, so we're going to start with the initial set. np.where on train tells us where the initial set is True: we create a list of the indexes where it is True, for every acquisition function. Then we can move on to the iterative part of sequential learning. What we're doing right now is assigning the training values out of the last iteration of these variables. These variables are basically where we're keeping track of what the model is training on, so this is the current iteration.
So the last element of these would be the current iteration. We already have the ones that are True; now we want all of the ones that are False. This search index list is going to be the difference between all of the indexes minus the ones that we are already training on, and this is going to be the space that the model will be exploring. Now we do a training with the descriptors of the set that we have right now and the labels of the set that we have right now, and we're going to get a prediction for the rest. You can see here that we are training with the train set and predicting for all of the search set, the ones that we're not training on. Then we are going to append. This is an acquisition function; this particular snippet is our baseline, so it's just randomly choosing one of the indexes. But if we go down to this acquisition function, maximum expected improvement, you can see that it's the same code that I described before: it gives me the index where the prediction is maximum. Then we're going to append this entire list back to our history, so once this runs again, it becomes our current iteration. So let's now show how this sequential learning algorithm works. We can run this; it's going to take about a minute, so I'm going to start explaining something else while it runs. Remember, this star is going to turn into a number once it finishes running. Now for the plot. If you can move a little back in your handout so you can see the plot, try to follow what I'll be explaining. We created plots that are animated: we want plots where, with each experiment, we can see how the grayed-out circles are being colored in, that is, how the algorithm is exploring this compositional space. Again, in your handout, you have a summary of how to do this without repeating all of the acquisition functions.
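The bookkeeping described here, a Boolean mask, `np.where` to get the revealed indexes, and a set difference to get the remaining search space, can be shown in a few lines (the array size and revealed positions are made up):

```python
import numpy as np

train = np.zeros(10, dtype=bool)     # 10 stand-in materials, none revealed yet
train[[2, 5, 7]] = True              # reveal an "initial set" of three

all_inds = np.arange(len(train))
train_inds = np.where(train)[0]                   # indexes currently revealed
search_inds = np.setdiff1d(all_inds, train_inds)  # space left to explore
```

Each iteration fits on `train_inds`, predicts on `search_inds`, and moves the chosen index from the second array into the first.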
But the basic idea is that we want to create an empty plot and then overlay it on our current plot. So imagine you have a plot and you're going to do all of your formatting. I'm going to go through some of these steps pretty fast, but I'm going to explain how to animate it. Matplotlib is a really popular library for research, so I would expect most of you have some knowledge of it, but I'm going to spend maybe a minute explaining how to do this. First you create a figure. Here we're using subplots because it's six different plots: five acquisition functions and one to keep track; you're going to see it in a minute. What we're trying to do is create these empty plots. If I go back to this plot, this is the plot that you're seeing in your handout: we want to create these empty plots. You have it here; it's just an empty X and an empty Y, and we're coloring it, basically. This initial set is going to be the black points that we have right there, and these all-samples values are the gray points that you see in your plot. The difference between this empty plot and an ordinary plot is in this bit of Matplotlib syntax: this is not just a plain variable-equals-subplot assignment. To make it animated, we need a function that updates the data in each of these plots for every experiment, so we keep a handle to each line, and the important thing for these animated plots is the comma: we're putting a comma because we're unpacking the artist for the plot that we're trying to animate. All of these commands here are just formatting: you can set the axes; here I'm removing the labels on the X axis and changing label sizes. The idea is that you create an empty plot, you create your initial set here, and then you create all of the gray points. You create the black points, and then you create the colored points, which are empty if you don't have any experiments yet.
Now, to make an animation, you need an update function; you also have a summary of that. In this update function, num is the number of experiments that you have run so far. So if the number of experiments is greater than zero, what we'll be doing here is getting the values for the indexes that we have so far. And if we find the maximum, I stop the animation: I put a star here and set the lines and the check marks in the graphs to a value that's not going to change. The update function is driven by num, the number of experiments, and once we find the maximum we stop updating the plots. While we haven't found the maximum, we're setting data: we're setting data on this line that we created (this is the line for the first plot, you're going to see it in a minute), and this is the data for all of the points that are getting colored. You can see that after we color the first 10, the first 10 never change, and after those first 10, we color every point revealed by a new experiment in a different color for each of the acquisition functions. Then, to put this animation together in this figure, we're going to use this animation command: we close our figure and turn it into a video. Some important things about the animation: we have to provide it with a figure, we have to provide it with an update function, and here our frames are the experiments, so the length of the experiments that were run, which is 50. Then it gets turned into a video for convenience. And you can see here the different compositions that are being explored by each of these different acquisition functions. As mentioned before, once they find the maximum value, they're going to stop.
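The empty-plot-plus-update-function pattern looks like this in miniature. It is a single-panel sketch, not the notebook's six-panel figure; the data are made up, and in a notebook you would render the result with something like `anim.to_html5_video()` (which needs ffmpeg):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                     # draw off-screen; no display needed
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

y = np.cumsum(np.random.default_rng(0).uniform(size=50))
x = np.arange(len(y))

fig, ax = plt.subplots()
line, = ax.plot([], [], "o-")             # the comma unpacks the single artist
ax.set_xlim(0, len(y))
ax.set_ylim(0, float(y.max()) + 1)

def update(num):
    # Reveal one more "experiment" per frame by extending the line's data.
    line.set_data(x[:num], y[:num])
    return (line,)

anim = FuncAnimation(fig, update, frames=len(y))
```

The empty `ax.plot([], [])` is the blank overlay; `FuncAnimation` then calls `update(0)`, `update(1)`, and so on, once per frame, which is what makes the grayed-out circles appear to fill in one experiment at a time.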
You can see that we ran a comparison of all of these acquisition functions, and they all found it before random did. This first plot that I was mentioning keeps track of where the maximum is for each of the other plots. You can see here that our maximum is here; the algorithm found it at maybe experiment 11, and then it never found something higher, so the line never went higher. This is an interesting way to see how these algorithms explore the compositional space. Let's see it again, now with the first plot, so you can see how it's moving. You can see that all of the acquisition functions perform better than random. A random search over the 90 possible experiments that we have would yield an expected value of about 45: on average, randomly choosing samples would take 45 experiments, but we can reduce that dramatically using these acquisition functions. I'm going to head back to the presentation to show the results. You can see here the results of the sequential learning, where we can find our best-performing candidate using as little as 30% of the experiments it would take to just randomly choose from the pool that we have. That's the idea with sequential learning: you want to reduce the number of experiments that you have to run, so you reduce all of the expense and time you would spend on those experiments, and these acquisition functions and machine learning models can help you with that. It doesn't have to be a perfect model: if you have a model that captures some of the trends in your data and is able to predict, within reason, the values that you're trying to get from your descriptors, it can really help you reduce the number of experiments. So with that, I'm going to again leave you with this information on why we're doing these workshops.
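The "random takes about 45 experiments" baseline can be checked with a tiny simulation: if the top performer sits somewhere among the 90 hidden candidates and you reveal them in random order, the expected number of reveals is (90 + 1) / 2 = 45.5. The position of the best candidate below is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 90     # candidates left after revealing the initial 10
best = 42         # arbitrary position of the top performer among them

# Simulate random search: how many reveals until the best one turns up?
draws = []
for _ in range(20000):
    order = rng.permutation(n_hidden)
    draws.append(int(np.where(order == best)[0][0]) + 1)

expected = float(np.mean(draws))   # close to (n_hidden + 1) / 2 = 45.5
```

So the roughly 11 to 15 experiments the acquisition functions need in the demo really is about a threefold reduction over random search, which matches the "as little as 30%" figure on the slide.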
So we're trying to get all of you to use machine learning and data science to inform your experiments, to create your models, and to use this cyber platform so we can create better science faster. Thank you. Thank you, Juan Carlos and Professor Strachan. At this time I'd like you all to unmute your mics and give a round of applause for both of our speakers. Thank you. And then we'll turn it over to a Q&A, so we'll address some of the questions that are in the chat log, but also feel free to ask questions verbally as well. So the first question that we have: there was a question earlier from Juan Claudio about removing the duplicates. I lost my chat log about halfway through the session, my internet went down for a second, so if you can resend that question we can address it. You're assuming the conductivities for duplicates are all at the same temperature and conditions, right? I am keeping the temperatures within a range around room temperature, from 18 to 30 degrees, and then I'm taking the median of that. So I am making the assumption that the conductivities around room temperature, from 18 to 30 degrees, are comparable, but then I am still feeding the measurement temperature back in for the materials. I see a lot of questions here in chat. Okay, a general question: when to use Matplotlib and when to use Plotly? So Plotly is a library created to show plots interactively. Plotly is really good when you're trying to demonstrate things in these kinds of Jupyter notebooks while keeping publication quality. So if you want to zoom in on data, for example, or you want to explain complicated data and just play around with it, you would want to use Plotly. Matplotlib is a port of MATLAB-style plotting to Python, and it's a really popular tool for creating graphs for publications.
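A minimal side-by-side of the two, with made-up data (the Plotly part is guarded in case the library isn't installed, since it's an optional dependency here):

```python
import matplotlib
matplotlib.use("Agg")          # off-screen rendering
import matplotlib.pyplot as plt

x = [18, 22, 26, 30]           # made-up temperatures (C)
y = [1.2, 1.5, 1.4, 1.9]       # made-up conductivity values (a.u.)

# Matplotlib: a static, publication-quality figure saved to disk
fig, ax = plt.subplots()
ax.plot(x, y, marker="o")
ax.set_xlabel("temperature (C)")
ax.set_ylabel("conductivity (a.u.)")
fig.savefig("static.png", dpi=300)

# Plotly: the same data as a zoomable, hoverable figure in a notebook
try:
    import plotly.graph_objects as go
    interactive = go.Figure(go.Scatter(x=x, y=y, mode="lines+markers"))
    # interactive.show() would render it interactively in Jupyter
except ImportError:
    pass                        # plotly not installed; skip the demo
```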
So if you want a static plot, use Matplotlib; if you want an interactive plot, use Plotly. One question here that I think can be answered pretty quickly. Biswath asked whether 10 data points for the test case isn't a small number of data to train on. Maybe we can clarify how we're splitting our training and testing data? So for the sequential learning, yes, it is starting with 10 points, and that is a small number of data to train on at the beginning. But the idea with sequential learning is not that we have a perfect model at the beginning that captures all of the nuances of your data. The idea is that you provide the model with some points, some guiding points basically, so it can make somewhat accurate predictions that are just going to get you closer to optimizing a property. It's not like a standard supervised problem where we take all of the data and just try to predict a number for each example in the testing set. This is more like: I want to capture the information from this data, and I don't have enough data to just plug it into the model and be done. What we're trying to do here is not optimize how well we predict the properties; it's optimizing how we conduct the experiments that will lead to a higher property. Let me paraphrase that a little bit. The idea is you may start with so little data that your first guesses are as good as random. That's okay, because you're going to add data, and as you add data your model is going to get better and better. So you start with, let's say, 10 data points, and 10 data points is probably too little for any predictive model. But then you're going to have 11, and then 12, and then 13, and every time you add an experiment you add it to your training data, and the predictive power of your model improves.
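That loop can be sketched in a few lines; everything here (the candidate pool, the hidden ground truth, the forest settings) is made up for illustration, and the acquisition rule is the simplest one, maximum of the prediction:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X_pool = rng.uniform(0, 1, size=(90, 3))        # 90 candidate "materials"
y_pool = X_pool.sum(axis=1) + 0.05 * rng.standard_normal(90)  # hidden truth
best = int(np.argmax(y_pool))                   # index of the true best

labeled = list(rng.choice(90, size=10, replace=False))  # the initial 10 points
while best not in labeled:
    # retrain on everything measured so far
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_pool[labeled], y_pool[labeled])
    remaining = [i for i in range(90) if i not in labeled]
    preds = model.predict(X_pool[remaining])
    # maximum-expected-value acquisition: "run" the best predicted candidate
    labeled.append(remaining[int(np.argmax(preds))])

print(len(labeled))  # experiments needed to hit the true best candidate
```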
So at some point you're going to start getting some traction, your predictions are going to get better, and that means you make better choices, and then the next data point will be more informative, and so on and so forth. If you look at the animations of these movies that Juan Carlos was showing, you can see that at the beginning all of these models tend to explore, driven by the information acquisition function, and with time, as you have more and more data, you start narrowing down on the area with the maximum. So I think one question that's a good one for opening up a discussion: what is, or how do you choose, the proper model to do the active learning process? Juan Carlos can chime in. I think what's important is to have a model with uncertainties. If you don't have uncertainties in your predictions, then you lose the ability to implement a lot of these information acquisition functions. Random forests lend themselves easily to uncertainty quantification in a very simple way, because they're based on a bunch of parallel decision trees, so you can calculate the variability across those. There's ongoing research in doing uncertainty quantification in neural networks. You can use Gaussian process regression. And basically, whether it's for sequential learning or not, the model you choose will depend a little bit on your problem, on how high the dimensionality is. For very high-dimensional problems, Gaussian processes tend not to be ideal. So it depends on your problem, and the bottom line is that implementing these models is reasonably inexpensive, so you can always try different models and see how well they work for your problem. I have a question. I would like to know if you have seen someone using neural networks in active learning, in the sense that I am not sure if we can get uncertainty quantification of the prediction.
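The random-forest trick mentioned a moment ago, reading uncertainty off the spread of the individual trees, looks roughly like this on toy data (nothing here comes from the session's notebook):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)   # noisy toy target

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

X_new = np.linspace(-3, 3, 50).reshape(-1, 1)
# each tree in the ensemble gives its own prediction...
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
mean = per_tree.mean(axis=0)   # same as forest.predict(X_new)
std = per_tree.std(axis=0)     # tree-to-tree variability = uncertainty
# an acquisition function can now trade off `mean` against `std`
```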
I think there's a body of work on using dropout to do uncertainty quantification in neural networks, and so they are used in active learning for sure. There's research in doing Bayesian calibration of neural networks. The way we usually do this calibration, with backpropagation, is very efficient computationally. Basically, a neural network is just a model with parameters, so you can do the calibration in any way you want, including stochastic calibration with Bayesian approaches, or with other types of approaches that will give you a distribution of parameters from which you can do uncertainty quantification; the challenge is at what expense you can do it. Professor Ilias Bilionis here at Purdue, in mechanical engineering, is doing very nice research on Bayesian calibration of neural networks. Are there any other questions for our speakers? I see that some have been addressed in the chat. If there are important questions addressed in the chat, maybe we want to discuss them, Juan Carlos, just so that we're recording them. We're going to make these recordings available, so we may want to just go over the questions outside of the chat so there's a record of the questions and the answers.
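The dropout idea mentioned at the start of this answer (Monte Carlo dropout) can be illustrated in plain numpy; the weights below are random stand-ins rather than a trained network, so only the mechanics matter: dropout stays switched on at prediction time, and the spread over repeated stochastic forward passes plays the role of predictive uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)
# random stand-in weights for a 1-input, 64-hidden-unit, 1-output net
W1, b1 = rng.standard_normal((1, 64)), np.zeros(64)
W2, b2 = rng.standard_normal((64, 1)), np.zeros(1)
p_drop = 0.2

def forward(x):
    h = np.maximum(x @ W1 + b1, 0.0)       # ReLU hidden layer
    mask = rng.random(h.shape) > p_drop    # dropout stays active at test time
    return (h * mask / (1.0 - p_drop)) @ W2 + b2   # inverted dropout scaling

x = np.array([[0.5]])
samples = np.concatenate([forward(x) for _ in range(500)])
mean, std = samples.mean(), samples.std()  # prediction and its uncertainty
```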
Okay, I can do that. One of the questions that I addressed in the chat was that my predictions were still contained within the 100-compound set that I started with, so how can you expand beyond, for example, just garnets? The question is basically: what are the constraints of the prediction? The answer is that right now we're creating this example with values that we already have, because we're testing the sequential learning algorithm; we're not quite at the point of trying to predict new materials. But the idea is that you can create a new data set, with the same descriptors as the original data set, for materials that you actually don't have the values for. As long as you trust that your model is good enough to capture the properties that you want, you can use that in sequential learning. You might not get the most accurate predictions, but you will get, for example, the order in which the algorithm would query these particular materials. So if you have 10 materials that you think are really good, and you have a model that can make somewhat accurate predictions, it will not tell you with great accuracy what the value is going to be for each of your materials, but it will tell you, for example, whether to test material 7 before material 3. The constraint is that you have to provide the model with some sort of normalcy, meaning you want to plug into the model things that are explained by your model. You don't want to train on oxides and then feed it materials like nitrides, because that's not going to work. The idea is you can create these data sets of actually unknown values, plug them into your sequential learning, and see what you get from it; what you will get is an order in which to test your new materials, not a prediction of what the materials' values are going to be. Is there any other question in the chat that I should discuss? The Matplotlib versus Plotly one, and I think we already answered the one about neural networks. Any other questions? Yeah, so again, just to answer Rick, we'll go
back to this question that's in the chat: can you input an unknown garnet? The answer is yes; you just need to make sure that when you exercise your model, you exercise it within the bounds where you believe it's predictive. You cannot extrapolate with your model very far away from where you trained it, because obviously your uncertainties will grow. But the idea is that these information acquisition functions will catch that; that's why it's important to have uncertainties, so they can tell you this is an area you need to narrow down on. I have a question, if I may; this is Juan. It's more materials based: is there a particular set of structural descriptors that have been used, or a repository for them? So for example, if we're looking for conductivity, like in this case, maybe we want some open-ended structures, so just giving the chemical formula may not be the way; the space group may not be enough, or just saying FCC versus BCC may not be enough. But I presume that this has been looked at by others, trying to describe the different structures, and the 10,000 compounds in the databases, in different ways for different problems. Does that make sense? So in this particular case, what matters for conductivity is the occupancy of the sublattice that the lithium ions occupy, and by changing the dopant you can change the partial occupancy of that sublattice. This is always the case for transport in solids: it's mediated by vacancies. So that is actually one of the descriptors that we have: what is the partial occupancy of that sublattice? That information is there if you have crystallographic information: you have all the crystal sites, you also know what the occupancy of each of those sites is, and, as you mentioned, the occupancy of that sublattice for the lithium ions. Does that make sense? Yes, thank you. In general, for materials, there's Pymatgen, a library from the Materials Project that has a lot of these descriptors, and I know
that Juan Carlos and Saaketh are much more knowledgeable than I am about those libraries, but there are many descriptors; not one specific one, but you can use a lot of standard descriptors right out of the gate. I'm posting in the chat two links, one for descriptors and another for tuning neural networks with these Bayesian processes, for those of you who are interested. Perfect, and maybe we want to capture that in the recording. Yeah, I can put it in there. I have another question: have you tried to sample more than one point at each iteration? Let's say that you want to perform parallel computing, so you can sample more than one point so that you arrive earlier at the best point. Have you tried that? So we have not tried that yet. How best to use these acquisition functions is still an open research question, because you saw that we're using the same formula as we move through the experiments. You can also make combinations of these acquisition functions, like starting first with the maximum of the prediction and then doing some exploring of the uncertainty. And you can do what you were saying, doing it in batches, and see how many batches of experiments you need to run before you find the maximum. But the idea with sequential learning is that you want to run experiments, you want to run simulations, and these are time consuming and really expensive. So if your acquisition functions are giving you a hundred different candidates that might be the maximum, it gets really expensive once you start adding more points. But it's an interesting question, to try adding two at a time, or batches with different acquisition functions. And there's more to it: often in research and development efforts, you have more than one source of information. For example, you can have an effort where you can run a simulation or you can run an experiment, and maybe there's more than one experiment that you can run; maybe you have a cheap experiment that
you can do very quickly but gives you an approximate answer, and maybe there's an experiment that could take a month but gives you the right answer. We actually have a project supported by the National Science Foundation where we're developing these types of ideas, to optimize what type of information acquisition is best given your budget. So maybe you can do three simulations and two quick experiments and one full-scale experiment. The models that we were developing here have a single source of information; we assume that all these experiments have the same accuracy. But it's interesting: you can use neural networks, you can use Bayesian approaches, to aggregate information from disparate sources and improve your predictions. And then you can say, okay, given the time constraint, I'm better off running a bunch of simulations than one experiment, or I really need to run one experiment because that's the ground truth. Alright, so it looks like we're getting to the top of the hour. If you do have any other questions that come up in the next hours or days, please feel free to reach out to any of us by email; we'd be happy to address your questions. These lectures, like we said, are recorded, and we will be posting them on the nanoHUB page, so look for those when they come out. And I'd like to request that everybody give one last thank you to our speakers for doing an excellent job today. Thank you