Good morning from Stanford University. My name is Will Chueh. I'm the director of the StorageX Initiative here. It's my great pleasure to welcome everybody back to our final symposium of the summer quarter. Today we are very excited to welcome two experts who have been hybridizing the fields of energy storage, materials science, and data science. We're going to be discussing a critical topic that is near and dear to my heart: how do we use modern informatics methods to accelerate the pace of research, development, and deployment for energy storage technologies? This topic is hugely important because technologies like batteries and other energy storage devices inherently occupy a very large design space. And because these technologies have very long lifetimes, the assessment time is also long. Moreover, some of the properties we are predicting have to do with safety and reliability, so they involve very rare events. Traditionally, this has been very challenging to tackle using standard methodologies. Over the past 10 years, data science methods that emerged in other fields have been applied to batteries and other energy storage technologies. So today I am really excited to welcome two wonderful folks who have been on the leading edge of this field: Austin Sendek from Aionics and Stanford, and Chris Wolverton from Northwestern University. They will be talking about how these informatics methods can be applied and combined with other materials science methods to accelerate the timeline for energy storage technologies. First up, we'll have Austin give a talk. So Austin, if I can ask you to come to the stage. Thank you very much. Good morning, Austin. Let me just give a short introduction, and then we can get started. As I mentioned, Austin is the founder and CEO of Aionics, and he'll tell us more about his exciting company working in this area.
He is also an adjunct professor in the Department of Materials Science and Engineering here. He is one of our finest, having received his PhD in applied physics, and he has really been trailblazing the field. In 2017, he wrote a seminal paper applying machine learning methods and density functional theory calculations to predict previously unknown solid electrolytes for solid-state batteries. That paper has come to be regarded as one of the major breakthroughs in the field, and excitingly, its predictions have since been validated experimentally several times around the world. I'm sure Austin will give you some history on that work. I also want to mention that Austin is an up-and-coming materials scientist and entrepreneur, having received the Forbes 30 Under 30 award a couple of years ago. So Austin, we're really excited to hear your take on the field, your journey, and the latest and greatest at Aionics. Austin, all yours. All right, thank you, Will, for that introduction. It's really a pleasure to be here with all of you bright and early, California time. I'll just go ahead and kick things off. Okay, so the title of my talk is "From Medicine to Materials: Adapting the Drug Discovery Model to the Battery Industry." The goal of this talk is to think about how we can accelerate the pace of R&D and deployment in batteries. As Will said, there are a lot of technical challenges around how battery technology is developed, and of course there's urgency here given the climate crisis and the climate implications of energy storage technology. Machine learning and informatics is a big piece of this, and I will touch on that as we go. There are technical pieces here, but this is not a strictly technical talk.
I think by the end of it, my goal will be to convince you that machine learning is not just an interesting method to apply in this space but may actually be an essential one. So I'll jump right in and start by saying that the motivation, as I just mentioned, is mitigating climate change. There are many societal benefits to energy storage, and climate change is a big part of that. For probably all of us, having lived through the last two years, the rollout of safe and effective COVID-19 vaccines in less than 12 months showed the incredible power of the drug discovery model as it currently stands. I would say that was a resounding win for science and society. Being vaccinated for COVID-19 is now so mundane and commonplace that it's easy to forget just how amazing a development it was: the genome of the virus was sequenced in, I believe, early 2020, and mass-produced vaccines were on the market, at least here in the US, within 12 to 15 months. So there's something in there that is really powerful and really exciting, and battery innovation needs to happen faster. It would be amazing if we could get these new technologies to market on that same timeline. So the question guiding our work at Aionics is: what can we learn from the success of these drug discovery efforts? And I want to give a definition of what I mean when I talk about drug discovery here. A key element of drug discovery, as I will refer to it in this talk, is a partnership between an innovator and a manufacturer.
The innovator discovers a solution rapidly, and the manufacturer brings it to market rapidly. Sometimes the innovator and the manufacturer are the same company, but often they are two different parties: maybe the innovator is an academic lab, maybe it's a startup, for example a BioNTech, and then the manufacturer is a Pfizer. For this to work, the innovator must be able to find a solution to the problem and then write it down, maybe not literally, but essentially write it down and pass it to the party who is going to make it. Whether this actually works becomes a question of whether a solution developed in one context, within the confines of the innovator's setting, still works in a new context, the context the manufacturer will deploy it in. So the question to address here is: does this work with batteries? I would say the answer is yes, with some limits and caveats. That's the argument I'm going to try to make, or the discussion I hope we can have. I won't really be talking about the differences in markets and economics between pharma and batteries. That's a very interesting topic with similarities and differences that could fill a whole other hour. I would just refer you to the paper from Eve Hansen and co-authors, also from Northwestern, that looks at some of the comparison points around the markets: what's different and what's the same between these two industries. I'm going to focus more on the technical challenges. So having set all that up, let me give a quick introduction to myself. As Will mentioned, my main day job is running Aionics, which I founded in 2019 after finishing the PhD program at Stanford.
I've also had the great privilege of being adjunct faculty at Stanford for just about a year now, in the Materials Science department, which has really been my home in many ways for the last decade. I did my PhD in the department, and it's a great privilege to be back. Many of you listening may know my co-founders at Aionics: Venkat Viswanathan, who is well known in this battery informatics community and is a professor at CMU, and Lenson Pellouchoud, who was a PhD student with me at Stanford. Venkat also did his PhD at Stanford, so there are three Stanford PhDs on the founding team. I thought I would give a little bit of a history, a timeline of the last five years or so, and how my personal journey has evolved starting from those PhD days at Stanford. As Will mentioned, I first jumped into this in December 2016, when we published our machine learning for ion conduction paper in EES. Within a few months, I was going back through my old emails to find the exact dates, we were working with Will to set up what became known as the D3BATT center, a collaboration between Stanford, MIT, TRI, and a number of others. It was really cool in those early days. We were in this period of: we know machine learning is really interesting and really important, we know there's so much potential here, but we have to find the impactful problems. In 2018, Venkat, who is now my co-founder, published a machine-learning-driven mechanical property screening paper for ion conductors, which we had collaborated on. And right after my paper came out, this is another old email I dug up, I got an email from Venkat saying: we're interested in this space, do you want to come give a talk?
And you know this is not a forged email because there's a reference to BlueJeans as the medium for the video call; I don't know if anyone remembers the BlueJeans era. So I got to know Venkat well in that time, and his group has really picked up and run with a lot of the solid ion conductor work. I founded the company in early 2019. In mid-2019, Will's group published the really seminal paper on cycle life prediction led by Severson and Attia, and I think that has kicked off a whole subfield around cycle life prediction. In late 2019, Venkat's lab started to ramp up its capabilities in ML-driven energy materials discovery with an ARPA-E award. He formally joined Aionics as chief scientist in 2020, and we've now given him the title of co-founder as well because he's been such a valuable presence on our team. Since then, Aionics has been focused on what we call co-innovation partnerships. These are partnerships, typically with cell manufacturers, where we're co-innovating together: we're looking to identify new materials, new usage patterns, and new protocols for battery operation and battery design. Unfortunately, we cannot publicly disclose the names of most of the companies we've worked with, which is a bummer, but we have several clients who have very graciously allowed us to talk about these partnerships: Form Energy, Showa Denko, Oakburg, Sepion, and Chement are some of them. Chement is a fun one because it is essentially electrochemical cement, taking battery design ideas and repurposing them to make a carbon-neutral cement material. So we're really working to translate these capabilities beyond batteries per se and into electrochemistry more broadly.
Presenting with Chris, I should mention that from my earliest days in this space I was following his group's work, and I'm excited to hear him talk about his perspective over the last five to ten years. It's really a privilege to be speaking with him because he has been such an influential presence and contributor in this space. I've gotten to know a number of his students and former students and worked with them in various capacities, so I have great respect for Chris, and it's really fun to be here with him today. Okay, so with that introduction, I want to come back to this point about commercializing new battery technologies broadly. Aionics, in some sense, sits at the intersection between academia and industry: some of what we do, not all, but some, is translating various models, datasets, and approaches out of the university setting, tracking the literature, seeing what works and what doesn't, and bringing that to our partnerships with industry. This has given us a unique perspective on which academic efforts are the highest-value ones, or, to put it another way, what industry is really looking for from academia. It's also interesting to see how that cycle of learning can feed back in both directions: what can industry learn from academia, what can academia learn from industry, and what can companies learn from each other? Speaking from our perspective working with a number of cell manufacturers, what we've seen is a more or less typical path to commercialization, where you are de-risking the technology as you go: your risk is falling, or you're getting closer to commercialization.
So you could think of this as a distance to commercialization that falls over time, or as a function of the effort you put in: engineering hours, things like that. I think you can generally map most of these trajectories onto this path: you start with a low-fidelity computational demonstration that it works, you go to a high-fidelity computational demonstration, then a low-fidelity experimental demonstration, and then a high-fidelity experimental demonstration. Then you basically have your prototype sitting on the lab bench, and it's time to scale up. I did not draw this curve to scale; I should mention that scale-up really should be a much bigger part of the risk than is shown here, but I couldn't fit all the text in the upper left of the plot without it becoming too complicated. So I apologize to anyone who thinks I'm downplaying the challenges of scale-up; that is certainly a major challenge. But you can see the trajectory here, and coming back to this question about acceleration a la drug discovery, it really becomes a question of how you leapfrog between these stages rapidly and successfully. And I think there are a few similarities and differences. If you think about drugs, they're typically developed within a representative set of future environments: they're tested within the human bodies they will then be deployed into. In other words, the human body is basically a universally similar environment for a drug, so a molecule that cures a certain disease for human A likely works for human B. Human A and human B can be next-door neighbors or across the globe; the environment is largely the same.
Battery components, by contrast, are typically developed in isolated and idiosyncratically different environments: different labs, even different humidities, different suppliers of copper foils, all this sort of stuff. And what we typically see is that the performance of certain components or solutions is usually not universal across cell designs, and sometimes not even universal within the same cell design. An electrolyte for one lithium metal battery might not work in another lithium metal battery. Just because something works for battery A doesn't necessarily mean it works for battery B. The drug discovery comparison here is almost more like trying to cure diseases across multiple species. My sister is a veterinarian, and she always says that real doctors cure multiple species; maybe that's why I was thinking about this. A COVID-19 vaccine for humans might stop COVID in, say, 50% of dogs; I have no idea, don't quote me on that. The point is that you can get part of the way there, but you need to co-optimize again for the new system you're in. So what does this mean for battery development? It means it's difficult for a solution to problem X in cell Y developed by entity A to be deployed to solve that same problem in cell Y by entity B, even for ostensibly similar cells. It can be very challenging for that solution to translate universally. And this matters because it has implications for how information flows across institutions, namely, as I mentioned, from academia to industry. Coming back to the drug discovery model of "here's the answer, write it down, pass it to the next party," the question is whether what you've written down really is the solution. I think the central question is one of what I will call universality: how universal is the solution that has been found?
A universal solution to problem X, as proposed by entity Y, will also solve problem X for entity Z. A semi-universal solution might solve a certain percentage of the problem for entity Z. So in any given case, the question is basically: what percentage of the solution is universal across labs and experimental setups, within the reasonable range of things you can hope to control for? Let me give you a few examples; I'm going to deep-dive into a few of these and give more details. First, the melting point of a material is pretty much universal. It can be replicated with pretty high accuracy regardless of who's doing the experiment or what environment it sits in. Of course you can say the pressure is different and it'll change, and things like this, but for the most part, one lab can reproduce another lab's melting point measurement pretty well. Something that's maybe semi-universal is a problem near and dear to my own heart, as Will alluded to: the identification of fast ion conductors by DFT. If I say, "my DFT calculations show that this is a good ion conductor; please take it into the lab, make it, and see what you find," we find that experimental conductivities can vary, sometimes significantly and unpredictably, from what is seen in DFT, based on the details of the DFT and the details of the experiment. So it's not always a straightforward exercise. And as you get even further into system-level thinking, the cycle life of a battery as a function of its electrolyte is probably non-universal, or at best semi-universal. It strongly depends not just on the active materials, of course, but on small details, even with ostensibly identical cell designs.
You can switch the supplier of your binder, and maybe the purity is slightly different; now you're getting some new reaction at the interface, and suddenly your cycle life can change quite significantly. These are all things that I think are unique about battery design. Batteries are just very complicated, and there are a lot of parameters we can't necessarily control. And here is the key difference: we can't develop these solutions within a representative set of all the future cell environments they might run into, the way you can sample a human population for a drug. I mentioned a few examples, and I'm going to talk through them; I want to put them on this plot of universality versus distance from commercialization. The bulk properties of materials, I would say, are highly universal: the melting point is the melting point is the melting point. But knowing the melting point of something means you still have a long way to go before you have a battery. Or, in the case we were just discussing, knowing the ion conductivity of a candidate solid electrolyte is great, but now you have to put it into the cell and optimize the cell, so you still have more problems to solve. A little further down is the optimization of the electrolyte at the system level for cycle life. Now you're getting closer to commercialization, but you're probably a little less universal. And what is even less universal still, but further along the development track, is cycle life prediction from early data. So if we're thinking about what can be discovered and translated across scales: as you go further to the right here, it becomes harder for entity A to hand an answer to entity B. That's where I'm going with this.
That doesn't mean it can't be done or isn't worth doing. Of course, these are all extremely impactful things to do, but there are additional challenges you have to address in order to do them. So I want to give some more detail on these three examples to illustrate the point, and then use that to think about how we can proceed given the constraints we face. Oh, and I should mention that as you move along the x-axis here, as you get closer to commercialization, you're typically dealing with increasing time and length scales. In cycle life prediction, for example, you're looking at the response of the entire cell, which is on the centimeter or meter scale, whereas for bulk properties you're looking at the angstrom scale. So that's another way of thinking about this axis. We'll start with electrolyte optimization for bulk properties, the intrinsic properties of the material. Coming back to the commercialization pathway: going from some rapidly computable parameter to an experimental demonstration of conductivity, I think of as jumping from the first low-fidelity computational step to low-fidelity experiments. By low fidelity here, I mean that the electrolyte is just one component of the cell. Demonstrating the cycle life of the entire cell would be a high-fidelity experiment; here we're just looking at a single component, and that's why I map it to the lower-fidelity, single-component experimental demonstration. So the question is: do physics-based simulations of bulk properties actually correlate with the experimental values? Having written a lot of papers on DFT, every time this question comes up I just pray to the DFT gods that the answer is yes. But of course it depends on the property. So let's go to ionic conductivity; as I mentioned, I've spent a lot of time working on this.
I wanted to show some data from a great paper, still a preprint, from Forrest Laskowski, who I see is here on Zoom. So, hey, Forrest, good to see you here. I love this paper, and I'm very excited to see it published. I pulled just a fraction of the data out of this work; there's a hugely valuable and rich dataset in there that, for a given structure, takes the experimental conductivity as reported in the literature and also computes a simulated activation barrier for ion diffusion. So it's Ea versus conductivity. We would expect these to be correlated: conductivity goes with the exponential of the negative activation barrier; this is a Boltzmann-type, Arrhenius process. The paper has pulled these data together for hundreds of data points, and as I mentioned, it's a great source; I just pulled a few here to look at the emerging trend. And you can see that it's actually pretty good. What this means is that if you take a new material and estimate the diffusion barrier, in this case not with DFT but with what's called the bond valence method, one level simpler than DFT but in the same vein, and you see a low activation barrier, and then you take the material into the lab, make it, and test the conductivity, there's a pretty good correlation. Things that show low activation barriers tend to be good conductors experimentally. So that suggests there might be some universality here. Of course, there is some spread in the data, so it's not perfect; there are outliers. And you can ask: what are the sources of error here? There could be a lot.
Any computational method that just looks at the crystal structure of the unit cell is not going to take into account anything involving grains and grain boundaries. We have some forthcoming work looking specifically at the role of grain versus bulk conduction as a predictor for the correlation between computed and experimental conductivity, so that's important. Defects and impurities typically are not going to be modeled unless you explicitly put them into the calculation. And with DFT, of course, you have semi-empirical parameters to worry about: thermostats, simulation times, all of this. With the bond valence method, there can be assumptions about what the right conduction pathways are. So there are sources of error here; there are things that are not addressed between these two vastly different time and length scales. But from this data, I ran a quick R-squared to see what the correlation actually looks like: R-squared is 0.66. So the computational method solves basically 66% of the problem, which is pretty good. In this language of universality, the answer is maybe semi-universal: we can solve 66% of the problem across these scales. So, summarizing: for ion conductors, do computational solutions translate to experimental solutions? If the proverbial BioNTech gives a DFT calculation to the proverbial Pfizer, what can they expect? As I just mentioned, the factors that determine these values are different. In computation you're dealing with parameters and simulation times; in experiments you're dealing with the effects of real life: water absorption, reaction with air, grain boundaries, pressures, and all these sorts of things.
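[Editor's note: the Arrhenius-style correlation and the R-squared just described can be sketched in a few lines of Python. The numbers below are synthetic stand-ins, not values from the Laskowski dataset; the point is only the mechanics of regressing log conductivity against a computed barrier and reading off the fraction of variance explained.]

```python
import numpy as np

# Hypothetical (Ea, log10 sigma) pairs: a computed activation barrier in eV
# (e.g. from a bond valence calculation) and a measured room-temperature
# ionic conductivity in S/cm. These values are invented for illustration.
ea_eV = np.array([0.20, 0.25, 0.30, 0.35, 0.45, 0.55, 0.60])
log10_sigma_expt = np.array([-2.1, -4.5, -3.0, -6.2, -5.0, -8.5, -7.0])

# Arrhenius/Boltzmann picture: sigma ~ sigma0 * exp(-Ea / kT), so at fixed
# temperature log10(sigma) should fall roughly linearly with Ea.
slope, intercept = np.polyfit(ea_eV, log10_sigma_expt, 1)
pred = slope * ea_eV + intercept

# Coefficient of determination: the fraction of the variance in the
# experimental conductivities explained by the computed barrier.
ss_res = np.sum((log10_sigma_expt - pred) ** 2)
ss_tot = np.sum((log10_sigma_expt - log10_sigma_expt.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"slope = {slope:.1f} decades/eV, R^2 = {r2:.2f}")
```

With a negative slope and an R-squared well below 1, the synthetic data behaves like the real case: a useful but imperfect, semi-universal predictor.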
So the point is that this is a semi-universal problem. Other bulk materials properties are different. Conductivity is notoriously hard to compute, but there are a lot of others. How about some of those? Well, the Ceder group has done a lot of fantastic work looking at different electrochemical properties, as of course has Chris's group. I pulled these two papers. This one is from 2010, so it's about 12 years old now, but it's really great work from Gerbrand Ceder's group looking at the correlation between predicted and experimental band gaps. Kohn-Sham PBE is probably the most commonly used approach, and you can see it's a consistent underestimate, but there's still a pretty good correlation. So there is some ability to translate across scales from the atomistic to the experimental. Another very interesting one is predicting the electrochemical stability windows of electrolytes. Most of the work in this space, if not all of it, has focused on the thermodynamic contribution to electrochemical stability. This other paper, from 2016, developed a new method for predicting the thermodynamic piece of the electrochemical stability window. I love this paper; I've cited it many times, and I think the methods are extremely interesting and useful, but they don't capture the kinetic piece. So there's still a question of how these thermodynamic calculations correlate with experimental measurements. And can you even experimentally measure what the window of an electrolyte is? It's not as easy as it might seem: you have compounding factors, chemical reactions, electrochemical reactions, and structural instabilities.
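[Editor's note: a consistent underestimate with good correlation is still useful, because a simple empirical correction can recover much of the experimental value. The sketch below fits a linear, scissor-style correction to hypothetical (PBE, experimental) band-gap pairs; the numbers are invented, not taken from the 2010 paper.]

```python
import numpy as np

# Hypothetical band gaps in eV: computed Kohn-Sham PBE values and the
# corresponding experimental values. Invented for illustration; PBE gaps
# are systematically smaller but track the experimental trend.
pbe = np.array([0.6, 1.1, 1.5, 2.0, 2.9, 3.6])
expt = np.array([1.1, 1.9, 2.4, 3.1, 4.3, 5.2])

# Fit experimental gap as a linear function of the PBE gap.
a, b = np.polyfit(pbe, expt, 1)
corrected = a * pbe + b
print(f"corrected gap = {a:.2f} * PBE_gap + {b:.2f} eV")
```

A slope greater than 1 reflects the systematic underestimate: once the correction is calibrated, a cheap PBE calculation becomes a usable estimate of the experimental gap, which is the "translation across scales" at work.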
So I think determining that correlation, essentially building a plot like this for electrochemical stability windows, is a super-high-impact problem that I would love to see some data on. Okay, so bulk materials properties are generally, maybe not universal, but pretty semi-universal, depending on the application. I want to turn now to cell-level optimization of the electrolyte: predicting the cycle life of the cell rather than just some bulk property. Here I have this as going from our high-fidelity computational demonstration to a high-fidelity experimental demonstration, where high fidelity again means system-level. So the question is: are good electrolytes universally good across consistent battery designs, say, lithium metal? Can I find the holy-grail electrolyte for all lithium metal batteries? Put differently: what information do you need if you want to build a perfect physical model that can tell you the cycle life of the cell based on the electrolyte composition? Well, you need to know all the details of the cell: the active materials, the additives, the suppliers, the purities. You need to know how the cell is being used: the charging rate, the temperature, the surface area, the pressure. There are tons of variables here that can be quite impactful. Or, to phrase it yet another way: can a good lithium metal electrolyte identified in lab A also work for lab B? Is there a global optimum? We put this to the test, and I'm being told I have five more minutes, so I need to speed up quite a bit, which is a bummer because this is maybe the most interesting slide. So let me take a few minutes to talk through it.
This is some work from our partners at Sepion Technologies, who have been very gracious in letting me share this data. Sepion is originally a separator company that is now building its separators into lithium metal cells and optimizing them for fast charging. The question is to identify the right electrolyte for the job. I've redacted the actual values, but here you see the cycle life of these cells: capacity versus cycle number for a number of different formulations. I want you to note these three lines, which are their highest-performing electrolytes. These are essentially unique combinations of otherwise conventional components, the typical things: carbonates and so on. Combining them in non-intuitive ways over the course of six weeks of iterative optimization gave a huge increase in cycle life, just over that brief amount of time. You can compare this with, for example, the electrolytes proposed in the Nature Energy paper from Jeff Dahn's group. What we found in this case was that the performance of those electrolytes was highly dependent on the charging rate. The authors mention this explicitly in the paper: there is a strong charging-rate dependence, and we found that to be exactly true in this case as well. I think it was a good illustration of the fact that the details here are really important. This electrolyte, which may have worked really well in the context it was studied in, loses its advantage once you start changing some of those parameters. And so what we, and our partners at Sepion, found is that cell and electrolyte co-optimization is super important.
And so the tweaking of what otherwise might be considered fairly traditional components can get you huge increases in cycle life versus something that might be out of the box but might not match your parameters exactly. And folks in industry who are listening may say, well, of course, we know the performance depends on the details. But I have often found that the discourse sort of downplays the importance of those things: charging rates, mass loading, all these other details. And so we really believe that co-optimization is going to work significantly better, and is really even necessary, rather than taking something right out of the box, so to speak, for electrolyte design. Now let me give my final example here, and this is cycle life prediction from early data. So as I mentioned, Will's group has this seminal paper from 2019, led by Severson and Attia. And here we're working more towards: the cell has been built, and we're just predicting the properties of that built cell. You're probably all familiar with this work that looks for subtle signatures in the dQ/dV curves between cycles to predict the long-term performance of the cell. This model worked very well in this case, a super impactful paper. The question we asked was, will this model actually work for different cell chemistries? So we tested this with one of our partners that we work with at Aionics, and we found something very interesting. So let me just walk you through the way we think about this, which is: you start with this full universe of features, which were developed in this paper. There are something like a dozen of them, and these are different ways of quantifying small, sometimes imperceptible, changes in these response curves. And the idea is, can you combine them into some model?
It doesn't necessarily need to be a linear model, but can you combine them into some model that can be trained on data to predict cycle life? So in this paper, dataset one is, I believe, 124 graphite-LFP cells that are cycled; that data is fed into a model-building algorithm, and you find this cycle life model. This predicts cycle life for dataset one. So you have some regression coefficients and you have some features, or some subset of the features. Then what we did was say, okay, let's take that same pipeline and let's train a new model on cycle life data for 72 lithium metal NMC cells. We would naively expect that the failure modes and the signatures of that failure are going to be completely different, right? And that this model, these features, may not work; they may not have any translatable capability. We build another cycle life model here for dataset two. Question one is, do either of these models work? And question two is, are they actually the same model? Question one is easy to answer: we found that both of these models worked extremely well. So for the model from the original paper, I don't remember offhand the exact error, but you can see here that its predictive power is very strong. And what we found here in our test set, which I think was maybe 15 cells or so, was: hey, this model works really well too. So we can predict cycle life within the first 10% of life with very high accuracy. Okay, so the models both work, but are the models the same? The answer is no. So building a model for predicting cycle life from one dataset is not going to directly predict cycle life in a new dataset, right? Things are different. You can imagine that you change the charging rate, you change the shape of these curves, and now suddenly those regression coefficients need to be rescaled.
So you can't do a direct comparison, but following the pipeline, following the workflow that was put together here, actually did work. And I still think this is just amazing, because it suggests that there's something universal not in the trained model, but in the pipeline and the feature set that was developed in this really great work. So does a cycle life model trained on one cell chemistry predict cycle life in another cell chemistry? Well, both models work when they're trained on their own respective datasets, but the model parameters are not the same. That's the upshot: you have to retrain these models on the new datasets at hand. So the emphasis is that the Severson and Attia framework, but not the model itself, does appear to be universal, while the model, you could say, may be non-universal. Okay, so I've kind of blown through these last points very quickly, but this actually brings me to the end of my talk. So hopefully I haven't gone over by too long. Putting all these things together, I think the question is, well, what can we learn from this? And I have a few points here. I think the main thing to emphasize is that in battery design, the performance of a component is strongly dependent on the details of the cell. And that's really fundamentally different from drug discovery, where if something works in trials, you can expect it to work in the field. For many high value problems, the global optimum does not exist. The solutions are semi-universal or non-universal and need to be co-optimized with the cell itself. So I think what this means is that the drug discovery playbook can accelerate battery design, but it needs its own unique twists. One must learn how cells react to different solutions and different components, or cures in this case, and they need to be adjusted accordingly. And this is why, and I haven't talked really at all about machine learning.
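To make the retraining point concrete, here's a minimal sketch of the idea in Python: the same featurize-then-regress pipeline is applied to two datasets, both fits work, but the learned coefficients come out different. The feature, the single-feature linear model, and all the numbers are invented stand-ins, not the actual features or values from the Severson and Attia paper.

```python
# Sketch: one pipeline (feature extraction + regression), retrained per
# chemistry. Synthetic data throughout; the "variance" feature is only a
# loose stand-in for the paper's dQ/dV-difference features.

def variance_feature(dqdv_early, dqdv_later):
    """Variance of the difference between two dQ/dV curves."""
    diffs = [a - b for a, b in zip(dqdv_early, dqdv_later)]
    mean = sum(diffs) / len(diffs)
    return sum((d - mean) ** 2 for d in diffs) / len(diffs)

def fit_linear(xs, ys):
    """Ordinary least squares for one feature: y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Hypothetical "dataset 1" (say, LFP/graphite) and "dataset 2" (say,
# Li-metal/NMC): same feature, different cycle-life response.
feat1, life1 = [0.10, 0.20, 0.30, 0.40], [2000, 1500, 1000, 500]
feat2, life2 = [0.10, 0.20, 0.30, 0.40], [400, 330, 260, 190]

model1 = fit_linear(feat1, life1)
model2 = fit_linear(feat2, life2)
# Same pipeline, each model fits its own data, but the coefficients differ:
print(model1, model2)
```

The point of the sketch is only the workflow: `fit_linear` is reusable across chemistries, while the numbers it returns are not.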
So I'm excited to have Chris really dive into that, but that word learn is so important, because we think that if you're going to try to innovate and translate those innovations at scale, you have to learn how different cells react to different solutions and then tweak those solutions. The out-of-the-box concept, I think there are very few cases where that actually works, and very few cases where it works and provides a really valuable solution. So there must be some element of learning, of machine learning and co-optimization. And I would say, finally, each company is really its own disease area, not each cell design. So not all lithium metal batteries, for example, are the same. And the final point is just a shameless plug, which is that Aionics has successfully delivered solutions to a number of cell manufacturers. And we believe that the broad adoption of a co-innovation model across this industry is not just possible but really inevitable, especially if we want to accelerate progress in the battery space. So with that, I think I'm a few minutes over, but that's the end of my talk. I will thank you all for your attention, and I'm happy to take any questions. All right, Austin, thank you so much for that great talk. I think we only have time for one question, and it actually comes from Forrest, who you highlighted earlier. Good morning, Forrest. His question is on standards. So Austin, you alluded to the challenges that come with widely varying cell formats, testing protocols, cell designs, and so forth. And Forrest comments that it seems like standards are very much needed, especially in academia, but perhaps outside as well. Can you comment on standardization and what Aionics is doing in this area to lead the field? Yeah, that's a great question.
And it's a tough one, because it strikes me, and I'm curious what you think, Will, as well, that there are known unknowns and then there are unknown unknowns, right? There are variables that change where you know they're changing and how they're changing, and those can and should be reported. But then there are variables that might be changing where, for example, I mentioned changing the supplier of a certain component, and that changes performance, but you have no idea why or how. So I think, in general, it's certainly important to record and verify the things that we know are changing. And then Kat's group had a great paper come out earlier in the year, and I will probably butcher the title if I try to recall it, but it's something along the lines of a minimum verifiable information set for battery research. They essentially published a framework for what data should be reported in these studies in order to reproduce them, things like this. I would point to that paper as a really good framework to use. And in general, yes, this is certainly a very important thing for us to be working on. Great, Austin, we can discuss this a bit more at the panel discussion as well. All right, well, thank you once more, Austin. And I'm delighted to now welcome Chris to the stage. All right, good morning, Chris, or still good morning. So as I introduced earlier, Chris is a professor in the materials science and engineering department at Northwestern. I've known Chris for quite a while and across many different problems too, from hydrogen to energy storage to machine learning and so forth. The one thing I wanna mention about Chris is that he has this very unusual convergence of experiences, from national labs to industry and then back to academia.
Chris spent some time at NREL, and spent quite a bit of time at Ford leading computational work on hydrogen storage applications. And of course, more recently he's been at Northwestern. So I think he has a very unique perspective on how to find the right problems to work on. Industry is very pragmatic and academia can be very scholarly, and I think Chris really embodies the best of both worlds, with a lot for all of us to learn from. The other thing I wanna mention about Chris is that although he is a highly regarded academic, behind the scenes he has been training many of the leading innovators today. So I encourage you to go look at his trainees and protégés; you'll find them at very important places, starting important companies, or being faculty elsewhere training the next generation. So Chris, we're very delighted that you can take two hours today to come and speak to us about the intersection of data science and materials science. Chris, all yours. Fantastic, thank you very much. All right, so thanks very much for the invitation, thanks for the kind introduction, and thanks to Austin also for the kind words and the beautiful talk before. So what I wanna do today is talk a little bit about data-driven methods, specifically applied to computational materials science. My group is a computational group; we do mostly density functional theory. I almost always forget to say that and don't give the details of DFT, but basically everything in my talk is going to rely on data generated from density functional theory. I'm gonna show an example of battery materials to illustrate some of these ideas, but in some sense the real topic of the talk is the methods, the data-driven methods themselves, and the battery materials just kind of serve as a backdrop. I also put the people who are responsible for the work on the first slide so that I don't forget to acknowledge them.
The most important thing is to acknowledge the former students and postdocs from my group who've done the work that I'm gonna talk about, some of whom are listed here, and you can see where they are now. In addition to people from my own group, we have had a pretty strong collaboration with researchers at the Toyota Research Institute, and the work that we've done with TRI is gonna feature towards the end of my talk, so I wanna acknowledge them as well. Okay, so like I say, the general question, can I advance the slide, yep, okay. So the general question that I wanna address in this talk is, how do we use data-driven methods to accelerate the discovery of new materials, and in particular battery materials? Like I say, my group does large scale computational DFT, and DFT calculations these days have matured to the point where standard calculations of standard properties like total energies and densities of states and band gaps are really rather routine. Many of these kinds of calculations have been automated; workflows have been generated to automate this. So the kind of calculations that we did one material at a time when I was a graduate student, nowadays that's not how it works; you often use automated workflows to generate larger quantities of data. And these have been assembled into large scale databases. So there are databases like the Materials Project, which you might be familiar with, but also at Northwestern we've been developing the Open Quantum Materials Database, or OQMD, for the better part of a decade now. It's a very large scale DFT database. It contains known, experimentally synthesized compounds, about 50,000 of them, and then increasingly more and more of what I call hypothetical inorganic crystalline solids.
In other words, compounds that, for one reason or another, in some project, we've made up along the way, done calculations of, and stored in the database. So the database now contains over a million inorganic crystalline solids. And as you're gonna see later on in the talk, this kind of very large and very diverse dataset actually serves as a good training set for machine learning approaches. Okay, so there are all sorts of things you can do with a large scale dataset like this, and I wanna talk about how you might mine a dataset like this to search for new materials. I think there are two generic ways you might go about this, and the simpler of the two approaches is what I call the direct approach, or what a lot of people call high throughput screening. The idea behind this approach is conceptually very simple. You start with a large number of candidate compounds that you might think are interesting for a particular application. Then, if you can design a set of screening criteria, criteria that you think the material must satisfy in order to be a good candidate, and if you can devise a suitably aggressive version of these criteria, you can narrow down this list from a very large number of candidates at the top to only the ones that satisfy the screening criteria, which might be a much, much smaller list. The idea is that these promising candidates that come through the screen can then be, for example, given to your experimental colleagues for potential synthesis and characterization, and, well, either validation or refutation, I guess, of the predictions. So I just wanted to show you an example of this in the lithium battery space, just as an illustration of how this screening approach works, and then I'll turn to the second of the two approaches.
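The funnel described above, many candidates in, few survivors out, can be sketched in a few lines. Everything here is hypothetical: the candidate entries, the property names, and the thresholds are invented purely to illustrate the shape of a screening pipeline.

```python
# Sketch of the "direct" approach: apply successively aggressive filters
# to a candidate list. Entries and thresholds are made up for illustration.

candidates = [
    {"name": "A", "e_hull": 0.00, "voltage": 3.9, "o_vacancy_cost": 1.2},
    {"name": "B", "e_hull": 0.02, "voltage": 4.1, "o_vacancy_cost": 0.3},
    {"name": "C", "e_hull": 0.30, "voltage": 3.8, "o_vacancy_cost": 1.5},
    {"name": "D", "e_hull": 0.01, "voltage": 5.6, "o_vacancy_cost": 1.1},
]

screens = [
    ("stable", lambda c: c["e_hull"] <= 0.05),        # near the convex hull
    ("voltage ok", lambda c: c["voltage"] <= 4.5),    # accessible electrolyte window
    ("holds oxygen", lambda c: c["o_vacancy_cost"] >= 0.5),  # resists O release
]

survivors = candidates
for label, keep in screens:
    survivors = [c for c in survivors if keep(c)]
    print(label, [c["name"] for c in survivors])
# Only candidate "A" passes every screen.
```

In a real study each lambda would be replaced by a DFT-derived quantity, but the logic, an ordered list of predicates that successively prune the pool, is the same.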
The example I wanna show this for is the so-called lithium-rich Li2MnO3-based composite cathodes. These are well-known lithium-rich cathode materials, formed from a mixture of this Li2MnO3 lithium-rich phase and a LiMO2 layered phase. There are two potential roles the lithium-rich phase might serve. One is as a stabilizing influence: it can stabilize the layered phase during charge and discharge. And secondly, the lithium-rich phase of course contains a lot of lithium, so it can serve as a capacity reservoir; extra capacity can be obtained from this material. These are very famous materials, discovered by Michael Thackeray and the Argonne group a long time ago. In general, they're pretty interesting because they have relatively high voltage and capacity compared to some other candidate electrode materials. But of course, they still have issues associated with things like voltage fade and oxygen release. So in the same family, there have been other materials discovered beyond lithium manganese, and these other materials exhibit interesting properties as well, with other metals such as ruthenium, tin, iridium, and so forth. One of the interesting things about these materials is that there's also evidence of anungus, excuse me, anionic redox occurring during charge and discharge, rather than relying solely on the cation to produce all of the redox capacity. So that's a very interesting and very active area that lots of people are exploring. And the thing I wanna highlight here, in my illustration of computational screening, is that in this family of lithium-rich compounds, there are still a lot of potential compounds out there that might be interesting for lithium battery applications. So this is actually a study that we did a number of years ago, but it still illustrates the point.
We basically looked at a large number of these Li2MO3 compounds for a number of different metals. The metals are listed on the x-axis of this plot; the y-axis is essentially a measure of the thermodynamic stability of the compound, a measure of whether or not the compound is on the lowest energy envelope, which we call the convex hull. In this plot, a negative value means that the compound is calculated via DFT to be thermodynamically stable, on the convex hull. A positive value means it's above the convex hull: technically it's unstable, there's something lower in energy on the hull. But you see, the bottom line is that if you look at all of the experimentally observed compounds in this class, all of them are calculated via DFT to be either thermodynamically stable or very close to stable. So this is an interesting validation point, and it gives us a clue that thermodynamic stability is maybe a good screening criterion in this idea of computational screening. We can calculate all sorts of other properties of these materials. For example, if you wanna know something about the synthesis and synthesizability of these materials, you can analyze under what thermodynamic conditions they will be stable when in contact with gas phase species like O2. So you can calculate essentially the stability in equilibrium with O2 as a function of temperature, or equivalently as a function of the partial pressure of the O2 gas. And you can develop maps like this that will tell you which compounds are not only ground state stable, which is the plot above, but which will be stable at relevant synthesis temperatures as well. Another thing we can look at to screen materials like this is the electrochemical performance.
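The stability measure on that y-axis can be made concrete for the simplest case, a binary system. Given the (composition, energy) vertices of the lower convex hull, a candidate's "energy above the hull" is just its energy minus the hull line interpolated at its composition. All numbers below are invented for illustration.

```python
# Sketch: energy above the convex hull for a binary system. A negative
# return value means the candidate would break the hull, i.e. it would be
# a newly predicted stable phase. Energies are hypothetical (eV/atom).

def energy_above_hull(x, e, stable_phases):
    """stable_phases: (x, E) vertices of the lower hull, sorted by x."""
    for (x1, e1), (x2, e2) in zip(stable_phases, stable_phases[1:]):
        if x1 <= x <= x2:
            e_hull = e1 + (e2 - e1) * (x - x1) / (x2 - x1)  # hull line at x
            return e - e_hull
    raise ValueError("composition outside the hull's range")

hull = [(0.0, 0.0), (0.5, -0.40), (1.0, 0.0)]  # endpoints + one stable phase
print(energy_above_hull(0.25, -0.15, hull))    # positive: above the hull, unstable
print(energy_above_hull(0.25, -0.25, hull))    # negative: would be newly stable
```

For real multicomponent systems the hull is computed in higher dimensions (databases like the OQMD provide exactly this), but the interpretation of the distance to the hull is the same.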
So we can start removing lithium from these compounds and calculating the energy cost to remove lithium, or in other words, the voltage. This plot is just the voltages of given materials for various amounts of delithiation, if you will. And from this kind of calculation, we can classify materials: either the material possesses voltages in a range that's accessible with today's electrolytes, in which case we would call it an active cathode material; or the material is not capable of doing redox; or, in some cases, we find that the voltage is essentially a very high voltage, or mostly at high voltage, in which case the capacity is probably inaccessible, but the material can still act as an inactive stabilizer of the layered phase. And then the final screening criterion that you might apply to a problem like this would be something to do with cathode degradation. On the left here, we looked at the stability of these materials with respect to oxygen release. So we looked at the oxygen vacancy formation energy of these materials as a function of the lithium content, and we're able to screen out materials that look very unstable with respect to oxygen loss. We can also look at things like metal migration in these materials, which would lead to undesirable phase transformations as a function of charge and discharge. And when we put all these things together, it looks like the kind of screening strategy I talked about. You start with a large number of candidate oxides in the OQMD and narrow that down to the compositions you're interested in. I didn't talk about this, but in addition to the single metal compounds, we looked at double metal compounds as well, and mixtures of those materials.
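The "energy cost to remove lithium equals the voltage" statement can be written out as a tiny worked calculation. The standard relation is that the average voltage between two lithiation states follows from the total energies of the lithiated phase, the delithiated phase, and lithium metal; since one electron transfers per Li, eV per Li maps directly onto volts vs. Li/Li+. All energy values below are hypothetical, not taken from the study.

```python
# Sketch of the average-voltage calculation from (hypothetical) DFT total
# energies in eV. E(Li metal) here is an illustrative placeholder value.

def average_voltage(e_lithiated, e_delithiated, n_li_removed, e_li_metal=-1.90):
    """V = -[E(lithiated) - E(delithiated) - n * E(Li metal)] / n."""
    return -(e_lithiated - e_delithiated - n_li_removed * e_li_metal) / n_li_removed

# Hypothetical host where removing 1 Li costs 5.7 eV relative to Li metal:
v = average_voltage(e_lithiated=-50.0, e_delithiated=-44.3, n_li_removed=1)
print(round(v, 2))  # about 3.8 V: a range accessible with today's electrolytes
```

A screen like the one in the talk would then bin each material by where this number falls: in the electrolyte window (active cathode) or far above it (capacity inaccessible, possible inactive stabilizer).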
And then we start applying our screens: the screen for thermodynamic stability, the screens for functional properties like the voltage, the oxygen stability, the metal migration, and so forth. And ultimately we wind up with a small number of candidates that look promising for these applications, and we can plot this, for example, in this plot of voltage versus capacity, to illustrate some of the materials that emerge through this screening strategy. Okay, so like I say, this is an illustration of how this screening strategy works for one particular example, but you can probably imagine that this kind of screening strategy is pervasive. Many groups are using this screening for many different types of materials and many different applications. Now, the problem with this screening strategy is that sometimes it's not really accessible. In other words, sometimes the space that you wanna search with computation is simply too large to just generate all of the data with high throughput computation. This is illustrated in this slide: if your interest is in a compound with some ternary crystal structure, the number of possible ways to permute elements into that ternary crystal structure might be 100,000. Doing 100,000 DFT calculations, to be honest, only a few years ago we would have said that's impossible, though today maybe people don't balk at it quite as much. But once you get to a quaternary compound, or even higher order, this combinatorial explosion starts to hit, and you can quickly get into the millions or hundreds of millions of candidate compounds. And even by today's standards, doing 100 million DFT calculations is something that most people are not willing to undertake. So this is one way that the computational screening approach can break down.
Another way is that oftentimes the property that we're really interested in is not one of the simple-to-calculate properties. Maybe the property we're interested in is not actually even contained in these large datasets. So that's another way that the screening strategy can break down. And one way to circumvent these problems is by using machine learning. The idea of machine learning workflows, very generically, is illustrated on this slide: if we have a large dataset of materials and their properties, and we show many, many examples of this material-property relationship to a machine, can it learn that relationship? And if it can, can it then predict that property for new materials that it hasn't been trained on? Usually these machine learning models are much, much faster to evaluate than DFT, so we can search through much, much larger spaces by constructing these models and then using them to search. I would say that a machine learning workflow consists generically of at least four steps. If we're gonna use a data-driven method, of course we need data in the first place, and we probably need to process it in some way. Then, to construct a machine learning workflow, we have to come up with some representation of the material, which I'll talk about in just a bit. And finally, we have to do the actual machine learning itself. The data in the illustrations I'm gonna talk about all comes from these high throughput DFT data sources. The property learning, or the learning itself, is for the most part fairly standard stuff, the kind of things that you might find in scikit-learn or other available packages. I think the real differentiator in this space oftentimes is the material representation, so I wanna spend a little bit of time talking about that.
The material representation is essentially just how we describe to the machine what a material is. In other words, what things we tell the machine we think could be important for describing this material-property relationship. If we take a historical view of how these machine learning property models have been constructed, in the early days, one of the first things people were doing, around the 2014, 2015, 2016 time frame, was to featurize material compounds by just featurizing attributes of the elements. The idea is that if you have a compound consisting of several different elements, you can featurize that compound by just looking at the properties of the elements that make it up. And the properties you choose can be essentially anything that's easily known for the elements, such as melting point or electronegativity or atomic radius or whatever, okay? Using these kinds of elemental attributes and different statistical ways to combine them, we can easily come up with a few dozen or even a few hundred elemental features, and applying these features in machine learning models, using things like random forests to train them, was, like I say, one of the earliest things done in this field. And to be honest, it's still a very powerful approach. We still use it all the time, even though it's a relatively simple and straightforward idea. Now, one of the main drawbacks, which you can see right off the bat, is that if you only have features of the elements in your machine learning model, then you have explicitly not incorporated any information about the crystal structure of the compound. And so there's an obvious way to make progress: you might think, well, let's add crystal structure information to the representation.
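The elemental-attribute idea can be sketched directly: describe a compound by statistics (say, a fraction-weighted mean and a range) over tabulated properties of its elements. The tiny property table below is a hypothetical stand-in, the numbers are only illustrative, and real featurizers use many more properties and statistics.

```python
# Sketch of composition-only featurization: statistics over elemental
# properties. Property values here are illustrative, not authoritative.

ELEMENT_PROPS = {          # (electronegativity, atomic radius in pm)
    "Li": (0.98, 152),
    "Mn": (1.55, 127),
    "O":  (3.44, 60),
}

def featurize(composition):
    """composition: {element: atomic fraction}. Returns, per property,
    the fraction-weighted mean and the max-min range."""
    feats = []
    n_props = len(next(iter(ELEMENT_PROPS.values())))
    for i in range(n_props):
        vals = [ELEMENT_PROPS[el][i] for el in composition]
        fracs = [composition[el] for el in composition]
        mean = sum(v * f for v, f in zip(vals, fracs)) / sum(fracs)
        feats.append(mean)
        feats.append(max(vals) - min(vals))  # a simple "spread" statistic
    return feats

# Li2MnO3 -> atomic fractions 2/6, 1/6, 3/6
x = featurize({"Li": 2/6, "Mn": 1/6, "O": 3/6})
print(x)
```

Vectors like `x` are what would be fed to a random forest or similar model; note that two polymorphs of the same composition get identical features, which is exactly the drawback discussed above.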
One of the ways we did this a number of years ago was by using Voronoi tessellation. We essentially constructed a Voronoi tessellation of the crystal structure. This allowed us to featurize properties of the crystal structure, like the coordination number, how many facets there were in the Voronoi polyhedra, the areas of those faces, the bond lengths, and so forth. All of these are features that you could then add to your elemental features as a crystal structure representation. I'm not gonna show you the data, but I'll just tell you that, as seems very obvious, adding crystal structure information does improve the accuracy of these machine learning approaches. Now, one very large step forward in this kind of representation occurred in 2018, when Jeff Grossman's group proposed the idea of constructing crystal graphs of crystalline solids as a representation. This is obviously a structure-dependent way to encode information into the machine learning workflow. And the idea here is, well, not quite simple, but straightforward, I guess: the crystal structure can be represented as a graph, where the nodes on the graph are atoms and the edges on the graph are bonds, or neighbor distances, essentially. This can be combined with convolutional neural nets to produce machine learning models, and the accuracy of these crystal graph representations, generally speaking, outperforms the simpler elemental features I talked about. The drawback, and every time you have an increase in accuracy there's almost always a drawback associated with it, is of course that these networks tend to be fairly data hungry, so usually you have to have a fair amount of data to actually train these models. Here's an example of the use of this kind of crystal graph convolutional network approach.
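The atoms-as-nodes, neighbor-distances-as-edges idea can be shown in miniature. The sketch below builds the edge list for a toy one-dimensional periodic "crystal" with a distance cutoff; a real crystal-graph pipeline works in 3D with lattice vectors and richer edge features, so this is only the skeleton of the representation.

```python
# Sketch of a crystal-graph edge list: connect atoms closer than a cutoff,
# storing the distance as the edge feature. Toy 1D periodic cell.

def build_graph(frac_positions, cell_length, cutoff):
    """Return directed edges (i, j, distance) under 1D periodic
    boundary conditions, using the minimum-image convention."""
    edges = []
    n = len(frac_positions)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = abs(frac_positions[i] - frac_positions[j])
            d = min(d, 1.0 - d) * cell_length   # wrap across the boundary
            if d <= cutoff:
                edges.append((i, j, round(d, 3)))
    return edges

# Three atoms in a periodic cell of length 4.0 (units arbitrary):
edges = build_graph([0.0, 0.25, 0.5], cell_length=4.0, cutoff=3.5)
print(edges)
```

In a graph network, these `(i, j, distance)` edges plus per-node element features are what the convolutions operate on.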
Here's a quaternary oxide compound, three different metals in a double perovskite type crystal structure. The idea is, we train a machine learning model based on the energetics from the OQMD. Then we use this crystal structure and substitute elements into it. This is a very large scale problem; we can easily generate about a million different possible compounds. Again, this is more than we want to calculate from DFT. So the idea is we predict the energy of these 1.5 million compounds using our machine learning model, and use that model to help guide which calculations we think are worth doing with DFT. So we basically use machine learning to predict the energy, and only in the cases where the machine learning model actually thinks this is likely to be a low energy phase do we then calculate with DFT. And then we can look at the stability of these compounds. Using this approach, we very quickly discovered about 1,000 new stable compounds. This is actually an extraordinarily large number; I'll get more into that in just a minute. And this is much, much faster: about 150 times more likely to find a stable compound using this kind of directed approach than with an undirected search, where we just randomly substitute in elements and calculate everything from DFT. This illustration is just for the energy, but of course we can do this for a large number of different properties as well. Okay, sorry, I'm just gonna check and see how I'm doing on time here. All right, so the final thing I wanna talk about today is this idea that discovering new materials, either from computational screening or from machine learning approaches, is becoming almost standard fare in the field.
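The surrogate-guided loop described here, score everything with the cheap model, spend the expensive calculation only on the most promising fraction, can be sketched end to end. Both functions below are stand-ins: the "surrogate" is just the true energy plus a constant bias (a real trained model would be noisier), and the "DFT" function simply counts how often it gets called.

```python
# Sketch: ML-guided screening. The surrogate prunes the pool so the
# expensive calculation runs on only a small shortlist.

def surrogate_energy(x):
    """Stand-in for a fast trained model (true energy + constant bias)."""
    return (x - 0.30) ** 2 + 0.001

def expensive_dft(x):
    """Stand-in for slow ground-truth DFT; counts its own calls."""
    expensive_dft.calls += 1
    return (x - 0.30) ** 2
expensive_dft.calls = 0

candidates = [i / 100 for i in range(100)]    # 100 hypothetical compounds

# Step 1: score every candidate cheaply, keep the best 10%.
shortlist = sorted(candidates, key=surrogate_energy)[:10]

# Step 2: run the expensive calculation only on the shortlist.
stable = [x for x in shortlist if expensive_dft(x) < 0.002]

print(len(stable), expensive_dft.calls)   # hits found with only 10 "DFT" calls
```

The efficiency claim in the talk is exactly this effect at scale: the directed search concentrates the expensive calculations where hits are likely, instead of spreading them uniformly over the whole candidate space.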
Many, many groups are doing these kinds of computational screens and discovering new materials all the time; it's really kind of amazing. Actually, if you look at the literature and the pace at which we're computationally predicting new compounds, we're entering an era where, in databases like the OQMD, we've actually crossed over to the point where there are more predicted stable compounds in the database than experimentally observed stable compounds. So this is quite extraordinary. Only five or six years ago, it was pretty rare to computationally predict a new stable compound; now it's happening all the time. There are lots of papers in the literature, in lots of different fields, where people have predicted these phases. So I would say we're pretty good at that now: pretty good at predicting new stable phases, and maybe even stable phases that have a certain desirable attribute. What we are not so good at, from a computational point of view, is telling our experimental colleagues how these newly predicted compounds could be synthesized. Predicting the synthesis is difficult, and this is fast becoming a bottleneck. We can't experimentally test all of these predictions, because there are just so many of them, and we don't really direct our experimental colleagues in the synthesis. So I wanna talk a little bit about this problem in my last 10 minutes or so, however much time I have left. And I wanna talk about a new idea that we've been exploring over the past couple of years, and that is to use concepts from network theory to predict synthesizability. Okay, so I'm gonna take a brief detour into network theory and talk about how we're gonna actually utilize ideas from network theory and connect them with this idea of predicting synthesizability.
So, okay, if you take as a given for the moment that I'm going to apply ideas from network theory, the first question is, how do we actually create a network in materials science? I'm trained in phase diagrams and phase transformations, so this is the most natural thing that occurred to me when we started thinking about network ideas. If you think about a phase diagram, like this one here, the nickel-aluminum phase diagram: if you were describing it to somebody who had never seen a phase diagram before, you would say, well, it's got composition on one axis and temperature on the other, and it's essentially a map of what phase or combination of phases is stable at a given composition and temperature. Now, another way of looking at it is that it's a map of regions where single phases are stable, indicated in blue, and, in this binary system, regions of two-phase equilibrium, indicated in white. So it's essentially a map of the blue regions and the white regions, indicating one- or two-phase stability. A ternary phase diagram can be described similarly. This is an isothermal section of a ternary phase diagram. It's divided into regions of single-phase stability, two-phase stability, indicated in white again, but now you also have these three-phase triangles, so there are regions of three-phase stability. Again, the phase diagram can be described as just a map telling you where you have one-phase, two-phase, and three-phase stability. Now, the other thing you'll notice is that these phase diagrams also tell you what the stable phases are, the ones indicated in blue, and how those phases are connected to one another in the phase diagram. For example, this phase here is stable, and this one here is stable too, and there's a tie line connecting the two.
So there's stable two-phase coexistence between those two phases. But, for example, this ternary phase is not connected, say, to this aluminum-rich phase. There's no connection in the phase diagram, meaning that even though these two phases are both themselves stable, if you brought them into contact they would not form a stable two-phase mixture; they would react to form something else. Okay. So these phase diagrams tell us something about connections between materials. I'm gonna simplify the problem, because we're doing DFT, and basically say, let's take the T equals zero isothermal version of these phase diagrams, the convex hull. The convex hull of a binary system can be illustrated by a composition-energy plot like this one, and the convex hull of a ternary system is shown here, where I've projected out the energy axis and shown it on a Gibbs triangle. Again, the convex hull indicates the stable phases, and it indicates tie lines, which represent two-phase stability between phases. And so this starts to look like a network. So this is the idea: we're gonna take these convex hulls and map them onto graph representations that tell us something about the connections between the various phases in terms of stability. Now, the interesting thing about these convex hulls is that you can compute them from large databases like the OQMD, not just for binary or ternary systems, but for quaternary and higher-order systems. This over here on the right is actually a graph representation of, I think, an 11-component convex hull, essentially all of the 3d transition metals plus oxygen. And you can see that the convex hull starts to get very complicated, right? But the nodes are still the stable compounds on the hull, and the connections, the edges of the network, represent the stable two-phase equilibria. Okay.
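As an aside, the hull-to-graph mapping described above is straightforward to sketch with generic graph tooling. The phases and tie lines below are a made-up toy example, not actual hull output from the OQMD:

```python
# Toy sketch (hypothetical data): nodes are stable phases on a convex hull,
# edges are stable two-phase equilibria (tie lines).
import networkx as nx

# Imagined stable phases and tie lines for a Li-Ni-O system.
stable_phases = ["Li", "Ni", "O2", "Li2O", "NiO", "LiNiO2"]
tie_lines = [
    ("Li", "Li2O"), ("Li2O", "LiNiO2"), ("LiNiO2", "NiO"),
    ("NiO", "Ni"), ("LiNiO2", "O2"), ("Li2O", "NiO"),
]

G = nx.Graph()
G.add_nodes_from(stable_phases)
G.add_edges_from(tie_lines)

# Phases joined by an edge coexist stably; unconnected pairs would react.
print(G.number_of_nodes(), G.number_of_edges())
print(G.has_edge("Li", "NiO"))  # no tie line: these two would react on contact
```

The same construction scales to arbitrary dimension, since a graph doesn't care that the underlying hull is 11-dimensional or 100-dimensional.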
So there are two important points I need to make about this. One is that if we're considering an n-component system, so for binary or ternary, n would be two or three, an n-component system actually has a convex hull that's n-dimensional, right? The binary system had a two-dimensional convex hull; the ternary system had a three-dimensional convex hull, but we projected out the energy axis. Beyond that, it's very hard to even show you a convex hull. That's why I map these things onto a graph representation. The other point is that we can compute these convex hulls for essentially arbitrary n. There are some caveats to that, which we could talk about if people are really interested, but essentially we can compute these for arbitrary n. So the question is, what if we allow n to take the maximum possible value? Well, the maximum number of components would just be the number of elements in the periodic table. That's literally the maximum possible n-component system you could imagine. And since there are roughly a hundred elements in the periodic table, you could imagine that if you could compute this 100-component convex hull, it would be essentially the phase diagram of everything, but it would be a very complicated 100-dimensional object. Okay, so we don't know this 100-dimensional phase diagram experimentally, but we can compute the T equals zero version of it from databases like the OQMD. We can essentially compute the convex hull of the entirety of this one-million-compound data set from the OQMD, and this is gonna define our network. Okay, so, maybe I should go back for a second. This network, this convex hull, is gonna be a 100-dimensional object, so obviously I can't show it to you in any sensible way. I can't even really show you a graph representation of it, because it's just too complicated.
So really all I can do is tell you things about it, and just very quickly, I'm gonna tell you a few. One is that there are about 21,000 stable compounds on this convex hull of everything, and about 40 million tie lines in the network. So it turns out to be a very, very dense, highly connected network: every material in the network has a tie line with about 20% of all other materials. This means that, as a network, this materials network is much, much more highly connected than the kinds of social networks and other things that network theorists tend to study. And then we can analyze all sorts of topological features of the network. I'm not gonna go into these in any great detail, but we can look at the average number of tie lines that come out of every node, we can look at the diameter, which is kind of equivalent to the six-degrees-of-separation idea, and we can look at things like the global clustering coefficient. The specifics aren't terribly important to understand; these are all topological features of the network. Okay, so now I'm finally gonna come back to this idea of synthesis. How are we gonna make a connection between this network and synthesis? Well, many of the compounds in our network are experimentally observed compounds, taken from databases like the ICSD. And what this means is that there's a time stamp: a reference to their discovery and a year in which they were discovered. So we can associate a year with every node in our network, which means we can actually rewind time and ask what this network looked like in 1950 or 1960 or 1970.
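The topological summaries mentioned above (average tie lines per node, diameter, global clustering coefficient) are all standard graph quantities. A toy sketch on a small dense random graph, standing in for the real 21,000-node network:

```python
# Standard topology metrics on a small, dense random graph (hypothetical
# stand-in; the real materials network has ~21,000 nodes and ~40M edges).
import networkx as nx

G = nx.erdos_renyi_graph(n=200, p=0.2, seed=42)  # dense, like the hull network

mean_degree = sum(d for _, d in G.degree()) / G.number_of_nodes()
diameter = nx.diameter(G)        # the "degrees of separation" notion
clustering = nx.transitivity(G)  # global clustering coefficient

print(f"mean tie lines per node: {mean_degree:.1f}")
print(f"diameter: {diameter}")
print(f"global clustering: {clustering:.2f}")
```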
And we can follow the evolution of stable materials in this network as a function of the year, and also the number of tie lines and things like that. Now, the connection we're gonna make with synthesizability is this: if you watch the network as a function of time and look at a case where, say, a new compound appears in the network, maybe there was some hint in the topological features of the network, before that compound appeared, that something was about to appear at that spot. And if so, maybe we could use those topological features to disentangle this information and predict when a compound is likely to appear in the network. So this is exactly what we did. We tried to machine-learn this quantity, which I have up here at the top, which we call the discovery timeline. The discovery timeline is essentially a sequence of values which is zero as a function of time until the year in which compound i is discovered, and becomes one after that. This is the quantity we want to machine-learn for a wide range of compounds i, and if we could predict it, then of course we could predict essentially the year in which a given material could be synthesized. We're gonna use as the representation for this model those topological features of the network. Skipping over all of the details and going straight to the bottom line, we find that a random forest model achieves relatively good accuracy for this discovery timeline. And what this means is that for a given time, like today, if you ask, as of today, what is the likelihood that a given material has been synthesized, we can assign a probabilistic score to it.
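A hedged sketch of the discovery-timeline setup described above: for each (compound, year) pair the label is 0 before discovery and 1 after, and the inputs are topological features of the network. Everything here, features, labels, and the decision rule, is synthetic for illustration, not the actual network features or ICSD data:

```python
# Synthetic illustration of learning a 0/1 "discovered yet?" label from
# network-topology features, then reading off a probabilistic score.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000
features = rng.random((n, 5))  # e.g. local degree, clustering, year, ...
# Made-up rule: compounds in well-connected regions get discovered sooner.
discovered = (features[:, 0] + 0.3 * features[:, 1] > 0.7).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(features, discovered, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Probabilistic "synthesis likelihood" score for a held-out compound.
score = clf.predict_proba(X_te[:1])[0, 1]
print(f"test accuracy: {clf.score(X_te, y_te):.2f}, sample score: {score:.2f}")
```

The probability output is the useful part: it turns a yes/no classifier into a ranking over predicted-but-unsynthesized compounds.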
So we can now go back to the literature, look at some of these papers where compounds have been computationally predicted, and assign a synthesis probability score. I should say, this still doesn't tell us what the synthesis recipe is. It still doesn't give guidance to our experimental colleagues in that sense, but it does at least give us a probabilistic score for the likelihood that the material could be synthesized in the near future. And so if you have a limited budget, you might say, okay, I'm only gonna look at compounds that are computationally predicted and whose synthesis score is relatively high. Okay, so in conclusion, I just wanna finish up by saying that this last part of the work, the network theory and synthesizability work, was done in collaboration with TRI, and most notably with Murat Aykol. Murat actually put most of this information up on the web. You can look at all of this network theory material at this website, and even see the synthesis scores and things like that. It's really fun to play around with. Okay, so with that, I believe I'm out of time, so I'm just gonna put up a bunch of references for you and stop. Thank you. All right, Chris, thank you so much for that comprehensive overview and the summary of your work over the past decade. This is very exciting. We are a little bit short on time, so Chris, if it's okay with you, let me invite Austin back and we'll migrate straight into the discussion. How does that sound, Chris? That's fine. Okay, all right, welcome back, Austin. So I was very inspired, Chris, by your talk, especially the final parts about understanding the phase space of the entire periodic table, certainly intractable any other way.
And the question that came up for me is, because of the complexity of the space, your model really has to capture all of the intricate relationships between everything, right? The major patterns and also the minority patterns. I'll give one example of how I'm thinking about this problem. If you think about calculating things like melting point, and I think Austin referred to this, you in principle have to know all the interaction energies between all the solutes, and even for a five-component system this is already very difficult to do experimentally. You not only have to capture the binary interaction energies, but also the ternary and higher-order ones, depending on how accurate you wanna be. So my question is as follows. We see a lot of machine learning models and data-driven methods basically focused on explaining the majority trends, right? You're capturing the strongest trends that explain as much of the data as possible. But philosophically, sometimes we're looking for a diamond in the rough, that unusual phase in the phase diagram that is very rare, a fluke of nature that doesn't follow the majority trend but has some minor signal associated with it. I'd love to get both of your thoughts on how models should be developed, so that we not only capture the majority trend, but can also discover these unusual surprises in the periodic table and elsewhere, and don't miss these very important materials. Maybe, Chris, I can ask you to comment on this first. Yeah, sure. To be completely honest, I would say that for the majority of the types of model development and screening studies that we do, what you described is a very, very difficult problem, right? The models are trained in a certain space, and they're most accurate near the space they've been trained in.
Saying this another way, they're better at interpolating than extrapolating. So in some sense, developing models that can really predict things completely outside the domain where they've been trained is very difficult. I'm not sure I have a specific idea for this. I can tell you one example of how we do use this kind of information, maybe not in exactly the way that you meant. Oftentimes we train machine learning models and we see outliers in the predictions, and sometimes these outliers are very, very far off from the actual values. In many cases, this actually alerts us to errors in our original data. It's not a prediction of a brand-new source of physics or a brand-new mechanism of superconductivity; in our experience, it's usually a hint that, if we go back, we find the DFT calculation in the database was not converged. So it is useful in that sense: it serves as a kind of data-cleaning method. But I understand that's not exactly the goal you had in mind. I don't know, I'd be interested to hear if Austin has other insights. Yeah, Chris, the first word I wrote down was extrapolation, so I was going to say basically the same thing you said. Will, I think this is a very important question, and it's a challenge for us in materials science because our data sets are often small, right? And so the models are typically simple and fairly limited. As you were describing this, Will, I was imagining in my head, here's the universe where most of the data lives, and then here's a data point way up here. You fit a line to where all the data is, and now you're completely missing this point that's far outside of the domain.
So the question is, Will, how can you build a model that fits most of the data well but also captures that point? And that really becomes this question of extrapolation, or of generalization of the model, which is really hard to know a priori. One challenge I've seen in this space is that the error of a given model is always a function of the distribution you test that model on. As materials scientists, we often build a model in the domain where we know the materials, then test it in some domain of new, exciting materials, and the errors can be completely different between those two distributions. So I think working to quantify model errors as a function of the distribution is important. And there are some methods I've seen, and I say this with a caveat because I haven't used them myself, that will weight heavily concentrated points less strongly than ones that are off on their own. That could be a way to keep yourself from drowning in the big glob of data in the middle. But that's a challenge. I think the point Chris made about structural models versus compositional models is a good one, in that structural models are just more likely to generalize than compositional models. We expect that to be the case because we know structure is important for materials. So there are things we can do to increase our odds of capturing that extrapolation, but it is a hard thing to know a priori. Thanks, Austin and Chris. Chris, you wanna add something? Maybe I could say one other thing about that. The approach that is often used to address problems like that is Bayesian approaches and active learning.
And of course, these approaches essentially weigh the exploration-versus-exploitation trade-off. But usually the idea is that if you have some region of your parameter space that's not represented in your training set, the uncertainty in that region is very high, and the solution is: get data. I think that's generally speaking the best approach, right? Just get data in the regions of space where the model seems to be predicting this kind of outlier behavior. Chris and Austin, I really resonate with this discussion. I think what it really comes down to is how we weigh different parts of the data set, right? As someone who's trying to understand new physics or relationships I don't already know, I weight toward these unusual observations, right? But Chris, like you said, they could also just be wrong results, wrong experiments, wrong predictions. It's hard to weigh. But I think this is what we do all day long in science: try to prioritize how exciting something could be. And I really haven't seen this done as much, because in the materials and robotics area it's sometimes hard to assign this excitement index. And this is where maybe the human intuition comes in. It's not the excitement of the property itself, because that's easy to quantify. It's the excitement about whether this could be new physics that we haven't learned before, where we might only have one data point, because we all know how the scientific world works. People all migrate to, you know, YBCO, so we have a lot of points around it. But then the one outlier could be the really exciting one, the one that isn't described by the current physics. So anyway, I just wanted to share my thoughts on that a little bit. Maybe just to build on this a little bit.
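The "uncertainty is high where there is no data, so go get data there" idea from this exchange can be sketched with a Gaussian process, whose predictive uncertainty grows away from the training set. The one-dimensional setup below is entirely hypothetical:

```python
# Active-learning sketch: acquire the next measurement where the model's
# predictive uncertainty is largest. Synthetic 1-D data for illustration.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(2)

# Measurements clustered near x = 0.5; the model has seen nothing near the edges.
X_train = rng.normal(0.5, 0.05, size=(30, 1))
y_train = np.sin(5 * X_train[:, 0])

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1),
                              optimizer=None, alpha=1e-6)
gp.fit(X_train, y_train)

X_pool = np.linspace(0.0, 1.0, 101).reshape(-1, 1)
mean, std = gp.predict(X_pool, return_std=True)

# Uncertainty is tiny near the data cluster and large toward the edges,
# so the acquisition lands far from what we've already measured.
next_x = X_pool[np.argmax(std), 0]
print(f"acquire next measurement near x = {next_x:.2f}")
```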
I think both of you highlighted that informatics methods are really a tool for down-selection, right? You begin with this huge space of stuff and they allow you to hone in. But it's not really quite a scalpel, right? It's more of a sledgehammer for figuring out which whack-a-mole to hit. So one thought that's been on my mind, and on some of my colleagues' minds, is: what is needed for R&D broadly speaking? Austin, you alluded to the R&D cycle very well in your talk. On the academic side, we're certainly very excited about discovering new things, whether mechanisms or materials and chemistries. But on the other hand, as you showed, Austin, in industry it's really about refinement and optimization, especially as you go from R to D, even in an industry setting. So the question that comes up quite often these days is: how do we use machine learning, how do we use data methods, for product development? This is when you have already narrowed your design space, you've found the interesting technology you wanna develop, and now you're refining, right? The differences between the various formulations or whatnot are becoming smaller and smaller. And then, Austin, as you pointed out, other things become more important, right? Like how you're designing your battery, how you're testing. So maybe I can ask Austin to comment on this first: how is industry using these methods for this refining, product-development aspect of materials science? That's a great question. I think once you're at the product phase, it's a very different problem than when you're at the materials phase. To use the examples I was sharing: on one hand, you have discovery of new ion conductors, where you're just looking at bulk properties. And on the other hand, you have a full cell.
Maybe you're trying to predict the cycle life or something regarding the cell performance, and that's a function of all of these different parameters, so it becomes very complex very quickly. The general rule of thumb is that if you want to model something very complex, you need a lot of data, and you need to map that space of parameters pretty exhaustively. This goes to what we were discussing earlier, Will, about the known knowns, the known unknowns, and the unknown unknowns: exploring those parameter spaces is just super important so that you can map everything out. So to the question of how informatics can be used in industry on the product-development side, I think conceptually it's very similar to what we would do in the materials world, except you have a lot more parameters to sample and you're generally gonna need much bigger data sets. We've seen a number of applications on that side, optimizing different parts of manufacturing and usage. You usually get the privilege of much bigger data sets, but also much more complex, multidimensional data. So it's a different problem, I think. Well, I'm not sure I'm qualified to answer that, but I was thinking about my experience at Ford versus Austin's at Aionics. Ford and Aionics, I guess, are quite different companies in terms of lifespan. But one thing that is definitely true about a company like Ford is that they sit on large quantities of data, but most of it is buried in tech reports from the 60s and 70s, right? I don't really know exactly what's going on in a company like that, but I can only imagine that they're trying to mine their own data sets, because I suspect these companies that have been around for a long time are sitting on a gold mine of data, if they can just get it out of their libraries.
The other thing I would say about that: on my very first slide, I showed the names of some former students, and two of them, Murat Aykol and Soo Kim, are now working at Rivian. I thought they might actually be very interesting speakers for this series, because they would come not only from the DFT and machine learning perspective, but also now from the vehicle-manufacturer perspective, so it might be interesting to hear their thoughts. Great, Chris, I'll write their names down. I'm already planning for the next couple of months. Chris, I think this is a very important one. Austin, please. Sorry guys, I have one more point that I was thinking about as Chris was saying that. He mentioned that Ford has been around for a long time, so there's almost certainly data going back decades sitting around. Funny enough, we see the exact same thing happening in companies that are three years old, where a set of experiments will be run by someone, they'll save the data, they'll leave the company, and someone comes along and restarts those experiments. We're often asked, can we harmonize these data sets? Can we build a model on both data sets? And the answer is usually: if you know all of the ways in which those data sets differ, and you can include those as features, then sure. If the only difference is that one was captured on a hot day and one on a cold day, well, put temperature in your model and see if it works. But I think it's a major challenge in large organizations like this, or any large organization, just keeping data clean and labeled and consistent. That's one of the biggest challenges we see in these companies. It's not doing the machine learning, it's curating the data. So Will, could I ask Austin a question about this? Please.
Yeah, so in training the kinds of machine learning models that we build for simple material properties, one of the advantages of having a big database like the OQMD is that if we're, say, training a model of stability, we have plenty of wildly unstable things in the database. So we have plenty of examples of negative data, and this actually serves to train the model quite well. So I wonder about the issue of negative data when it comes to Will's question of product development. Is there such a thing as negative data, is it stored, and do you run into that problem? It's a good question. We love negative data, and it really helps models, but it can be a challenge to convince a team to build, say, 100 cells that they know are not gonna work, right? For the sake of making the model better. It's kind of a philosophical thing. Depending on how you think about the value of these approaches, a company may or may not wanna do it. But we do always feel this tension between exploration and exploitation: you usually wanna get to the best answer as fast as possible, but sometimes you can actually get there faster if you explore for a while before you start running down a certain direction. And it's funny how much it varies between the cultures of different companies, how much faith they put in these approaches. What we always say is: hey, if nothing else, we're enabling very targeted and efficient data collection. So even if the models never learn, we can at least identify the most important points to try to learn from. And then your worst-case scenario is that you were essentially guessing at random. I shouldn't say most people would guess at random, but in certain applications it might be close to that.
So, yeah, it's interesting to see that play out across different companies. I'm a big fan, or a big proponent I should say, of automation. If I look at the DFT example, we store all of these wildly unstable compounds in our database just because the workflow is automated, so the students don't actually have to do anything; it's just automatically stored for them. I think if they had to do anything to physically put the data in the database, we would lose a lot of that negative data, to your point. So I wonder if that lesson could be learned in the experimental community: what data could be automatically incorporated into the data set? I think this is a very interesting discussion, especially on the experimental side of things, for ingesting all this data and storing it. For a company like Ford, with a century of history, this may be quite difficult just from a legacy perspective. But for companies that have started in the past five years, certainly everybody has recognized the importance of really good databases of all sorts of data, right? Not just the final performance data of a battery or something, but everything along the way. And I think companies are beginning to have these extraordinary data sets; Austin, you alluded to some of these from your partners. So I absolutely agree that automation, data ingestion, and data analysis are at the forefront, and integration with data generation on the experimental side could be quite exciting. I wanna come back to this earlier point about negative data points and so forth, and I think I have another philosophical question for you. Sorry, all I'm thinking about today is hypothesis testing, right?
So if we think about humans doing the work of understanding and optimization, the reason we have a lot of negative data points is that most of the time our hypotheses are wrong, but we're very good at coming up with interesting hypotheses to test, and then we learn from the results of testing them. I know this is also an emerging field in machine learning: being able to mimic and emulate the human in terms of hypothesis making and hypothesis testing. I almost think machine learning models are too good at hypothesis testing, so you don't have those bad hypotheses, and the bad ones are what we learn from, in the context of, say, the active learning, Austin, that you were referring to. So it'd be great to get your thoughts on how hypothesis testing can be better incorporated into machine learning to identify causal relationships. In my mind, this ties everything together in terms of understanding the underlying rules behind the good and bad data points in your data set. Chris, you wanna take a stab at that one first? I didn't speak fast enough. I was gonna say, oh, I'll punt to Austin. Yeah, I don't have a very insightful thing to say about that, except that, at its core, machine learning and these kinds of approaches are very good at finding hidden correlations in data sets, right? And so I know it's important for us to try to learn information, learn physics, from these models, but I'm always very nervous about this issue of correlation versus causation.
And I've told this story many times, so I'm sorry if people have heard it before, but it's an example that still sticks with me. Not too long ago, I had a student who was training a machine learning model of a certain property. He built the kind of composition-only models, featurized the data, and we looked at the features that came out as most important in the model, and they were really non-intuitive, things we just didn't expect at all. We spent the better part of an afternoon brainstorming about why those features could actually be important, and we came up with this very convoluted idea that would explain why they were so important. We thought, well, this is fantastic, right? There's no way we would have ever thought of that on our own. Machine learning really taught us something; we felt really good about this. The next day, the student comes into my office and says, I'm really sorry, Professor Wolverton, I made a mistake. I retrained the model, and now the features are totally different. Those features we were looking at yesterday were not the most important ones. And so the lesson I learned is: yeah, we can invent a hypothesis to explain almost any set of features. I know everybody knows this, but I feel like we just have to be careful about correlation versus causation. We can very easily get trapped, even when we know that's a possibility. That's a great anecdote. I was gonna share a similar point. Probably about a year ago, and I really should give credit to my late advisor, Evan Reed, I started reading about what is probably my favorite machine learning theory with my favorite name: probably approximately correct learning theory, sometimes abbreviated as PAC theory.
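For readers who want the quantitative flavor of the PAC theory just mentioned, the standard sample-complexity bound for a finite hypothesis class in the realizable case can be computed directly. This is textbook PAC, not anything specific to the speakers' work:

```python
# Realizable-case PAC bound for a finite hypothesis class H: with
# m >= (1/epsilon) * (ln|H| + ln(1/delta)) samples, the probability that a
# hypothesis consistent with all of them has true error > epsilon is < delta.
import math

def pac_sample_bound(hypothesis_count: int, epsilon: float, delta: float) -> int:
    """Samples sufficient for the finite-class, realizable PAC guarantee."""
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / epsilon)

# The fewer data points you have, the larger a hypothesis space you can
# "fit" purely by chance: the infinite-monkey effect from the discussion.
print(pac_sample_bound(10**6, epsilon=0.05, delta=0.01))  # prints 369
```

Note the logarithmic dependence on the size of the hypothesis class: even a million candidate explanations only demand a few hundred validation points, but with a handful of points almost any class contains a spurious perfect fit.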
And it's a fascinating framework for understanding how likely your hypothesis is to be wrong, based on the amount of data you're validating it against and the universe of hypotheses you're considering. It was developed in the 1980s, and there's a great text I would refer people to by Scott Aaronson, who attacks it from a philosophical perspective and says: if you have n data points and you're considering all of these possible explanations, then when n gets small, you're probably going to find one of those explanations that works perfectly, not because it actually explains the data causally, but just by chance. It's kind of like the infinite monkey theorem: you give a monkey a typewriter and eventually it'll type the full works of Shakespeare. I found this very helpful in my own work, because you can actually quantify rigorously the likelihood that your hypothesis is wrong. I'm just at the beginning of getting into this framework, but I've found it very useful, and I'd recommend it to others who are thinking about this question. Great, Austin, folks can add that to their reading list. We have only a few minutes left, so I thought I would end with the following question to both of you. All three of us work very actively in materials science. And as Austin pointed out, one of the key differences between, say, pharmaceuticals and materials science is that materials science is inherently very multi-scale, in both time and length. Austin, you gave a great introduction to the problem for solid electrolytes: yes, you can predict ionic conductivity quite well, but in a real battery you may be dominated by defects and grain boundaries and so forth, which are not captured in your atomistic simulations.
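[Editor's note] The finite-hypothesis-class PAC bound Austin describes can be made concrete in a few lines. The standard result says that, with probability at least 1 - delta, any hypothesis consistent with m examples has true error at most (ln|H| + ln(1/delta)) / m. This is a minimal sketch; the function names are mine, not from the talk or any library.

```python
import math

def pac_error_bound(hypothesis_count, n_samples, delta):
    """Error guarantee for a hypothesis that fits all n_samples points:
    with probability >= 1 - delta, its true error is at most this value
    (finite-hypothesis-class PAC bound)."""
    return (math.log(hypothesis_count) + math.log(1 / delta)) / n_samples

def pac_sample_bound(hypothesis_count, epsilon, delta):
    """Samples needed so a consistent hypothesis has true error <= epsilon
    with probability >= 1 - delta."""
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / epsilon)
```

This makes the "infinite monkey" point quantitative: with a million candidate explanations and only 10 data points, `pac_error_bound(10**6, 10, 0.05)` exceeds 1, i.e. the guarantee is vacuous and a perfect fit means nothing, whereas with 10,000 points the same bound drops below 1%.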
Obviously this is a hugely important emerging field, but I was wondering if I could get a few minutes of thoughts from each of you on where we should be going with multi-scale modeling, so that we can take density functional theory and atomistic calculations and translate them all the way up. What new physics is needed, and what tools are needed? Taking that solid electrolyte example, how do we model everything along the way, all the way up to microstructure and everything in the cell? I don't know who would like to start. I think this is another one of these really tough challenges right now. Maybe since I made Chris answer first last time, I'll take this one. Yeah, it is a tough one, and I don't know that I have the answer, but what I have seen work and be very interesting, and this is part of why Venkat has been a key part of our team at Aionics, is some of the work that his group has been doing, and probably elsewhere too, but I'll just speak to this one: looking for atomistic descriptors that you can find and compute with DFT that do correlate, at least in some sense, maybe not perfectly, with what you can observe at the macro scale. In particular, I'm thinking about a paper from a few years ago that looked at taking various electrolyte molecules, putting them on the surface of lithium metal, computing the charge transfer from electrolyte to electrode, and finding that this actually correlates quite well with the macro-scale properties of the SEI. There are a lot of things you can compute from DFT, but there's really no guarantee that any of them will be relevant at the macro scale. Searching systematically for the ones that are, I think, will be impactful, and we're just at the beginning of that search. So that's what I would say. Thank you, Austin. Chris? So this is an area that I was very active in when I worked at Ford.
So maybe my knowledge of this field is a little dated, but with that caveat: the kinds of approaches we had the most success with at Ford were these parameter-passing schemes, calculating properties at a low length scale, passing those up to mesoscale models, and using that to drive information that could be passed up to macro-scale models, and so forth. I still think this is a very useful approach and a very useful way to look at the problem. Ultimately the trick is that, in going from one length scale to the next, you're essentially trying to integrate away all of the irrelevant degrees of freedom and keep all the relevant ones. And that, of course, is the hard part of the story: figuring out which degrees of freedom you can integrate away without losing the physics of the problem. So maybe that's an area where machine learning could play a role, in trying to identify the relevant degrees of freedom, the ones that are integrable. It reminds me a little of the problem I talked about earlier, of coming up with a representation to build a machine learning model. That's a real challenge as well, and so this idea of using machine learning to learn the representation itself is very attractive. It starts to sound the same to me: using machine learning to figure out how to integrate up the length scales. I don't know, that could be an answer. If I could just add one point to that. That made me think of an example which was in my slides: I had computed band gap versus experimental band gap, and what you see is that PBE is pretty accurate, but it's just a little bit under the line where it should be.
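[Editor's note] The PBE underestimation just described is often handled exactly as Austin goes on to suggest: fit a simple empirical correction against experiment and apply it to new predictions. A minimal sketch of that recalibration is below; the (PBE, experimental) band-gap pairs are hypothetical illustrative values, not real measurements or data from the talk.

```python
# Hypothetical (PBE, experimental) band-gap pairs in eV, for illustration only.
pbe_gaps  = [0.6, 1.1, 2.5, 3.4, 5.3]
expt_gaps = [1.1, 1.7, 3.4, 4.5, 6.2]

def fit_linear(xs, ys):
    """Ordinary least-squares fit of y ~ a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

a, b = fit_linear(pbe_gaps, expt_gaps)

def corrected_gap(pbe_value):
    """Shift a raw PBE gap toward the experimental scale."""
    return a * pbe_value + b
```

On data like this, the corrected values land closer to experiment than the raw PBE numbers for every material in the fit, which is the empiricism-for-accuracy trade that the discussion is pointing at: you gain predictive accuracy across scales without necessarily knowing why the correction works.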
And I think if you just accept that you're going to have to build some empiricism into a scale-bridging model, you can add a correction term that shifts that PBE band gap up and gives you something closer to experiment. So I think what you were saying, Chris, about learning the right representations, and mixing the atomistic with empirical correction terms, could potentially get you to more accurate numbers across different scales, even if you don't really know why. Well, Austin and Chris, on that note, I'm afraid we're out of time. I really enjoyed the discussion. We are still very early on this journey, and much progress and many breakthroughs will be made by yourselves and others in the coming years and decades, so hopefully many of these directions will bear fruit down the road. Chris, Austin, thank you once more; I really enjoyed your talks and the discussion today. So, Kaley, if I can have the closing slides, please. This symposium marks the end of our summer quarter, so we'll resume approximately four weeks from now with our fall quarter presentations. Please be on the lookout for announcements from us, and I look forward to rejoining many of you. The recordings for this quarter, including this talk and the three before it, will be posted in about two weeks on our YouTube channel. For those of you who want to rewind and replay and listen to some of the key insights here, please come to the Stanford StorageX website. And with that, I'd like to wish everyone a great weekend, and I hope to see you all soon. Thank you.