Thank you for having me. I'm very glad we already had these two speakers, because many of the subjects I will need were already covered: larger models, transfer learning, and, in the last talk, the convex hull. So thank you for that. We will be speaking about machine learning thermodynamically stable materials, and I will start with a short introduction to why we actually care about this problem. Point one is that we really need to discover alternatives to current technology. For example, the neodymium-iron-boron magnets used in electric cars need extra dysprosium, whose production is quite limited; this can become a problem in the future. For cobalt, demand is projected to increase fifty-fold in the next decade due to lithium batteries, again in electric cars and so on, and cobalt is largely produced in the Democratic Republic of the Congo, so there are human-rights issues as well. There are a lot of these cases. The second point is that we would also like to enable new technologies: p-type transparent conductors, room-temperature transparent ferromagnets, room-temperature superconductors, and so on. All of these would be very nice, but nothing is really there for industry; in the case of room-temperature superconductors there are only results under quite ridiculous conditions. In our case we are not looking for anything specific; our strategy is more widespread: find as many new stable materials as possible, and hope that researchers who then know the materials are stable will scan them for the properties they need. So how do we do this? The traditional way to find new materials efficiently is prototype-based high-throughput searches.
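To make the prototype-based idea concrete, here is a minimal sketch; the ABX3 prototype and the candidate element lists are invented for illustration, but they show how filling one prototype with candidate elements multiplies into many compositions:

```python
from itertools import product

# hypothetical candidate elements for the three sites of an ABX3 prototype
a_site = ["Ca", "Sr", "Ba"]
b_site = ["Ti", "Zr"]
x_site = ["O", "S"]

# every combination of site occupations is one candidate composition
compositions = [f"{a}{b}{x}3" for a, b, x in product(a_site, b_site, x_site)]
print(len(compositions))  # 3 * 2 * 2 = 12 candidates from one tiny prototype
```

With realistic candidate lists of several dozen elements per site, the same product over sites yields the hundreds of thousands of ternary compositions per prototype that the talk mentions.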
We take a structure — you already know the perovskite structure from the last talk, just as an example — fill it in with candidate elements, and get a few hundred thousand compositions. Then we use maybe a million or more CPU hours for a ternary composition to do all the geometry optimizations and get the energies, and then we go to the convex hull, which was introduced in the previous talk: we have the energy relative to all competing phases at each composition, and we know a compound is stable when this distance to the hull is zero. So far so good, but if we want to do thousands of these ternary prototypes it becomes really expensive, and for quaternaries we already have 15 million compositions, so we can't really do this anymore. What people have been doing for maybe the last seven years is machine learning: you predict all the distances to the convex hull with a machine-learning model and then validate only the candidates closest to stability. In the beginning this was quite limited: the first models — we also entered this field maybe five years ago — were built for a single prototype, with handcrafted features, decision trees, kernel ridge regression, and so on. It worked, but only for single prototypes. Over the last years people have tried deeper neural networks based on the composition alone. However, here we have a problem: if we consider a perovskite structure and a composition like the one shown on the slide, we don't actually know which element sits on which position in the structure, so the model can't learn to differentiate these polymorphs properly. Purely compositional models therefore don't work, even though some of them show good performance on the Materials Project or similar datasets.
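The stability criterion used throughout the talk — the distance to the convex hull — can be sketched in pure Python for a binary system. All compositions and energies below are invented; real hulls live in higher-dimensional composition space:

```python
def lower_hull(points):
    """Lower convex hull of (x, E) points via Andrew's monotone chain.

    x is the composition fraction, E the formation energy per atom; only
    the lower part of the hull is thermodynamically relevant.
    """
    hull = []
    for p in sorted(points):
        # pop the last hull point while it lies on or above the chord to p
        while len(hull) >= 2:
            (ox, oy), (ax, ay) = hull[-2], hull[-1]
            if (ax - ox) * (p[1] - oy) - (ay - oy) * (p[0] - ox) <= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

def energy_above_hull(known_phases, x, energy):
    """Distance of a candidate (x, energy) above the hull of known phases."""
    hull = lower_hull(known_phases)
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= x <= x2:
            e_hull = y1 + (x - x1) / (x2 - x1) * (y2 - y1)
            return energy - e_hull
    raise ValueError("composition outside hull range")

# elemental references at 0 eV/atom plus one known stable binary phase
phases = [(0.0, 0.0), (0.5, -1.0), (1.0, 0.0)]
print(energy_above_hull(phases, 0.25, -0.3))  # hull at x=0.25 is -0.5 -> 0.2
```

A compound with a distance of zero sits on the hull and is predicted stable; the screening strategy described next amounts to computing this quantity (with a model instead of DFT) for millions of candidates and validating only those near zero.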
Now, you could extend these compositional representations by an extra dimension and say, okay, we have one vector for each site, but then we are limited to one prototype again and can't profit from big data and pre-trained models, as I mentioned already. So we need structure sensitivity. I think everybody has heard about crystal graphs during the last days: we map the atoms of the crystal structure to the nodes, and for the edges people usually use some representation of the interatomic distance. However, we have the problem that we want to directly predict the relaxed energy, or the distance to the convex hull of the relaxed structure, and we don't have the right atomic positions, so we can't really do this. The idea we had instead was to take the physics out and say: okay, this atom is the first neighbor, and we don't care whether it is one angstrom away or two; we just use an embedding for the graph distance — first neighbor, second neighbor, and so on — then throw more parameters at it and see if the model learns it. As we already heard a little in the first talk, our models are not that large in absolute terms, but they are large for materials-science purposes. We also use an attention mechanism — not the same as in natural-language processing, but somewhat similar; it was also used in Roost, for example, another application in materials science — with some small modifications. Let's just say we end up with a message-passing network with maybe 60 million parameters to train. It doesn't need that many GPU hours, but still maybe a few thousand. We also thought we need more data, as we have quite a few parameters, so we curated a new dataset: by now we had maybe two million calculations of our own group lying around, plus AFLOW and the Materials Project, and in the end we arrived at 2.4 million crystal structures and a rather complete
convex hull, which is also important for us to determine the stability of the compounds we are predicting. You can find this dataset online, so please go ahead — maybe it's useful for someone here. We then trained on this dataset, and here is again one of its pitfalls, which was already mentioned in the first talk: on a normal, randomly selected test set we get a near-perfect error, better than chemical accuracy — but this is just the in-distribution error. If we instead hold out, for example, all mixed perovskites, for which we had a larger dataset lying around — that is, remove from training all materials with the same composition and space group — we end up with an error of 500 meV per atom. Part of the problem is that these held-out materials were rather unstable, so there is a distributional shift of maybe 1 eV; if we consider only the ones closer to stability we get a more reasonable error — still not great, but okay — and these are the materials we are interested in anyway; nobody really cares about the ones at 2 eV. Then we took the pre-trained model and checked how good we could get with transfer learning on a specific dataset. As you can see, the blue curve of the pre-trained model always performs better than the model without pre-training, especially in the low-data regime, while another purely compositional model, modified so it can differentiate these polymorphs, is not really competitive. Then, since we had the model anyway, we also ran a high-throughput search over these mixed perovskites. You can see in orange the training distribution and in blue the predicted systems that we validated with DFT — we validated everything below, I think, 200 meV — and we found quite a few stable ones, and we think that, with distortions
— we know this from a different study we did on these mixed perovskites — you can often stabilize them by more than 100 meV, so maybe a few more of these will be stable in the end. Another interesting metric: these are the predictions for the unrelaxed structures. Going from the unrelaxed structure to the energy, or distance to the convex hull, of the relaxed structure, we are actually really good now, even though we only enter the graph distance — really competitive with starting from the relaxed structure. We use one small trick here: we use a few different cell-constant ratios, so we make a few predictions for each composition and simply take the minimum energy, and this way we get a better prediction for the relaxed structure from the unrelaxed one. Up to here, this was all already published; now come some new results. We continued this pre-training with more compounds, producing roughly 400,000 more DFT calculations, and retrained the model, and it is now performing better. One big problem before was that nobody had really sampled the lanthanides and actinides much, so the error for these was higher; now they are better sampled, with better performance there. We also trained the model to predict the volume, so we need fewer geometry-optimization steps in the end. Now we are working on a lot of high-throughput searches: we are trying to do all binary, ternary, and quaternary prototypes we have with fewer than 20 atoms in the unit cell and space-group numbers larger than 10 that appeared at least 10 times in our database. This next slide actually changed just yesterday: before, we had maybe 1,300 of these ternary prototypes done, but now we have read all the output files of the VASP calculations, and now we
have searched roughly 1 billion compounds. You can see the median distance to the convex hull for the predicted ternaries and binaries here: about 50 meV for the binaries, which are a lot easier, and about 100 meV for the ternaries. We found more than 23,000 compounds that were below our previous hull — which is quite a lot — or below 10 meV, which is still super close to stability; in total we found 43,000. Maybe just to give a small impression of the scale: below this white line, at 100 meV, we have roughly 1.6 times as many inorganic crystal structures as there are in the Materials Project. Now we are just left with the quaternaries and have to see how they will go. To summarize: crystal graph attention networks can predict properties accurately from unrelaxed structures and discover new materials. The challenges we are facing are mostly related to the data, at least the most pressing ones for us. There is the element bias I already talked about; we think this will be corrected more or less just by doing more and more high-throughput searches, and it is getting better. Prototypes are different, because we will always be making calculations, or predictions, for prototypes that we haven't sampled yet, so we wanted to do some active learning there. We probably weren't that successful so far, but I already learned a lot at this conference and talked to all the people I could find who did active learning, so maybe with this knowledge we can be more successful. The second point is really the accuracy of the DFT itself. We use the PBE functional, like basically everybody in the community doing high-throughput calculations, but it is actually quite inaccurate for formation energies — the errors are unreasonably large in comparison to those of our machine-learning models. So we recently published a dataset of 175,000 PBEsol
geometry optimizations with SCAN single-point calculations, and the SCAN calculations should give a better formation energy, or distance to the convex hull, for most material groups. This is also available online — it cost maybe 10 or 15 million core-hours, something in that region — so please feel free to use it, so that we get something out of this money, or basically so that someone can profit from it. There are also band gaps for these with SCAN, with high k-point sampling, which may be interesting for someone. Then we are just left to explore the space of all the other known prototypes, and afterwards we will have to see what happens. Okay, thank you very much. Oh, of course, most importantly, I didn't do this work alone: a lot of people were involved in this project, and I would like to thank especially my supervisor, Miguel Marques, who, despite being a supervisor, does an unreasonable amount of calculations, and this helps a lot. Thank you for your attention.

Thank you, Jonathan, for a wonderful talk. How about some questions?

Yes — great talk. I'm just a bit curious about what your strategies are going to be, I guess from the point of view of the quality of your data. Like you said, PBE may present — sorry, can you hear me? Now it's better, okay. You mentioned there are some challenges in using the PBE functional. What are your plans there?

Yes, so basically, one problem is that there is also very little experimental data on formation energies, so nobody can be that sure how good PBE actually is. But from what we know, the SCAN functional should perform better than even PBE with the Materials Project corrections, so that's the main thing. Of course SCAN geometries are also more accurate, but we don't do SCAN geometry optimizations because they are difficult to converge, so that's why we
chose PBEsol, which has similarly accurate geometries — it's kind of a cheap way to get there. What we are doing right now is transfer learning again: we have the pre-trained models, which is a big advantage, and we also have the code online — although it's more one of these PhD-student codes, not the work of Facebook research engineers — and we are transfer-learning to the PBEsol and SCAN data.

Okay, thank you. Don't knock the student codes — we've all learned a lot from them. Any more questions? Oh yes, Zoom — actually it's not a Zoom question but my own: now that you've studied so many materials and you've got the distances to the convex hull, what are you doing with the results?

Well, for the last roughly 100,000 DFT calculations we only finished reading the output files yesterday, so we haven't done anything yet. But of course we'll do a lot of statistics, to see for example what the distributions are for the stable compounds and so on in this ternary compound space, especially because it isn't really known that well yet. Then, for the stable ones, we will calculate some basic properties with higher quality — whether they are ferromagnetic, what the band gap is, and so on — and then we will obviously publish the data, advertise it, and hope that people find it useful. And of course we will try to contact as many experimentalists as possible to synthesize some of these materials, especially new material groups that haven't been found so far. We had a few experimentalists here, so if any are interested in completely new material groups and in synthesizing them, please talk to us — we would be super happy.

Okay, thank you. Actually, I had a question along very similar lines. You've created a gigantic database — this is very helpful — and you seem to be searching for stable materials of many different compositions within it, but then
your title was about thermoelectrics, right? So I'm guessing you're scanning — oh, your title was just thermodynamics? Okay, sorry. I was just wondering whether there's any particular class of functionality you're interested in, or whether you're planning, for the stable materials, just to compute a range of potentially useful functional properties so that many people can use them in different ways.

Right — right now we see that there may be needs for some materials in the near future, the next ten years, with everything that's happening, so we thought this is the most efficient way, and these are all low-hanging fruits: these are the known prototypes. If we can find maybe 50,000 structures close to stability in a year or two, I hope there is something useful among them. I'm not really an expert on the materials science, but there are a lot of people who are, and maybe they can find something useful.

Great. One thing I wanted to add: in the last year or two there have been quite a lot of works on synthesizability, on predicting synthesizability with machine learning. I think this is a step beyond finding which ones are energetically stable — from those, some works then try to predict which ones have synthesis routes — so you could also add this on top of what you're doing now, to boost it even more.

Other people who already know what they are doing there can just take the data, but for sure there are a lot of things that can still be done. Right now we can maybe tell experimentalists that one in four of the candidates they try might be stable, and this can definitely help a lot.

Okay, I see one last question. Yes, let's have it.

Thanks — a lot of really interesting stuff. It's not really a question, more a suggestion, or a thing I'd be really interested to see.
I don't know if you know Chris Wolverton of the OQMD — they also have a database with a lot of convex hulls like this. He did this really cool thing where he computed the convex hull for the entire periodic table, one hull for all the compounds. So if you do this, and then maybe use some reliable tools to calculate the intrinsic dimensionality, maybe we can learn what the intrinsic dimension of all the solids known so far is. If you think you can do it, I'd be super interested to know what that number is.

Okay — sorry, I'm not entirely sure yet where you're going, but I think we can definitely discuss it in the break afterwards, because it sounds very interesting.

Thanks. Thank you, Jonathan, again for your presentation, and now let's welcome our next speaker. Our next speaker is Felix Arant, from the group of Marek Sierka at Jena, and he'll be talking about machine learning applied to glass materials. Thank you very much.

Yes, good morning. My talk will be about the evaluation of descriptors for property predictions of glasses by means of machine learning.