I'm going to be speaking about the framework we've been building, which we've already heard a fair amount about in the earlier talks. It's really this property estimation framework which forms one of the core pieces of both our assessment and our optimisation pipelines. The real goal was to create an automated, scalable framework that can estimate physical properties for us.

More than just having a tool that can extract the data, we want it to almost be able to understand the data: to know what a density is, to know what a solvation energy is, or at least to automatically know what calculation steps would be required to estimate these different properties. And if the framework doesn't know how to estimate a particular physical property, we want everything to be extensible, so that anyone can come along and plug in their own definition of how a physical property should be estimated, if they think it may help in the optimisation or assessment cycle.

And it's not just about defining or knowing how to estimate these physical properties by running simulations; we actually want to go a bit further than that. If we're using a gradient-based optimisation to do our force field fitting, we maybe do 10 or 20 evaluations of the objective function. For what we're looking to do in the future, at least with more sophisticated techniques such as Monte Carlo based sampling and Bayesian fitting, you typically need on the order of hundreds or thousands of evaluations of the objective function. If you're having to do that many, you can't just go and run simulations at every step; you need something a bit more sophisticated than that. So this framework isn't just about running simulations to get at physical properties, it's about estimating them from simulations and from cached simulation data, and I'll talk about that a little more as well.

In the same vein of performance improvement, we also want this framework to be entirely scalable. In particular, we want to be able to scale up the size of the data set that we want to fit against, but also to scale up with the availability of compute resources as demand grows and grows.

So, maybe a spoiler, but this is kind of the real result of the work, and if nothing else the take-home message is: we can do this, and it works quite well. Just to stress again, because I think it has been mentioned before, this framework touches the whole pipeline. It's crucial in helping us curate our physical property data sets and extract them from the raw data. It's also what we run first in terms of our assessment, with a human driving the framework, but it also sits in the optimisation cycle, integrating into whichever optimisation engine we're trying to use through a fairly nice Python API. I won't say too much about this part because I think it has already been covered in more detail.
But essentially, the first release has been focused on trying to pull in and curate data from the ThermoML archive, which contains a vast number of thermodynamic properties: hundreds of thousands of densities, tens of thousands of mixture properties. So the first thing I did in building the framework was to build the utilities to go into ThermoML and pull all of this data down. For the sake of time I won't talk too much about this, but essentially we've got this Python API, this ThermoML data set object. You just give it the DOIs of the entries you want to pull from the ThermoML archive, or a list of URLs, or a list of file paths; it will go and understand the data format, pull all of that data into a Python object, and do some basic filtering by temperature ranges, pressure ranges, and so on and so forth. This object is what we've used in the first few rounds of filtering to build the data set that we want to optimise against. For anyone interested in how we picked the data set, the link can't quite be seen here, but we use this object in the data selection repository that we have on GitHub, which details how we used this API to build our data sets automatically.

The key aspect of the framework, though, is really the estimation itself. With scalability in mind, we chose to design the framework around what we call a client-server architecture. The basic idea is that you, or rather Python on your laptop, can use our API to pull down the physical property data set into a Python object and to load in the force field parameters you want to assess. You then create an object we call the property estimator client. The client takes the data set, takes the force field parameters, and translates them into a tangible set of calculations: it translates, say, a density into a workflow graph, a set of workflow steps. The framework has a number of built-in definitions for how to convert something like a density into a workflow graph, but it's at this point that you can plug in your own definition, either if you want to change how we internally choose to estimate these properties, or if you want to extend it with your own property type.

Once this client object has turned your properties of interest into workflow graphs, it's ready to connect to what we call the property estimator server. The server is just a program that runs on whatever computer you've got available, maybe a big compute cluster, maybe something more modest. It just sits on that machine and waits for the client to send it those workflow graphs. The server then uses a library to take those graphs and execute them across the available resources: it does the calculations and computes the properties, as well as their gradients with respect to the force field parameters. One of the key things the server does, as well as launching the calculations, is store all of the data that was generated by them. In particular, the server is continuously storing the uncorrelated configurations generated from any simulation that we've run, and that is the key to what allows us to become more efficient in the future.
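As a rough illustration of the data curation step just described, here is a minimal sketch of pulling data from the ThermoML archive and filtering it. The import path and method names (ThermoMLDataSet, from_doi, filter_by_temperature, filter_by_pressure) are assumptions chosen to mirror the behaviour described in the talk, not necessarily the exact released API.

```python
# Minimal sketch: pull ThermoML entries into a data set object and filter it.
# All names below are assumed for illustration; the real API may differ and
# may expect explicit units on the temperature and pressure bounds.
from propertyestimator.datasets import ThermoMLDataSet  # assumed import path

# Build a data set from the DOIs of ThermoML archive entries (placeholders here),
# or alternatively from URLs or local file paths.
data_set = ThermoMLDataSet.from_doi("10.xxxx/placeholder.entry.1",
                                    "10.xxxx/placeholder.entry.2")

# Basic curation: keep only state points in the ranges we care about.
data_set.filter_by_temperature(min_temperature=288.15, max_temperature=318.15)  # K
data_set.filter_by_pressure(min_pressure=0.95e5, max_pressure=1.05e5)           # Pa

print(f"{len(data_set.properties)} physical properties remain after filtering")
```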
To give an idea of what this framework looks like to use: you can essentially estimate an entire physical property data set, containing many different types of properties, in just a few steps. You load in your data set, perhaps using our ThermoML data set objects, you maybe apply some filtering to it, you load in your force field parameters, you create this property estimator client object, and you just give the client the data set and the force field parameters. This automatically spins up a connection to the server, the server does its business and returns the results, and you can query, using this kind of request object, either whether the server has finished, or get up-to-date information on how far along the calculations have progressed. So it's very simple, very easy to use on your laptop, and very easy to plug into the optimisation cycles that we have in mind for the future.

The key thing, and I won't talk too much about this, is that we want the framework to be able to rapidly estimate property data sets, not just by running simulations, but by taking a multi-fidelity approach to estimating physical properties. Within this framework we've got this notion of different calculation layers, where a calculation layer is some technique that takes a physical property data set and turns it into a set of estimated values, together with an uncertainty in those values. We've got two calculation layers implemented in the framework at the moment. One calculation layer basically takes a physical property data set and runs the simulations needed to estimate those physical properties; we get the gradients out of that as well, and we return those values. As I mentioned, we also use this calculation layer to cache the simulation data that was generated by those simulations.

In addition to this, we have a second calculation layer that uses a technique called simulation reweighting. What reweighting essentially allows you to do is take the results of simulations that you've run in the past and re-process them to evaluate what the physical property would have been had you run that simulation under a slightly different set of conditions. In particular, if you run a simulation with one set of force field parameters, reweighting allows you to re-evaluate your observable at a set of force field parameters very close by. The methodologies we employ to do the reweighting not only give you a value for the physical property re-evaluated with the new force field parameters, but also a measure of confidence in how well you've been able to do this, and whether you've moved too far from that initial state, at which point the method begins to break down. So we have these two layers: one that runs the simulations, and one that takes the results of those simulations and applies reweighting.

Why do we do this? Simulations are quite slow. They can take tens of minutes to hours to days, depending on the kind of property you want to calculate, and some properties are especially expensive to compute. Reweighting, this technique that lets you reuse cached simulation data, is literally just a re-processing step.
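To make the client-side flow above concrete, here is a minimal sketch of requesting an estimate from a running server. The class and method names (PropertyEstimatorClient, request_estimate, the request object's results method) and the force field file name are assumptions used for illustration; the released API may differ in detail.

```python
# Minimal sketch of the client-side workflow: data set + force field in,
# estimated properties back from the server. Names are illustrative assumptions.
from propertyestimator.client import PropertyEstimatorClient       # assumed import
from openforcefield.typing.engines.smirnoff import ForceField      # assumed import

force_field = ForceField("smirnoff99Frosst.offxml")                # assumed file name

# Connect to a property estimator server running locally or on a remote machine.
client = PropertyEstimatorClient(server_address="localhost", port=8000)

# Hand over the curated data set and the force field parameters to assess.
request = client.request_estimate(data_set, force_field)

# Poll the request object: check how far along the calculations are,
# or block until everything has been estimated.
results = request.results(synchronous=False)
print(results)
```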
You take a set of uncorrelated samples and re-evaluate them; it's essentially just re-processing, and you don't have to run another simulation. So it's orders of magnitude faster to estimate properties using reweighting compared to running the simulations themselves. This is really where we start to see performance gains: when we estimate physical properties using parameters close to those with which we originally generated the simulation data, which is exactly what happens during an optimisation. And just to mention, while we have these two calculation layers implemented at the moment, we plan to expand this in the future.

To give a more detailed picture of how these calculation layers fit together, it really comes down to how quickly our server object can get through these calculations, and with what confidence the different layers can estimate the properties. Imagine that you've not run any simulations before, you've done no calculations, you've got no cached data; there's not much you can do except go off and run your simulations and evaluate the objective function that way. But you then cache the simulation data from those initial simulations. Now imagine it was some optimisation engine that initially requested these properties be estimated using a given set of force field parameters, and the optimisation engine then makes a small perturbation to those force field parameters and asks to re-evaluate those physical properties. Our server object will receive that request, see that it now has cached data on disk, automatically know how to load it up and deploy this reweighting method, and check whether it is able to calculate those physical properties with sufficient confidence using the new force field parameters, at a speed that's orders of magnitude faster than re-running the simulations. If the answer is yes, perfect, it just returns the result back. It will automatically know when it can use a faster technique and when it needs to fall back and relaunch a new set of simulations to generate a new set of cached simulation data. The idea is that we reweight if we can; if not, we launch a new set of simulations, the optimiser makes another change to the parameters, and eventually we get into a region of confidence where the reweighting technique is able to reweight the cached data rather than running any new simulations.

So this is what we've got built in at the moment. We haven't been able to build much beyond this for release one, due to a number of slight technical difficulties. But one of the things we do want to look into is not only simulation and reweighting layers, but actually starting to train things like surrogate models or neural networks on this cached simulation data, to start learning the response surface: how our objective function, how our different physical properties, change as we change our force field parameters. We think that early on this isn't going to help us much, when we start our optimisation far from the minimum. But as the optimisation engine proceeds and the parameters become more stable, we think we can train a surrogate model around that point to learn the response, and it should be able to be evaluated in seconds.
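As a rough illustration of the reweighting layer, the sketch below re-estimates an observable at perturbed force field parameters from cached, uncorrelated samples, and uses an effective sample size as the confidence check that decides whether to fall back to new simulations. The framework itself uses more sophisticated (MBAR-style) estimators; this only shows the principle, and the function name and threshold are made up.

```python
import numpy as np

def reweight_observable(observable, u_old, u_new, min_effective_samples=50):
    """Re-estimate <observable> at a perturbed set of force field parameters.

    observable : observable value for each cached, uncorrelated frame
    u_old      : reduced potential energy of each frame with the original parameters
    u_new      : reduced potential energy of the same frames with the new parameters
    """
    log_weights = -(u_new - u_old)
    log_weights -= np.max(log_weights)        # guard against overflow
    weights = np.exp(log_weights)
    weights /= weights.sum()

    # Effective number of samples: if the new parameters are too far from the
    # old ones, a few frames dominate and the estimate cannot be trusted.
    n_effective = 1.0 / np.sum(weights ** 2)
    if n_effective < min_effective_samples:
        return None, n_effective              # caller falls back to new simulations

    return float(np.sum(weights * observable)), n_effective
```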
So where a simulation is going to take days, reweighting is going to take ten seconds or so, and a surrogate model is going to take literally seconds or milliseconds. It's this multi-fidelity approach that will allow us to go from an optimisation engine that maybe needs 10 or 20 iterations to one that maybe needs hundreds, thousands, or tens of thousands of iterations, and this is really the core power of the framework we've been building: this idea that you can use multi-fidelity sampling, different calculation layers, each one trying to reuse as much data as possible to estimate these physical properties in the fastest way possible.

In terms of how we actually estimate the properties themselves, whether by simulation or by reweighting, within the property estimator framework we've built what I'd really describe as a lightweight workflow engine. The real focus of this framework is on building a set of reusable workflow components which can be chained together to produce quite powerful workflows that estimate physical properties. In particular, we've got protocols that build coordinates from a set of SMILES strings, protocols that apply force field parameters, protocols that run energy minimisations and simulations using OpenMM, and protocols for things like YANK and attach-pull-release host-guest binding calculations. So really, the main power of this framework is that you've got these building-block components which do individual tasks and which, when combined into larger workflows, allow us to estimate almost any physical property without having to do any coding; it's literally just sticking building blocks together.

As for what these look like, these protocols are really quite simplistic Python objects. They just look like a Python class that has a number of settable inputs, a number of gettable outputs, and a method called execute. There's really nothing to them; each one is literally just like a Python script that performs some specific task. The idea is that each of these protocols has inputs, and these inputs can be set to constant values. So we may have a protocol that runs a simulation, and some of its inputs, say the number of steps, you might just set arbitrarily as constants. But the protocol might also need, say, an object describing the system it's going to simulate, and that input can instead come from the output of another protocol, for example the protocol that built the system in the first place. This idea of individual protocols, each responsible for one small task, which you can arbitrarily stick together using either constant values or the outputs of other protocols, is unbelievably flexible and allows us to compute really quite wildly different physical properties just by reusing these same protocols.

To give an example of how we do this for a density, it's not particularly exciting, but we essentially use one of these protocols to build the coordinates, and we use a protocol that assigns the force field parameters.
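As a toy sketch of what such a protocol looks like conceptually, the example below defines two illustrative protocols and wires the output of one into the input of the other. The class names and the tuple used as a protocol-path reference are invented for illustration; they are not the framework's actual classes.

```python
# Conceptual sketch only: a protocol is a small object with declared inputs,
# declared outputs, and an execute() method. Everything here is hypothetical.
class Protocol:
    def __init__(self, protocol_id):
        self.id = protocol_id

    def execute(self, directory):
        raise NotImplementedError

class BuildCoordinates(Protocol):
    smiles = None            # input, set to a constant value by the user
    coordinate_file = None   # output

    def execute(self, directory):
        # ... call out to an external tool to build a solvated box ...
        self.coordinate_file = f"{directory}/input.pdb"

class RunSimulation(Protocol):
    steps = 500_000          # input set as a constant
    coordinate_file = None   # input taken from another protocol's output
    trajectory = None        # output

    def execute(self, directory):
        # ... hand the coordinates to an MD engine and run ...
        self.trajectory = f"{directory}/trajectory.dcd"

# Wiring protocols together: the simulation's input points at the builder's output.
build = BuildCoordinates("build_coordinates")
build.smiles = ["O", "CCO"]                                           # constant input

simulate = RunSimulation("production_run")
simulate.coordinate_file = ("build_coordinates", "coordinate_file")   # reference to another protocol's output
```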
One of the nice things we have, represented by the box in this figure, is this notion of what we call a conditional execution group. That maybe sounds daunting, but essentially what it allows us to do is take a group of protocols and run them again and again and again until some specific criterion has been met. In the case of our framework, if we're simulating a system we want to compute the property to within a certain uncertainty, ideally roughly on par with the experimental uncertainty reported with the measurement. So we just take the protocols that run the simulation, wrap them in one of these conditional groups, and they automatically run again and again until their uncertainty falls within our target uncertainty. And it's not just simulations: the reweighting technique is implemented using this kind of workflow graph as well, so it's really quite a flexible system.

This may be quite apparent already, but the nice thing about defining how we compute these properties in such a granular way is that almost any of these steps can easily be taken out and replaced. For example, maybe we don't want to apply our own force field parameters, maybe we want to apply a different force field instead; without changing much of the definition of how we estimate these properties, we just take that parameter-assignment protocol, strip it out, and replace it with one that applies the other parameters to the system instead. Maybe we don't want to use OpenMM: we just pull that block out of the workflow and stick in one that runs a different engine instead. So it's very flexible and modular.

The other nice thing is that it allows us to automatically determine where we're doing redundant work. If you imagine that you're asking for a density to be calculated and a dielectric to be calculated, the steps needed to estimate those two properties are essentially identical: set up the system, run a simulation, and extract the property; the only difference is which observable you extract at the end. So the framework will automatically look at the workflows being used to estimate these different types of properties, spot where you're doing redundant calculations, and merge those together, so that you're doing the least amount of work and not using more compute resources than you need to.

One of the key things I want to stress about what we're doing with this framework is that we're not trying to build yet another workflow engine, and we're not trying to reinvent the wheel, but to reuse things that already exist. What we're focusing on is building the individual protocol building blocks, which are quite generic, which can be used by the very lightweight workflow engine we've built, or which could be plugged into any other available workflow engine. Instead of writing a complicated engine that takes our workflow graphs and tries to execute them, we just use a library called Dask: basically, with Dask you say here are the protocols I want to execute, and it goes off and executes them on your compute resources.
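A conceptual sketch of the conditional execution group idea described above: keep re-running a group of protocols until the estimated uncertainty drops below a target. The function and its arguments are hypothetical; they just capture the looping logic.

```python
# Conceptual sketch only: repeat a group of protocols (e.g. extend a simulation
# and re-analyse it) until the property's uncertainty meets the target value.
def run_until_converged(protocol_group, target_uncertainty, max_iterations=20):
    for iteration in range(max_iterations):
        value, uncertainty = protocol_group.execute()   # run / extend, then re-analyse

        if uncertainty <= target_uncertainty:
            return value, uncertainty                   # condition met, stop looping

    raise RuntimeError("Property did not converge to the target uncertainty")
```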
I won't say too much about this, but a nice feature of a companion library, called Dask-Jobqueue, is that it integrates with our local queueing system, and in principle with any queueing system, inserting jobs into the queue for you based on the workload you've got, without you having to submit them yourself. So if you submit 40 calculations, this job queue library will insert 40 jobs into the cluster, and as those jobs finish it will scale back down. Without going into too much detail, this allows us to make really quite powerful use of the cluster, and to request more GPUs and more resources than we otherwise could if we were manually submitting these things.

So that's a very fly-by tour of the property estimator framework. As with all our stuff, it's available to download on GitHub, as is the repository we've been using to curate our data with the utilities that come in the framework itself. If nothing else, the take-home message is that we have this framework, it's incredibly flexible, and you can define how you want to estimate your different physical properties just by combining together these flexible workflow components. What we really want to do, and what my personal goal for this framework was, is to abstract ourselves away from the simulations themselves and allow us to focus on the scientific questions we want to answer. We want it to be absolutely trivial for anyone to come along and say, I want to assess my force field against enthalpies of mixing, and, without having to rewrite a whole new set of components, simply plug a definition of what an enthalpy of mixing is into the framework. So really, abstracting away the tedious parts and just taking advantage of the framework.

In terms of what comes next: we've already used the framework as part of the optimisation cycle, and we're now planning to use it as part of the assessment cycle as well. One path we want to expand into is extending the tools we've built so that they not only extract data but also filter it in ways that let us ask more interesting questions. It's not just filtering by temperatures and pressures: can we build filters that select a data set by chemical diversity, or filters that automatically identify which regions of chemical space our data set is lacking in, which would essentially help us identify what data we're missing and which gaps we need to fill?

The other thing we want to do is continue to expand the number of properties we have built-in support for. Currently the framework can do densities, dielectrics, enthalpies of mixing, and enthalpies of vaporisation, and it has partial support for host-guest binding free energies, both through absolute binding calculations and through attach-pull-release calculations; we're just finishing off implementing that and getting ready to use it as part of the assessment phase.
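For the Dask-Jobqueue point mentioned above, here is a minimal sketch of adaptively scaling workers on an HPC queueing system; Dask-Jobqueue provides cluster classes such as LSFCluster that submit and retire queue jobs for you. The queue name, resource sizes, and the run_protocol helper are placeholder assumptions.

```python
# Minimal sketch of adaptive scaling with Dask-Jobqueue. The specific queue,
# resources and the run_protocol placeholder are assumptions for illustration.
from dask.distributed import Client
from dask_jobqueue import LSFCluster   # SLURMCluster / PBSCluster also exist

def run_protocol(protocol):
    # Placeholder for executing one workflow protocol on a worker.
    return f"finished {protocol}"

pending_protocols = ["build_coordinates", "assign_parameters", "production_run"]

cluster = LSFCluster(
    queue="gpuqueue",      # placeholder queue name
    cores=1,
    memory="4 GB",
    walltime="04:00",
)

# Let Dask grow and shrink the number of queued jobs with the current workload,
# rather than submitting and cancelling jobs by hand.
cluster.adapt(minimum=0, maximum=40)

client = Client(cluster)
futures = client.map(run_protocol, pending_protocols)
results = client.gather(futures)
```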
In addition to that, the last two things that will really make this powerful going forward, and make it more useful for optimisation, are these. First, this notion of employing surrogate models as one of our additional calculation layers: a surrogate model trained on our cached simulation data would essentially allow us to evaluate our physical properties really, really rapidly, and would let us use more sophisticated optimisation techniques than are currently available to us. And of course, additionally, we'd love to improve the ability of the framework to scale across different resources. Currently it can run calculations on a single cluster, but we'd love to be able to scale across multiple clusters and up into the cloud, and that's another big engineering piece that will allow us to scale our calculations in the future. Thank you.

How well can you actually fit to all the properties right now? Are you trying to fit to all the properties simultaneously, or do you fit to one, then try to fit to the next one, and then go back again? It would be interesting to see how transferable it actually is: fit to one property and then see if it improves a different property.

I think that's right, because they're all somewhat correlated. Absolutely, yeah. But I would say we don't actually know yet how much better we're doing; that will be happening in the next few weeks, and then we can make more quantitative statements of how well it's doing, and that's the testing versus the training.

Let me demonstrate my ignorance by asking what the decorrelation involves and the subsampling in your workflows.

Okay, so we essentially just calculate the autocorrelation function with respect to whatever observable we're trying to compute, and then we basically sub-sample based on the autocorrelation time.

Okay, thanks a lot.
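As a rough sketch of the decorrelation step mentioned in that last answer, the function below estimates the statistical inefficiency from the observable's autocorrelation function and keeps only roughly uncorrelated frames. In practice a library such as pymbar's timeseries module is typically used for this; the function here is just illustrative.

```python
import numpy as np

def subsample_uncorrelated(observable):
    """Return indices of roughly uncorrelated frames of a time series."""
    a = np.asarray(observable, dtype=float)
    a -= a.mean()
    n = len(a)

    # Normalised autocorrelation function, summed until it first becomes negative.
    variance = np.dot(a, a) / n
    g = 1.0                                   # statistical inefficiency g = 1 + 2*sum(acf)
    for lag in range(1, n):
        acf = np.dot(a[:-lag], a[lag:]) / ((n - lag) * variance)
        if acf <= 0.0:
            break
        g += 2.0 * acf * (1.0 - lag / n)

    stride = max(1, int(np.ceil(g)))
    return np.arange(0, n, stride)            # keep every ~g-th frame
```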