All right. My name is Dr. Chodera, and I'm at the Sloan Kettering Institute, which is the basic science arm of Memorial Sloan Kettering Cancer Center. I was asked to talk about what we might be able to do, and to introduce the speakers here who are going to tell us much more about their concrete ideas for streamlining molecular simulation data. It's quite a broad topic. Just to give you a bit of background about myself: my lab does a lot of molecular simulation of all kinds. We try to develop quantitatively accurate models, basically for small molecule design, for predicting how molecules bind biomolecular targets, as well as for how mutations in patient populations actually translate into drug sensitivity or resistance, and for anticipating drug resistance. We also simulate other things like nanoparticles. We have a giant automated wet lab, a big 24/7 robot, so we work a lot on how to automate and learn from biophysical experiments. And there are a lot of other cool things we're getting into, in terms of simulating protein-protein interactions. We're also a member of the Folding@home Consortium, so we generate very large data sets and would love to share them, but we just need a way to enable this. So what I'm going to tell you about, at least in this introductory part, is that there are some significant challenges facing our field. I think everybody has an idea of what these are, but I want to explicitly name some of them, just to get us thinking about what technological solutions or community solutions we could come up with to try to overcome some of these challenges. Now, we talked a lot about interoperability, right? I grew up in the AMBER community, and it's very important for the CHARMM community too.
There are stories about how, for example, someone in Charlie Brooks's group would develop a new GB model, and then a graduate student next door in David Case's group would spend an entire graduate career trying to implement it in AMBER. We want to avoid that kind of balkanization of these communities, where people can't easily contribute from one package to another. Also, in doing things like predictive modeling of protein-ligand binding affinities, we run blind challenges all the time. D3R is one, run by Mike Gilson and Rommie Amaro, where they take data from pharmaceutical companies and keep it blinded until the point where people have made their predictions, and then they see how well everyone did at predicting protein-ligand affinities. SAMPL looks at simpler protein-ligand systems and also at physical property prediction for drug discovery. But it's really hard for us to compare performance, because nobody uses standard data sets except in these blind challenges. Somebody will publish their own method and compare it on a particular data set; nobody else uses the same data set, so we don't know which method is better unless we go to a predictive challenge. Unfortunately, most of these predictive challenges test things that are not the methodology or the force field. They test unrelated choices that people have made in putting things together. When you're trying to set up a free energy calculation, you have to make a ton of decisions, and everybody makes these decisions differently, for different reasons. We hope that there might be some best practices, and that's what the Living Journal (LiveCoMS) is trying to establish, in certain domains at least. But still, most of the time we end up testing which of these incidental decisions you made differently, rather than whether your method is better than another method, when what we're really trying to evaluate is the technology.
We'd like to evaluate which technology is better at navigating this very complex design problem, but we end up mostly evaluating the driver. If somebody sets up the wrong protonation state for a ligand and gets a really weird ligand binding affinity because of that, that's usually what we see in these challenges, not one free energy method being better than another free energy method. So we need some way to separate the skill of the driver from the actual technologies. The other thing is that people like me, who like to focus on algorithmic approaches for building better tools for predicting binding free energies or binding modes, or for designing small molecules, would really like to focus our creative efforts on one particular part of this whole pipeline, but we end up having to worry about literally everything on the previous slide just to set up our molecular systems. That's ameliorated to some degree by things like MemProtMD, which we're going to hear about in just a bit, where all of these best-practice setup steps have been done for you, but there are still lots of other choices one has to worry about. Industry also shows up at these blind challenge prediction meetings like D3R over and over, because they want to learn what the best practices are. They don't have time to do comprehensive evaluations themselves. They would like to figure out who is performing best, consistently, at predicting protein-ligand interactions, and how to implement those procedures within their own organizations, but that's really difficult if there's no clear way of describing what those best practices are and actually reproducing them in another laboratory. So, for example, suppose you're going to set up a free energy calculation in GROMACS; actually, this one is just a standard molecular simulation where you're computing the RMS deviation.
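As a reminder of what that final analysis step computes, here is a minimal pure-Python sketch of RMSD between two conformations. It performs no superposition or alignment, and the coordinates are made up; a real workflow would use a tool like GROMACS or MDTraj for this.

```python
# Toy RMSD between two conformations of equal atom count.
# No alignment step; coordinates are illustrative only.
import math

def rmsd(ref, conf):
    """Root-mean-square deviation between two lists of (x, y, z) tuples."""
    if len(ref) != len(conf):
        raise ValueError("conformations must have the same atom count")
    sq = sum((a - b) ** 2
             for r, c in zip(ref, conf)
             for a, b in zip(r, c))
    return math.sqrt(sq / len(ref))

reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
displaced = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]  # shifted 1 unit in z
print(rmsd(reference, displaced))  # → 1.0
```

The point is that even this trivially simple analysis sits at the end of a long chain of setup choices, every one of which affects the trajectory you feed into it.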
These are all of the stages you have to go through, in general, to get from one end to the other. If you're interested in doing science in that one little box, you still have to do all of this other stuff in order to get there, and every choice you make along the way is going to impact your results. So it's really quite frustrating if you're trying to focus on one part, and want to leave everything else to community best practices, to people who know all of those steps a lot better than you; it's quite difficult to have that kind of focus. We've already spoken about supporting reproducibility, but really what we'd love to do is build on the work of others. We'd like to take some great idea that somebody had, some baseline method, where we think we have a better idea for how to improve one part, implement it, start by reproducing that work, and carry it forward. But I've had enormous trouble trying to reproduce the work that other people publish, in nearly all cases. And if we can't even reproduce that work, it's really hard to use it as a starting point for going forward. Instead, you usually just take a completely different approach and make shortcuts everywhere else. Translating the best performers from the blind challenges into real production pipelines that you can use in industry is almost impossible for the same reason. Deployment is actually quite difficult. Everybody here has tried very hard to make their software as easy to use and as easy to install as possible; a lot of people are using conda. But even though our software is installable in one line with conda, at one point we had to fly a postdoc out every three months just to make sure it was working on a partner's batch queue system. That sort of thing is what we'd like to avoid. So training is also a huge problem.
Let's see: there's a big problem in the United States because all the baby boomers are retiring from pharma, so there are a lot of positions opening up over the next few years in computational modeling. But training the next generation of computational chemists is actually quite difficult, because we don't even know how to articulate what the best practices are in a way that makes it easy for someone to learn what they are and why. So it's quite difficult. Funding is also very difficult, both from industry and from federal agencies that would like to give money to academics, because they'd like to see something for their investment. A research paper is not a usable product; they would like to invest in a tool that other people can use. The problem is that if nobody else can use it, then it's really not a worthwhile investment, and both program officers from federal agencies and people from industry have said time after time that this is really a problem. They don't like seeing their funding dollars wasted this way. Now, this is not a workshop about workflows, but workflows are one kind of solution to all of these problems. They address the interoperability and evaluation issues, and they allow us to take the same workflow and use it again on other kinds of data. But to make those sorts of ideas work (and there are other workshops focused on this) we really need standards, common data models, that allow us to move data through the different parts of a simulation, through that entire flow I showed you for the GROMACS workflow, for example. So if you have a modeling tool you want to use, whether it lives in a Docker container or is just bits of a pipeline glued together, you want to be able to take data sets and push them through standard best-practice preparation pipelines, into your modeling tool, and then through automated data analysis tools.
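That push-data-through-standard-stages idea can be sketched abstractly: each stage is a function from one common data model to the next, so any individual stage can be swapped out without touching the others. All stage names and the dict "data model" below are hypothetical, purely to illustrate the composition.

```python
# Sketch of a staged pipeline: preparation -> modeling tool -> analysis.
# Stages communicate only through a shared data representation (here a
# plain dict), so each stage can be replaced independently.
def prepare(raw):
    """Stand-in for best-practice preparation (solvation, protonation, ...)."""
    return {**raw, "prepared": True}

def model(system):
    """Stand-in for the modeling tool under study, e.g. a free energy method."""
    return {**system, "estimate": 42.0}

def analyze(results):
    """Stand-in for automated analysis applied uniformly to every tool."""
    return results["estimate"]

def run_pipeline(raw, stages=(prepare, model, analyze)):
    data = raw
    for stage in stages:
        data = stage(data)
    return data

print(run_pipeline({"name": "ligand-1"}))  # → 42.0
```

Because the stages only agree on the data model, you could benchmark many `model` implementations against the same `prepare` and `analyze`, which is exactly the separation of driver skill from technology discussed above.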
But to do that, you really need standardized data models to let you move data through each of these stages. For the preparation pipeline, if we did have ways of pushing data back and forth in a standard way, we could actually try a bunch of different approaches and figure out which ones perform consistently best across a suite of modeling tools, so we could discover what the best practices are and have a good body of evidence for why they are best practices. But we need a lot of different kinds of standardized common data models in order to actually be able to do this. For a biomolecular system, we need to replace the aging PDB format. We need to handle things like parameter sets, with robust, open readers and writers that everybody can share. We need better ways of describing small molecules, which we already talked about a little earlier, and of sending them over the internet. For output data, we were just talking about how trajectory data needs to move through; but any computed physical properties might also need to be carried along in some way. If you're looking at things like structures or clusters, those might need some way to be expressed as well. How do we quantify uncertainty and then propagate it through? That's also very important. And how we assess anything is also a big question. So all of these, in principle, need some way of defining a common data model between the different tools that data might move back and forth through. And there are different granularities of interoperability at this level.
You could have, for example, free energy tools importing and opening things next to GROMACS and sharing potential definitions, or sharing different parts of the tools, where a common data model lets us move things back and forth at the Python level. Or it could be at a much coarser-grained level, where you say: my tool lives in a Docker container with a little Python driver that advertises its capabilities and lets me pipe data in and out of the container, and it operates at that coarse-grained scale. And maybe there's a DOI-resolvable repository of those containers that goes along with my paper. So we can think of anything along this scale as helping us facilitate and really streamline our ability to move data from one tool to another throughout the whole simulation lifetime. There's a particular challenge I wanted to highlight, which is: how do we describe what it is we're simulating? I mentioned this a little earlier. Just describing what's in my simulation so somebody can search it later is important, but describing it to the force field so that it can apply parameters turns out to be super important too. I'll mention the Open Force Field Initiative in just a second. But here is what this might look like to a biologist. We express human Abl kinase T315I, isoform 1a, residues 242 through 493 (this is the UniProt identifier), fused to an N-terminal His-TEV tag that was cleaved with TEV protease, which leaves a small N-terminal overhang. It was incubated at high concentrations, so there are autophosphorylations, and you need to know which site was phosphorylated. Assays were run in 100 microliters of 1 micromolar kinase in an assay buffer, which is 20 millimolar Tris buffer at pH 8 with 50 millimolar NaCl; that's the actual one we used for assays. To this we added ligand at 100 nanomolar from a 10 millimolar DMSO stock. So you need to know what protonation states and what tautomers are relevant for that.
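One way to imagine that prose turned into something searchable is a structured record like the toy sketch below. Every field name is invented for illustration (including the example UniProt accession); the point is only that the construct, its modifications, and the buffer conditions become machine-readable data instead of free text.

```python
# Toy structured record for an experiment description. All field names
# are hypothetical; this is a sketch, not any existing standard.
from dataclasses import dataclass, asdict

@dataclass
class ProteinConstruct:
    uniprot_id: str
    mutations: tuple      # e.g. ("T315I",)
    residue_range: tuple  # (first, last) residue numbers
    phosphosites: tuple   # sites to be modeled as phosphorylated

@dataclass
class AssayConditions:
    ph: float
    buffer_mM: dict       # component name -> concentration in millimolar

record = {
    "protein": asdict(ProteinConstruct("P00519", ("T315I",), (242, 493), ())),
    "assay": asdict(AssayConditions(8.0, {"Tris": 20, "NaCl": 50})),
}
print(record["assay"]["ph"])  # → 8.0
```

A record like this could be indexed and searched, and a setup pipeline could, in principle, read the mutations, phosphosites, and pH directly when deciding what to parameterize.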
So all of this has to get turned into some sort of structured file that the force field can actually act on, to figure out what is actually being parameterized and which parameters to assign. It's a non-trivial process to even describe this, or to come up with some scheme for translating it into something structured that we can share and search. There are some things to help us. I just mentioned SMILES and InChI. InChI was developed by NIST to describe their mass spectrometry database, and it looks like this, for example. Sorry, more like this; but you can also have mixture identifiers that say how you're combining components in different ways. You can describe a 25:24:1 phenol:chloroform:isoamyl alcohol mixture in a Tris buffer using this kind of description. UniProt is kind of the standard place where you can get biopolymer information for proteins. There's also ISO 11238, which is used by the US Food and Drug Administration in their GSRS, the Global Substance Registration System, which describes what's in a pill. The Molecular Sciences Software Institute has worked with us as an intermediary for something called the Open Force Field Consortium, which is a group of interested industry partners that has signed on to sponsor the production of open source tools and open force fields, which is super cool. And this legal model can be used by other groups as well: other academic groups that would like to get money from industry can see me or Daniel about how we set up the legal infrastructure to take multiple companies from around the world and funnel their funds into open source software. But the cool thing is that we're trying to define some data models. We don't necessarily want them to be standards, though they could be if other people are interested; we just needed some way of representing things like: what is in your simulation system?
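A mixture like that 25:24:1 phenol:chloroform:isoamyl alcohol example could be captured in a toy structured form like the sketch below; this is an illustration only, not the mixture-InChI notation or any existing standard.

```python
# Toy mixture representation: relative volume parts normalized to exact
# fractions, so two descriptions of the same mixture compare equal.
from fractions import Fraction

def mixture(parts):
    """parts: {component name: relative volume part} -> volume fractions."""
    total = sum(parts.values())
    return {name: Fraction(p, total) for name, p in parts.items()}

pci = mixture({"phenol": 25, "chloroform": 24, "isoamyl alcohol": 1})
print(pci["phenol"])           # → 1/2
print(sum(pci.values()) == 1)  # → True
```

Using exact fractions rather than floats means 25:24:1 and 50:48:2 produce identical records, which is the kind of canonicalization a shared identifier scheme needs.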
What molecular species are present, described in a way that can actually work across different cheminformatics toolkits? We're also working on automated benchmarking against biophysical data sets. The question is how you represent a biophysical experiment, and what it was performed on, in a standard computable way. There are some standards, like NIST's ThermoML, which is an IUPAC standard for describing thermophysical properties of mixtures, but others still need to be developed for biophysical experiments. And then we have what we think is a really good contribution on how to define how force field parameters should be assigned to biomolecular systems, and you can read all about that here. So with that, I want to stop and introduce the two speakers who will follow me. Phil Stansfeld, whom it was my pleasure to meet this weekend, comes from Oxford and has done some really cool things automating best practices for simulation preparation with MemProtMD. We're using it in our own lab right now to grab a bunch of structures to simulate the permeation of small-molecule antimicrobials through porins. And then you'll hear a bit from Christopher Woods, who's been doing some great work with Sire, and now with BioSimSpace, to really show what the future could be like: how simple it could be to do molecular simulations, for example in Jupyter notebooks, where different modules move things back and forth through the cloud to do these computations in a very facile manner. That will give you an idea of what might be possible and how we can move forward. All right, so without further ado, I'll let Phil come up.