Good afternoon everyone, and welcome to the next edition of the BioExcel webinar series. My name is Rossen Apostolov and I will be today's host. Today we have a presentation by Christopher Woods from the University of Bristol, who will tell us about BioSimSpace, a suite of software that lets you connect different simulation codes. Before we start with the main presentation, I have to tell you that this webinar is being recorded, and soon after the end of the webinar, within the next day or two, we will post the recording on the BioExcel website and on our YouTube channel, which you can review and share with your colleagues and friends. For those of you who are not familiar with BioExcel, I would like to give a very short overview. BioExcel is a European Centre of Excellence for Computational Biomolecular Research. We work with several key applications in the area: GROMACS for molecular dynamics simulations, HADDOCK for integrative modelling and docking, and CPMD for hybrid QM/MM calculations. We work on improving the performance, efficiency and scalability of these applications. We also work extensively on improving the usability of different applications and tools used in the life sciences. We work with several key workflow platforms, such as Galaxy and COMPSs, among others, and we develop workflow automation pipelines. We also provide an extensive training programme and consultancy services. What might be of interest to you is that we run several interest groups in different areas, such as integrative modelling, free energy calculations, and biomolecular simulations for entry-level users. You can find more information about them on our website, and we welcome you to visit our support platforms: we have forums at ask.bioexcel.eu, we have an open chat channel, and you can always get in touch with us through the website.
At the end of today's presentation we will have a Q&A session, so during the presentation, at any time, you are welcome to use the questions tab. You can see it in your control panel, where you can write your question, and when we are done I will let you speak directly to Chris and ask your question. If there is a problem with the microphone and we don't have good audio, I can read the question on your behalf. So please feel free to type in your questions as the presentation proceeds. And now, without further ado, I am delighted to present Christopher Woods from the University of Bristol. He manages the Research Software Engineering group at the university, and he is an EPSRC fellow and joint chair of the UK Research Software Engineering Association. He obtained his undergraduate and postgraduate degrees in chemistry from the University of Southampton, where he worked with Professor Jonathan Essex before moving to Bristol, developing software and algorithms for modelling biological molecules and systems. In 2016 he started the RSE group at the University of Bristol within the Advanced Computing Research Centre. On the slide you can find Christopher's contact details, so you can get in touch with him. So thank you, Christopher, and I will let you continue with your main presentation. — Thank you very much, Rossen. I will share my screen now; I hope everyone can see it. — Yes, we see it well. — Perfect. Well, thank you for the kind introduction and for the invitation to come and speak to you today about BioSimSpace. As said, I work at the University of Bristol in the Advanced Computing Research Centre, and if you want to download a copy of these slides, you can use the link at the bottom, which goes to my website, /talks. Now, BioSimSpace — before I explain what it is, I think it's actually easier just to show you.
So I'm going to come across here — this is actually BioSimSpace running in the cloud, and I'm interacting with it in a web browser. So what does it do? Well, it basically enables you to run simulations and calculations without needing detailed knowledge of the underlying packages. So we're going to start with BioSimSpace, and this particular Jupyter notebook is going to run an interactive MD simulation. Here we've imported BioSimSpace. We've now loaded up some molecules. How many have we loaded? Well, we've loaded 631 molecules. Then we're going to define a simulation protocol. For this particular simulation we're going to run an equilibration, so here we've defined an equilibration protocol. And now that we've loaded some molecules and specified a protocol, we can run an MD simulation just by calling BioSimSpace.MD.run(system, protocol). So now we have some MD running. Is it running? Yes, it is. How long has it been running for? At the moment, just 0.16 minutes. But we can interact with this MD simulation as it is going. So what's the total energy at the moment? It's currently 6,500 kcal per mole. We can then format that nicely: here you can see we've done 3.2 picoseconds of dynamics, and it's currently 6,400 kcal per mole. And if we run this again, we've now done 4.4 picoseconds, and we can keep doing this. But it's not enough just to get text output — because we're working in Python land, we can use things like NumPy, matplotlib, et cetera, and just plot these things. So this is basically the temperature and the energy, live, as the simulation is going. And if I rerun this cell, it will get more of the data that's being generated, and we can see we've got some more information coming through. As well as looking at the simulation in terms of the raw data, we can also get 3D views. So we're going to connect a viewer to the simulation now.
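The interactive session just described can be sketched roughly as follows. This is a hedged sketch, not a transcript of the demo: the input file names are hypothetical, and the calls (`BSS.IO.readMolecules`, `BSS.Protocol.Equilibration`, `BSS.MD.run`) follow the BioSimSpace API as documented around the time of this talk — check the current documentation before relying on them. The import is guarded so the sketch degrades gracefully where BioSimSpace is not installed.

```python
# Hedged sketch of the interactive MD session described above.
# Requires BioSimSpace and a supported MD engine; file names are hypothetical.
try:
    import BioSimSpace as BSS
except ImportError:
    BSS = None  # BioSimSpace not installed; the sketch below cannot run.

def run_interactive_equilibration(files=("complex.top", "complex.gro")):
    """Load molecules, define an equilibration protocol, and start MD."""
    if BSS is None:
        raise RuntimeError("BioSimSpace is not available")

    # Load the molecular system (the file format is auto-detected).
    system = BSS.IO.readMolecules(list(files))
    print("Loaded %d molecules" % system.nMolecules())

    # Define an equilibration protocol and launch MD with whatever
    # engine BioSimSpace finds installed on this machine.
    protocol = BSS.Protocol.Equilibration()
    process = BSS.MD.run(system, protocol)

    # The process can be queried live while the simulation is running.
    print("Running?", process.isRunning())
    print("Time so far:", process.getTime())
    print("Total energy:", process.getTotalEnergy())
    return process
```

The returned process object is what makes the session interactive: each query cell in the notebook simply calls another method on it while the engine keeps running in the background.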
And if we view the system, what you should see now — and I apologise if this is a bit slow — is basically the 3D snapshot at this particular time. So if I zoom in, you can see this is the molecule being simulated. And if I rerun this cell, it will get the latest snapshot, which we can then look at. Now, the simulation is producing data; the data is sitting in the cloud, and it's producing a trajectory. So here we can grab the trajectory out — this is the trajectory produced so far in the simulation. How many frames have we got? We have five frames, and we can pull those frames out; here they are. Now, the way BioSimSpace works is as a wrapper around lots of other tools. So by default we get trajectories in MDTraj format, but it's very easy for us to also get them out in MDAnalysis format, so you have all of those tools available to you to analyse this kind of data. And of course, when you have that data, you can run an analysis — for example, here we are calculating a root mean square deviation. That's what's been going on so far, and that's basically what BioSimSpace does. But now let's go into detail: what actually is it? So, why BioSimSpace? Ultimately, this project started because I was watching researchers trying to work out how to run molecular simulations. Typically, a researcher will say: how do I do something? How do I add in a loop? How do I minimise a system? How do I parameterise a molecule? And the way that someone finds out how to do it is to search the web and go through the search results one by one. Generally, you'll find a tutorial, a blog post, or a script. You'll go to that page and follow the instructions, and it will either work or it won't. If it doesn't work, you go back to the search results and choose the next one. And it will work, or not.
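As a concrete illustration of the RMSD analysis mentioned above — in practice you would let MDTraj or MDAnalysis do this — the root mean square deviation between two coordinate sets is a one-liner over the atoms:

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equal-length lists of
    (x, y, z) coordinates, without any alignment step."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    # Sum the squared displacement of every atom, then take the mean and root.
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))
```

Real trajectory tools also superimpose each frame onto the reference structure before computing this; the sketch above deliberately skips the alignment step.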
If you're lucky, you find a blog post or some instructions or a tutorial that actually does work, and you're happy — you've done what you wanted to do. Sometimes, though, you find instructions that look like they work: everything completes without error, but actually it hasn't really worked. You've basically made a protein model that's rubbish, but you can't tell, so it has failed badly. But most of the time, to be honest, you never find something that actually works, which means you go through all of the search results and ultimately give up, because you can't find out how to solve the problem. And this is a problem for us as a community of molecular modellers. We're really not good at sharing instructions on how to do things and sharing best practice. We make it very difficult for newcomers to learn how to perform basic molecular simulation tasks. So with BioSimSpace we decided to try and solve this. We got together as a community — essentially it's a collaboration between the UK and the US, and it was a response to one of the EPSRC software flagship funding calls. It's a collaboration between myself and Adrian Mulholland at Bristol, Julien Michel in Edinburgh, Charlie Laughton at Nottingham, Francesco Gervasio at UCL, together with colleagues at Evotec, and then John Chodera, David Mobley, Michael Shirts, and D3R in the US. We looked at the field and asked: what is actually the software problem? Why is it so difficult for us to share knowledge about how to run simulations? And it's because, if you look at all of the software that we have in our community, it really doesn't fit together very well. If you write a tutorial on how to do something using GROMACS, but you as a user are working with AMBER, then it's really hard to mix and match between these different codes. Effectively, there are lots of gaps between the codes.
And the result, because we have very little compatibility and interoperability between all these tools, is that we end up writing scripts that are really brittle. They're bespoke for particular codes and workflows, and so it's really hard for them to be kept up to date or shared. Now, one way you could try to solve this problem: as a community, we could all get together and say, OK, we have many different file formats, different codes, different ways of working — let's get the great and the good together and create a brand new standard format for biomolecular simulation, a brand new standard simulation package, a standard set of workflows, and everybody will be encouraged to use the standard. The problem with this top-down approach is, of course, that it's difficult to get everybody to sign up. What would happen is that instead of creating a new standard that solves everything, we would just create yet another set of file formats — yet another set of incompatible, non-interoperable things. So we think the right solution is to work with what we've got. Instead of trying to replace all of the tools that we have in the field, we need to create the shims that fill in the gaps between all of these codes, so that we can translate from one file format to another and make sure all of the tools can work together — make it easy to plug them together into one thing. And that is BioSimSpace. Our solution is bottom-up: BioSimSpace is a collection of shims that make it easy to plug together all of the existing codes that we have. So, you saw when I ran the simulation in the notebook at the beginning, I didn't say which MD package to use — BioSimSpace found an available MD package and just used that. I didn't have to say which trajectory analysis software to use — BioSimSpace found the analysis software and just used that.
Now, all of the tools that we wrap are exposed in Python — we decided we would do everything in Python. Why? Because Python is insanely popular now. I call it the Facebook of programming languages, because effectively everyone's in Python because everyone else is in Python. And it's also quite a nice programming language. What we've been doing is ensuring that all of the tools that we use in the community can be accessed through a common, simple API — i.e. we have the same interface to run dynamics in all dynamics packages, the same interface to do alignment, the same interface to do trajectory analysis. And this means that we can write BioSimSpace Python scripts that can act as workflow nodes that plug into existing workflow engines — for example into KNIME, Pipeline Pilot or ExTASY — or run from the command line, et cetera. This means we can write scripts which you can share, use with your own codes as they're installed, and then run from the command line, from a workflow engine, or interactively in a Jupyter notebook, as I showed at the beginning. So, you might think this is something people have already been doing. For example, suppose I wanted to write a workflow node that would load a protein–ligand complex, run an equilibration for a certain number of steps, calculate the RMSD with respect to the starting structure, and then output the equilibrated structure and a plot of the RMSD. This is something that is very easy to write if you choose a particular software package — I could easily write a bash script or a Python script that does this if I chose to work with AMBER or with GROMACS. The key thing about BioSimSpace is that it lets you write one Python script that will do this workflow in any file format, using any available MD package and any available trajectory analysis tool.
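The "find an available MD package and just use it" behaviour can be illustrated with a toy dispatcher. This is not how BioSimSpace is implemented internally — the executable names, the preference order, and the function name here are assumptions for illustration only:

```python
import shutil

# Hypothetical preference-ordered map of engine name -> executable to probe.
DEFAULT_ENGINES = (("gromacs", "gmx"), ("amber", "sander"), ("namd", "namd2"))

def find_md_engine(engines=DEFAULT_ENGINES, which=shutil.which):
    """Return the name of the first MD engine whose executable is on PATH.

    The `which` lookup is injectable so the probing can be faked in tests.
    """
    for name, exe in engines:
        if which(exe) is not None:
            return name
    raise RuntimeError("no supported MD engine found on this machine")
```

A common API then needs only this one decision point: every `md.run(system, protocol)`-style call can dispatch to whichever driver the probe returned, and the user never has to name a package.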
Then, when the workflow node outputs files at the end, it outputs them in the same file format as was used for input. So you can take individual nodes and effectively rewrite them in BioSimSpace without affecting all of the other nodes in the whole workflow. So if we visualise this workflow — the one where you put in a protein–ligand complex and out come the equilibrated complex and the RMSDs — this is what a workflow node looks like. The idea is you write a BioSimSpace node that does the work within this box, and that node can then be plugged into KNIME or Pipeline Pilot or ExTASY, run as a Jupyter notebook, or run from the command line by passing the inputs as arguments. So this is a BioSimSpace script which actually does all of that. I'm not going to walk you through all of it — it's obviously a lot of Python, though not that much considering how much it's doing. You can see here we're importing BioSimSpace, and we're creating a node which is going to do the equilibration. One of the key things we want in BioSimSpace is that nodes have authors. You can see here we've added the author of this particular node, and we've added the licence. That way, people who write BioSimSpace scripts can be credited with their work: you know who actually put in the effort of writing it, and when they share it, they get the credit. Nodes have inputs — here we're inputting the complex as a file set. Nodes also have outputs — we say it's going to output a file which will be, say, the RMSD. This line reads the molecules from those input files into the system; readMolecules can read pretty much any molecular file format, and it auto-detects the format. Here we're now getting a protocol for equilibration. And then, when we do MD.run, what's happening is that BioSimSpace asks: which MD packages do I have installed on this computer?
For an MD package it finds installed, it writes the correct input files, writes the command files, and submits and runs the simulation. Once it's submitted and running, you've got a process, and that process gives you an interface to get things like the trajectory — so here we're getting a trajectory out. Then, at the end of the node, we produce some output: we say, output from the node some molecules that we save. And you can see here we're saving the molecules in the same file format as they came in as input. When the script is finished, we validate that the node has actually produced everything that it promised it was going to produce — and if it hasn't, you get some errors. That node runs as a command line script, it will eventually run within KNIME, and, as you saw when I demonstrated at the beginning, it runs as a Jupyter notebook. So this is the plan — this is what we're trying to develop. How much of this actually exists now? Well, currently we have written many of the file conversion parsers: we can read and write files in the AMBER formats, the GROMACS formats, the CHARMM formats, and PDB and Mol2. We've also written drivers for MD programs, particularly AMBER and GROMACS; we're working on CHARMM and we'll be doing NAMD as well — I think NAMD is mostly done now. We've written interfaces for the trajectory analysis tools MDAnalysis and MDTraj. And we've written the node interface for the command line and Jupyter, and we're most of the way through KNIME. The other thing we've got is a molecular search parser, which enables you to search for bits within molecules. This is very useful — for example, we can search for all molecules which contain alanine and are within five angstroms of a ligand. Most of the work going on at the moment is on setup: we're now trying to get automatic setup of molecules.
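The node behaviour described earlier — declared inputs and outputs, with validation at the end that every promised output was actually produced — can be mimicked in a few lines of plain Python. This is a toy model of the idea, not BioSimSpace's actual Node class; all names here are invented:

```python
class Node:
    """Toy workflow node: declare outputs up front, validate at the end."""

    def __init__(self, description, author=None, license=None):
        self.description = description
        self.author = author      # credit for whoever wrote the node
        self.license = license
        self._declared = set()    # outputs the node promises to produce
        self._produced = {}       # outputs actually set while running

    def add_output(self, name):
        self._declared.add(name)

    def set_output(self, name, value):
        if name not in self._declared:
            raise KeyError("output %r was never declared" % name)
        self._produced[name] = value

    def validate(self):
        """Raise if any promised output is missing, as described in the talk."""
        missing = self._declared - set(self._produced)
        if missing:
            raise RuntimeError("node did not produce: %s" % sorted(missing))
        return True
```

The point of the pattern is that the declaration doubles as documentation: a workflow engine can read the declared inputs and outputs to build a form or wire up a pipeline, and `validate()` turns a silently incomplete run into a loud error.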
This means writing thin wrappers around tLEaP, antechamber, parmchk, SQM and pdb2gmx. We've been solvating molecules using tLEaP and gmx solvate. And we've taken code which came out of the FESetup project to enable us to automatically map ligands together for single topology free energy calculations, and we're writing drivers for single and dual topology free energy calculations in AMBER, GROMACS and SOMD. We hope to have that ready by around the end of August, for a release just before September. In terms of how long the project's been going, we're now about 10 months in, the project will run until the end of 2019, and we're currently on target. So I hope this is all going to produce things that everybody will need and want. Just as an example of setup, this is a BioSimSpace script, which is nearly working, which will do automatic setup of some molecules. Here we are loading some molecules in GROMACS format — a .gro and a .top file. We're extracting the protein and the ligand from the file, and then we're parameterising the protein using the ff14SB force field; in the background this goes off and uses pdb2gmx or tLEaP, depending on what is available. On the next line it runs a GAFF parameterisation using antechamber and SQM, using an advanced setup protocol we developed that copes with a large number of types of molecules — so it's not just a simple run of antechamber. Then we solvate the system in TIP3P — this is a wrapper around tLEaP or gmx solvate. And finally, down here, you can see we're doing MD.run with a minimisation and an equilibration using standard protocols. At the end we can output the system, which has been fully parameterised. So, this is a software project — but software should not be developed in a vacuum.
You need actual science to do with the software, so that it drives you on and you hit your deadlines — and it also helps you find bugs, and bugs are very important to find. So we have two grand challenge applications that we're running during this project. The first, which will run from September this year, is an automatic setup and running of binding free energy calculations challenge. This is why we're collaborating with the D3R group: they're going to give us access to the D3R data sets, and this will enable us to run binding free energy predictions using a range of different tools and a range of different force fields, and run a comparative study to say which of the tools, methods and force fields actually give the best predictions in this blind predictive challenge. We will then run another grand challenge at the beginning of next year, which is when we write the metadynamics layer for the code — basically automating the running of metadynamics simulations for looking at binding kinetics. Now, when I started this talk, I showed you BioSimSpace running in Jupyter. When we first conceived the project, we were thinking of BioSimSpace as being useful from the command line and from within workflow nodes, but what we quickly discovered is that people really liked using it in Jupyter, and Jupyter notebooks seem to be a very popular way of getting access to it. So how does this work? Well, for those of you who don't know Jupyter, it's basically a web page which gives you access to a Python interpreter. The way it works is that you have a web page running on your client — it could be your iPhone, iPad or your laptop; I could have run the demo I started with on my iPhone, and it works fine. And then you have Python running on a server: the Jupyter notebook is on the client, and a Python kernel runs on the server somewhere.
What happens is that as you use the notebook, you're sending Python snippets over the network to your Python kernel. Those snippets are executed on the server, producing some output and a result, and then a renderable version of the result is sent back to the notebook to be rendered. What this means is that you've effectively separated the user input and the rendered output onto the client, while all of the compute — and, more importantly, all the data that you need to do the compute — stays on the server. This makes it a very efficient way of using a piece of simulation software. It means that instead of having to install everything locally, you can just go to a web page and everything is there. And because all your data is on the server, you're not moving data backwards and forwards between your laptop and the server. This is particularly visible when you think of the 3D molecule renderer that I was showing at the beginning of this talk. That's based on something called NGLview, and effectively what NGLview is doing is analysing the data, generating 3D representations as WebGL, and sending only the WebGL data across the network to the client running on your notebook or your phone, which then renders it. So here — and this is where you'll be thinking, oh, this is really jerky, it's only showing a few frames per second — actually this is running at 30 frames per second, even though the client is sitting at home in Bristol and the server is running on the cloud in the eastern United States. And what you can see, if you can just catch it in the top right here, this little block shows the little bits of data that are transferred: only the data that's needed for the visualisation is actually sent across. Now, the way that we run BioSimSpace using Jupyter is that we run it on the cloud using something called Kubernetes. Kubernetes is a container orchestrator.
Many of you may have heard of Docker. Docker is a way of containerising your application so that you don't have to worry about its dependencies — everything is contained within one Docker container. What Kubernetes does is provide a layer on top of Docker which orchestrates those containers. So you have a set of servers, typically running in the cloud — we have cloud servers running in the eastern United States — and containers are allocated to the servers dynamically; a container running on a server is called a pod. The container orchestrator networks all of the containers together using named services, and exposes those containers to the public network using a load balancer. So when I connected to my Jupyter notebook, it actually connected to the load balancer, and a container was automatically placed onto a server to serve me. Now, if demand for your service increases — as more people log on — Kubernetes will expand the number of pods which are spawned to match that demand. If demand for your service decreases, pods are destroyed. So effectively the pods grow and shrink as the work allocated to your service increases or decreases. If one of your pods fails or goes silent, the Kubernetes orchestrator will automatically kill the pod and restart it. It also enables you to upgrade pods in the background, by turning them off and bringing up a replacement, and you can do things like A/B testing. So it's really useful as a way to install software and then run services based on that software. Essentially, if all of that was too long to read: you can think of Kubernetes as effectively a scheduler for containers. So this is how we use Kubernetes on the cloud to run BioSimSpace.
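The scaling behaviour just described — pods growing and shrinking with demand, within limits — can be modelled in a few lines. This is a toy model of what an autoscaler does for a service like this, not Kubernetes itself; the one-pod-per-four-users ratio and the pod limits are invented for illustration:

```python
import math

def desired_pods(active_users, users_per_pod=4, min_pods=1, max_pods=15):
    """How many pods a toy autoscaler would ask for, clamped to limits.

    Grows with demand, shrinks when users disconnect, and never drops
    below min_pods or exceeds max_pods.
    """
    wanted = math.ceil(active_users / users_per_pod)
    return max(min_pods, min(max_pods, wanted))
```

With these invented numbers, 60 simultaneous users ask for 15 pods, 7 users ask for 2, and with nobody connected the minimum of one pod stays alive so the next visitor is served immediately.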
So there's a website, workshop.biosimspace.org, which is the thing I connected to right at the beginning of this talk. I connected over HTTPS, and when I connected, the Kubernetes service spawned a pod which contained the Jupyter server and all of the BioSimSpace software, which was then able to run and interact with me. This is all running on the Microsoft Azure cloud, and it's been configured to support up to 60 simultaneous users at once. The cost of that is about 11 pounds per day, which is about 4,000 pounds per year. We think it's a really cost-efficient way of enabling lots of people to use the software without having to install it all themselves. But this brings us to BioSimSpace on the cloud, and the big question of who pays. 11 pounds per day is not too bad, but still, someone has to pay it. So how can we make BioSimSpace more self-sustaining? This is where partnership with the cloud providers comes in. We've formed a partnership with Microsoft Azure and also with Oracle Cloud Infrastructure to develop BioSimSpace Cloud. The aim is to allow upfront charging of cloud compute and storage costs on a per-simulation basis. So instead of you having to think, OK, I need to rent servers and rent disk and rent network and IP addresses and all of those little details, you will have one cost: the cost to run the simulation. The reason we can do this is that, if you remember, when I did MD.run it was running a system with a particular protocol, and this means that BioSimSpace can actually estimate upfront how much compute and how much storage a simulation is going to take. That means we can predict how much it will cost to run on different clouds and different servers, and then pick the cheapest or the most efficient one for you. This means we can present you with a cost to run the simulation, upfront.
You can then decide whether or not you want to accept that cost, and if you do, that cost is guaranteed. You're also able to set a daily cap on how much you want to spend per day, and things like the maximum run time for a simulation. With these constraints — the daily cap and the maximum run time — the best resource can be allocated to you depending on what you've set. So, for example, this is what a BioSimSpace Cloud script — what we're working towards — would look like. At the top of the script you log on to your BioSimSpace Cloud account. With your account you'll have daily limits: you could say, I don't want to spend more than 10 pounds per day, and I don't want any simulation to last more than one week. We then create the system and the protocol, and here we do bss.md.run for a system and a protocol. BioSimSpace on the cloud then goes: aha, I can see you're trying to run this type of MD simulation with this number of atoms; I know that using this particular package it will take this long to run on a cluster of this size. So it goes through all of the different resources available, works out which is the most cost-effective for what you've got, and then asks: does this actually fit into your cost and time constraints? If it does, the simulation runs and everything is fine. If it doesn't fit the constraints, you get an exception — a cost-breaks-constraints error — and it will ask you for permission to break one of the constraints. You'll get an email request or a console request saying: can I break one of these constraints? Can I spend more money per day, or are you willing to wait longer for the simulation? Once you've decided what you want — because this call is blocking — once you give your permission, we get a permission object.
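The resource-selection and constraint-checking logic described above can be sketched as follows. Everything here is an assumption for illustration — the rate table, the exception name and the cost model are invented, not the BioSimSpace Cloud API:

```python
class CostBreaksConstraints(Exception):
    """Raised when no resource fits the user's cost cap and time limit."""

# Hypothetical resources: (name, cost per hour in pounds, relative speed).
RESOURCES = (("small", 0.5, 1.0), ("medium", 2.0, 3.5), ("large", 6.0, 8.0))

def plan_run(base_hours, daily_cap, max_hours, resources=RESOURCES):
    """Pick the cheapest resource that fits the user's cap and time limit.

    `base_hours` is how long the job would take on the slowest resource;
    faster resources finish sooner but cost more per hour.
    """
    viable = []
    for name, rate, speed in resources:
        hours = base_hours / speed
        cost = hours * rate
        if cost <= daily_cap and hours <= max_hours:
            viable.append((cost, name, hours))
    if not viable:
        raise CostBreaksConstraints("no resource fits the cap and time limit")
    return min(viable)  # cheapest viable option: (cost, name, hours)
```

Note how the two failure modes in the talk fall out naturally: a generous time limit picks the cheapest (slowest) machine, a tight one forces a faster and more expensive machine, and if nothing fits the exception is where the "ask the user for permission to break a constraint" step would hook in.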
If the permission is granted, the simulation can then run using the new constraints that you've set. Another thing that could go wrong is that you don't actually have any money to pay upfront for the simulation; in that case you get an insufficient-funds error, and an email goes out requesting funds — because you have to pay somehow — and we block here, waiting for the funds, for a maximum of 48 hours. If the money arrives, the simulation runs; if not, we can't run anything. Now, the way this actually works on the back end — this is what we're setting up — is deliberately designed as a multi-cloud system. We chose to work with two cloud providers to make sure we don't adopt any technologies that lock us into a single provider, so everything we've got is standards-compliant, open source, et cetera, and we could swap the cloud providers around and bring in other partners as needed. So here we have, on the left, BioSimSpace running in a Jupyter notebook on Kubernetes, and we've got the bss.cloud login logging on to the user's cloud account within the script. Once they've logged on, they get an access key, and this access key can then be used to authenticate this user on all the other services. The first thing that happens, once the person wants to run the MD simulation and we've worked out that we do have a resource and it fits the constraints, is that the access key is used to connect to an object store and create a bucket for that user's simulation; the input files which are generated by BioSimSpace to run the simulation are then transferred to the object store under that bucket. The next thing that happens is that the notebook running on Kubernetes connects across to a function service.
A function service is something like a lambda service — you may have come across Azure Functions or AWS Lambda — basically a function which automatically allocates resources on the back end to support its own execution. So our MD simulations are represented as functions on a function service. The authentication key is used to connect to the service, authenticate with it, and say: we want to run this function, and here is the bucket in the object store where all of the input data lives. Once the function service has authenticated, it uses that key to connect to the object store and copy all of the data through to a fast HPC disk — basically a POSIX disk — ready for the simulation. At that point it can use the servers that were allocated to that function to run MD, and as it runs MD on those servers it writes output back onto the HPC disk. At the same time, a copy service starts running, which uses the authentication key that was provided to copy the output dynamically back into the object store. As a user, you can interact with this object store dynamically and interactively: as data appears in it, you can grab it out, dynamically plot it and dynamically visualise it, in the same way you saw me doing at the beginning with the demo — you can get the energies, the graphs and everything else. Once the simulation is finished, the copy process does a verification of all of the data: it checksums everything to make sure all of the data from the HPC disk has been correctly copied across to the object store. The object store is automatically backed up, so the data is now safe, and that means we can then shut down the service running on the function service.
Then, running in the notebook, we can use the key and the bucket location in the object store to run all of the analysis on the results, which means the data is available to look at. Now, the data in the object store is written in a protected manner, so it's read-only, and this means that we can safely reuse this data from other scripts without fear of modification or deletion. So simulation outputs that are produced sit in the object store, read-only. And in the same way that we can estimate the cost to run a simulation, because we know how long it will take, we can also estimate how much data is going to be produced. So the upfront charge you have, of 10 pounds per day or whatever, is also covering one year's storage of the data in the object store. The key into the object store is effectively a digital object identifier and will be converted into a DOI, enabling you to re-access that data from other scripts, or to publish it and make it accessible to others. We are going to produce a web console which will allow researchers to manage outputs. For example, you could use that DOI and control access permissions, so you could enable other people to share it, or make it public. You can delete the output and receive a pro-rata storage refund. You can pay for extra years of storage, because only one year is paid for upfront, or, if you want to archive the output, there'll be a one-off charge for archiving. And this is all being produced in partnership with these two cloud providers, because you need the cloud providers to provide the back ends to do all of this, and we have engineers beginning to build this now. The interesting thing about BioSimSpace for us, as we've gone on this journey of producing it, is that it really feels like it's on the cusp of a change in the way that we do computing.
When we were starting BioSimSpace, we were thinking of it very much in the batch computing world, where essentially you would have a simulation, you would submit it to a queue, you'd wait, the simulation would run, and eventually you'd get some results back. But as we've evolved, we've realized that we're moving now into this on-demand computing world, where we have notebooks and interactive visualizations and interactive data analysis. And effectively you can begin thinking of a Jupyter notebook not as being just a notebook itself, but also as a repository where you have the documentation, the simulation, the analysis and the results all together, in what's really an executable, reproducible, interactive paper. When I showed this to a group of PhD students, talking about the difference between on-demand computing and batch computing, one of the PhD students said: you've built the Netflix of simulation. And I think this is a really apt metaphor. Effectively we're moving into a world where simulations are streamed on demand. You access them on demand. You're not going to wait in a queue. And there's no reason why on-demand computing has to sit on a cloud, but a cloud enables this to happen. I think as we move forward, not just with BioSimSpace but with other simulation packages as well, we need to think about: how can we make our batch computing systems behave like on-demand computing systems? How can we handle user accounts that can move between running in the cloud and running on batch? How can we handle movement of data? How can you run custom Docker images, and how can you do usage and cost accounting? So that was a run through the whole of BioSimSpace. I just want to give some acknowledgement to the team. Most of the work in terms of coding has been done by Lester Hedges, with Antonia Mey.
Obviously there are the other BioSimSpace partners, Shidia, Adrian, Charlie and Francesca, who have been really helpful. We thank EPSRC for funding under that grant number. CCP-BioSim and HEC-BioSim are basically the umbrella organizations under which all of the community has been brought together, and the community under CCP-BioSim and HEC-BioSim has massively supported BioSimSpace. They've provided us with examples, workflows, et cetera, and we're working with them to try and find a way of enabling BioSimSpace to do all of their work. I also particularly want to thank Kenji at Microsoft for giving us access to Azure and to the engineers at Microsoft, and also Phil Bates and Gerardo at Oracle, again for giving us their engineers and access to the Oracle Cloud. And thank you to BioExcel for inviting me and for hosting this webinar. So with that, I think I'm maybe a little over time, and I'll hand over to the Q&A session.

Thank you, Chris. That was a really interesting presentation. I would welcome all our listeners to type your questions into the questions tab in the control panel, and then I will give you the microphone to ask your question directly. I was wondering, Chris: you can have multiple MD packages, for example, which could run a simulation, and BioSimSpace will automatically select the one needed. But how do you distinguish between small differences in the methods that are applied, the integrators, et cetera, within each package?

So we're in control of that. We're deliberately wrapping things at a slightly higher level. What we're saying is: you want to do dynamics for a certain number of steps, you want to do an equilibration for a certain number of steps. And so we're choosing a package based on that request.
We're not giving you the option of making different decisions within the integrator, because that's a very low-level decision which could be very specific to an individual package. Now, it is possible within BioSimSpace to pass that information down. You can control it and say: I would like a specific integrator, I would like a specific set of parameters; don't use the default protocol, use this very specialized protocol. But as you do that, you eliminate the ability for BioSimSpace to choose. So what will happen is it will then go: okay, you've specified something that only exists in GROMACS, I now only have the choice of using GROMACS. And if GROMACS is not installed, then I just have to tell you: sorry, it's not installed, you can't use it. What we're trying to do is smooth over these differences and describe simulations at a slightly higher level. The reason for this is that BioSimSpace is very much aimed at the more general community. With the more general community, we want things to get done, and we don't really need to go into the details of exactly how they are done. The tiny details of how you do it, that's what we're letting the community decide through how they've defined these protocols.

Yes, thanks a lot. So we have several questions now. The first one is from Alexandra Simpuller. Hi, Alexandra, can we hear each other?

Yes, can you hear me?

Yes, you can speak.

Yeah, I was just wondering. That's a very good idea, what you do: you write the wrapping software in a good format that people know, and obviously you need an MD simulation at the other end to do it. Let's say you have AMBER, but not everything is free. How do you handle this? Is this included in the 10 pounds I pay, and do you do software as a service?

So there are two questions there. The first one is: BioSimSpace uses what is available.
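The narrowing behaviour described here, a generic protocol can run on any installed engine, while an engine-specific option restricts the choice, could be sketched as below. The engine names and the capability table are illustrative, not the real BioSimSpace capability matrix.

```python
# Hypothetical table of which engines support which integrator options.
SUPPORTED_OPTIONS = {
    "GROMACS": {"default", "sd-integrator"},
    "AMBER":   {"default"},
    "NAMD":    {"default"},
}

def choose_engine(installed, integrator="default"):
    """Pick any installed engine that can honour the requested option.

    A default protocol leaves every engine in play; a specialised option
    narrows the candidates, and if none of the capable engines is installed
    all we can do is report that.
    """
    capable = [name for name, opts in SUPPORTED_OPTIONS.items()
               if integrator in opts]
    for name in capable:
        if name in installed:
            return name
    raise RuntimeError(
        f"only {capable} support {integrator!r}, and none of them is installed")
```

So `choose_engine(["AMBER"], integrator="sd-integrator")` would fail with "sorry, it's not installed", exactly the situation Chris describes.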
So, if we ignore the cloud part, BioSimSpace also runs from the command line on your cluster, or it runs as a node on your existing institutional cluster, et cetera. So if you already have AMBER installed, because you've already paid for AMBER, and BioSimSpace sees you've got AMBER, then it can use it. With BioSimSpace running on the cloud, obviously, we have an interesting world where effectively what BioSimSpace is doing is providing software as a service. So with this upfront charge of saying, okay, this simulation could cost eight pounds of compute, what we're actually saying with the cloud providers is, one, that they guarantee that cost, but two, we want to add a tax to it. So we want to say: it's going to cost eight pounds of compute plus data, but let's now add another pound or another two pounds, which will then be the cost for the software. For the open source software, because we basically track everything that we're using, we'd almost like to pay royalties. So effectively, if you're using GROMACS, it would be good to send royalties back, in the same way you get royalties as a singer on Spotify. If you're using something like AMBER, we would like to do deals with the commercial software providers: can we find a way to turn your existing site license into a pay-as-you-go license? And I think this is one of the big challenges that the cloud presents to the commercial software industry, because it very much pushes you towards a per-use, pay-as-you-go license model rather than an upfront site license model. Luckily, most of what we want to do is open source, so we don't really hit this too much, but there are a few things where we will need to be negotiating these kinds of cloud licenses. I hope that answers your question.

Yes, perfect. Yeah, thank you.

So we have another question from Rochelle Yemstro. Rochelle, if I can turn on your microphone.

Hello. Hello. Hi, Christopher.
Very impressive presentation. But I'm wondering: if you're using the system in the cloud, it's set up to be very automated, and my experience in using AMBER and doing a lot of simulations is that very often, especially if you're using non-standard amino acids or different DNA adducts, you really have to customize the parameterization. Is it going to be possible, through your Jupyter notebooks, the cloud and your scripts, to include specialized parameterization and to go back and forth optimizing parameters when parameters are not found?

Yes. This is what I tried to hint at: we're not just putting a simple wrapper around antechamber or sqm, et cetera, because you're right, parameterization is really difficult, and you do need some interactivity, or you need things that catch problems. So we're building this on top of a project called FESetup, which came from the CCP-BioSim and HEC-BioSim communities. In there, we basically caught all of the errors, or a large number of the errors, you can get in parameterization and worked out how to fix them automatically. And for the things you couldn't automatically fix, how could you then make it easy for the user to say what they want to do? BioSimSpace is building on top of that, because this is almost like a continuation project. With the system object, when you load up molecules, you can parameterize just by calling the parameterization functions, but you have full control to go into individual atoms: you can set parameters by hand, or you can pass in parameter files and say, use these exactly, don't use an automatic parameterization. And also, when you run a simulation or a parameterization, you may have noticed me talking about protocols.
A protocol is a way of specifying how to do something, and what protocols do is provide a way for the community to really control this high-level representation of running a simulation. So with protocols, different groups can say: parameterize DNA using this protocol, parameterize a protein using that protocol, parameterize these weird things using this protocol, et cetera. And what we hope to do is collect these protocols together and effectively use that as the store of best practice that we can then share amongst each other. Because at the moment, we don't really share these protocols with each other in any consistent way. That's where the sharing will happen.

And I just want to make a comment. I think that this would be a really powerful tool for teaching how to do simulations when people are first starting out, especially through the use of the scripting and the Jupyter notebooks and accessing a cloud. Have you thought of developing any educational tools around this system?

Yes, we actually built the Jupyter notebook work for workshops. What I showed you at the beginning was an excerpt from a molecular simulation workshop. And indeed, if you go to workshop.biosimspace.org, or you go to biosimspace.org and click on the link from there, you'll actually see these are teaching workshops. It was from the students using the teaching workshops in Jupyter that we suddenly thought: actually, this is not just good for teaching, it's good for real simulations too, so it kind of went the backwards way around, actually. But no, I think this would be fantastic for teaching, and again, a way of us being able to share blogs and share best practice and actually bring new people up, so we don't have them just frantically googling and not knowing how to do anything.

And you said you can access that through... what did you say, workshops?
Yes, so it's... I think I can type it in here, but it's basically https, workshop.biosimspace.org, slash hub, slash the tmp login thing. If I click on that, I think I get to the page. Maybe we'll just... sorry, it's workshop... we'll just get this wrong. We can add this later to the slides; that will probably be easier. Yeah, we can add it to the slides, but essentially it's workshop.biosimspace.org. Because there are no user accounts with it, you just automatically go in, and you can then start using it and running with it. It is linked within the presentation and on our website.

Oh, great, thanks.

Thanks, Rochelle. And we have a few more questions from Alexandra, actually. Alexandra, do you have a couple more? Okay... anyway, I will read them.

Yeah, can you hear me? Yeah, sorry. So one question you answered already, the software-as-a-service business; I think this is the way to go. And then of course you mentioned your project is stopping in 2019. What happens after it? At the moment, I guess, you have funding to develop, so what do you do next? Do you do a spin-off, or where is this going?

That's a very good question. So we should be able to finish most of it. What we have is a very detailed software engineering plan, which we're on target with, and that has most of the code developed and finished by around February next year. That then leaves us with about 10 months for user support, documentation, debugging, et cetera. There are two avenues going forward. We work with commercial software providers, and we're working with cloud providers. What I'm hoping is that effectively the tax on using it in the cloud is sufficient to keep the front end going and keep development going through there.
And then we've built enough goodwill in the community that ultimately we're not the ones having to wrap everyone's code, and we're not the ones having to develop all the protocols; instead it becomes almost self-sufficient, where, as the developer of a new software tool, you want to wrap it yourself to make it available within BioSimSpace. And I think the motivation to do that is this idea of, as I said, royalties: effectively, if we're using it and your code is used by somebody else, then even if your code is open source, you should be getting royalties from that cloud usage. So I think that's how we'll get sustainability. I mean, ultimately it shouldn't be that we as a group are the ones running it and are this sort of single point of failure. We really want this to be a community thing.

That's the thing, but somebody needs to check it. Another idea, of course, could be that the wrapper is written by the actual software company, because once you could put this on a marketplace, software vendors could see this, because you have crossovers as well with materials modeling and whatever. That's what I can see, because you can put your protein in somewhere on top and give it a different twist. So, for example, the software vendor writes the wrapper.

Yes, as I said, we are working with software vendors who want to put their things into this, and this is why we did this in partnership with the cloud providers. I think it's the cloud providers actually really providing it, making it part of their system, and getting almost like a certification scheme, which means that if you run a piece of software the royalties go to the right place. That is how we get trust that, say, one software vendor won't try and steal everything and run away. So I think this would work if we have trust in the community that we're all doing this together and everyone's being rewarded.
I think it would fail if it ends up being one company or one group who are doing it to the exclusion of everybody else.

Okay, thanks.

Thank you, Alexandra. Now we have a question also from Ian. Ian, can you hear us?

Yeah, can you hear me? I'd like to hear a bit more about the metadata side of things. You did touch on it. My specific question is: how is the relevant metadata managed which records how the simulation was run? I'm hoping that that's captured.

Yes, obviously that does have to be captured, and it is captured. Everything that's run is run through Python objects, and essentially what you have are Python objects which are data-preserving. So when you load a molecule up, it's loaded effectively into a BioSimSpace molecule and a BioSimSpace system, and that BioSimSpace system collects all of the metadata regarding the molecule that's been loaded from those files. When that then gets passed into md.run, the protocol contains all of the metadata and all of the data about how the simulation is going to be run. It then runs and processes the system, and the process object that's created contains, as a Python object, all of the data about how the thing is being run, and then effectively you're querying that process object. Now, when it comes to the end of the simulation, you're outputting things at the end, and what gets output really depends on what you want to keep. As a node, you're normally only outputting output files or graphs or things like that. Within the node, everything is saved, but when you go beyond the node, only the actual relevant output data is passed on. So the metadata tends not to leave the node. If you wanted it to leave the node, you'd have to create another output stream from the node and then pass that metadata around. But it's all stored within the node, so you can always go back and query it.
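The idea of data-preserving Python objects, a protocol carrying how the run is configured, and a process object recording it so everything stays queryable afterwards, could be sketched like this. These are toy classes mirroring the concept, not the real BioSimSpace `Protocol` and `Process` classes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Protocol:
    """How a simulation is to be run: kind of run, steps, extra options."""
    kind: str
    steps: int
    options: tuple = ()  # e.g. (("timestep_fs", 2.0),); a tuple keeps it hashable

@dataclass
class Process:
    """Created when a protocol is run; keeps everything needed to describe the run."""
    input_files: list
    protocol: Protocol
    engine: str

    def metadata(self):
        """Everything needed to say exactly how this simulation was run."""
        return {
            "input_files": list(self.input_files),
            "protocol": {"kind": self.protocol.kind,
                         "steps": self.protocol.steps,
                         **dict(self.protocol.options)},
            "engine": self.engine,
        }
```

Because the process object holds the protocol and inputs, querying `process.metadata()` at any point reconstructs the full provenance, which is the "everything is saved within the node" behaviour described above.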
We can look at how to do provenance logs; that would be a good way to capture it.

Yeah, I mean, there are logs. In the cloud version, everything is stored in the object store as well, and you can just go into the object store and query the things that are in it. But the decision we took, again, was about where the layer of user interest lies. The user is really only interested in something running, and there's a layer below which they really are not interested in what happens. So we try not to report things to the user that they don't care about. The data is there and it's stored, but we don't tend to show it to them.

So there's a great opportunity here. You've probably heard of the FAIR principles. The opportunity is to make use of systems like yours, engineered so that the user is supported in how they manage their data, including metadata, in agreement with the FAIR principles. Maybe at the moment most users may not be well enough educated in why the FAIR principles are important, but I think by supporting them through automated systems, because the FAIR principles are all about supporting machine readability, you would be helping the users, even though at this point in time they may not fully see the long-term benefits of getting that right. So I'd hate to think we're missing an opportunity on that.

Well, no, as I said, everything is stored, and the nice thing about doing this as a Jupyter notebook is that once you publish it, that's all the information. And because we've got the cloud connection, it means somebody else can rerun that exact Jupyter notebook and re-obtain the things they want to get out of it.
One of the things we think is quite interesting, which was pointed out to us, is that if something is published, and people are happy with sharing, you can actually take a simulation and a protocol, and because we can fingerprint that, you can ask: has this ever been run before? Has anyone ever run this protocol on this system before? If the answer is yes, just give them the results; don't run the simulation again. So it'll actually make it much easier for us to share simulation outputs.

That sounds good. Thank you.

Thank you. We also have a question from Adam. Adam, can you hear us?

Yes, hi, Chris. I'm Adam Carter, here at EPCC, the University of Edinburgh. I come from a sort of HPC centre, so I'm interested in the scale of some of the things that are being run in the cloud. You mentioned some figures for 60 simultaneous users. I wondered roughly what size of simulation, in terms of cores, this was based on.

So the 60 simultaneous users are users of the front end. If I go back to the slide... I think I can... can you see my slides? Yeah. Okay, so basically the 60 simultaneous users are the people on the Jupyter notebook side, and that really doesn't require anything; it's something like half a core per person, because work doesn't need to happen on the Jupyter notebook side. The expensive bit is the thing that happens on the serverless side, and essentially that's why we need to have people pay up front, because on the serverless side, a function could be one GROMACS run on 1,024 cores. So we're actually putting the core count as part of the function definition. What happens then is that it chooses: okay, should I run GROMACS on 256, 512 or 1,024 cores, whatever fits into the cost and time budget calculation. And so that's how you get the separation.
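The fingerprinting idea, hash the system plus the protocol so an identical (system, protocol) pair can be looked up and its stored results reused instead of re-running, could be sketched as follows. This is purely illustrative; how BioSimSpace actually fingerprints runs is not specified in the talk.

```python
import hashlib
import json

def fingerprint(system_files, protocol):
    """Hash the input files plus the protocol settings into one stable key.

    system_files maps filename -> contents (str or bytes); protocol is a
    JSON-serialisable dict. Sorting makes the key order-independent.
    """
    h = hashlib.sha256()
    for name in sorted(system_files):
        data = system_files[name]
        h.update(name.encode())
        h.update(data if isinstance(data, bytes) else data.encode())
    h.update(json.dumps(protocol, sort_keys=True).encode())
    return h.hexdigest()

class ResultStore:
    """Run a simulation only if this (system, protocol) pair was never run before."""
    def __init__(self):
        self._results = {}

    def run_or_reuse(self, system_files, protocol, run):
        key = fingerprint(system_files, protocol)
        if key not in self._results:   # never run before: actually run it
            self._results[key] = run()
        return self._results[key]      # otherwise just hand back the results
```

Any change to the input files or the protocol produces a different key, so only genuinely identical runs are deduplicated.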
So we can support 60 people who are interacting, doing data analysis and so on, and that's free; but for simulations, they're going to have to pay, because computing is not free.

No, absolutely.

I think the tax on the front end, or the tax on the service, which the cloud providers are agreeing to, can then basically fund all of the running of the front ends. The interesting thing, as I said at the end, is that this is a model which happens to suit the cloud very well, but it is not cloud-specific. It doesn't have to be this way on the cloud. For example, the serverless system could be ARCHER. There's no reason why you could not go and run this on ARCHER and use your existing set of channels to run it. This is why it's interesting, this transition point towards running things interactively and doing interactive data analysis and simulation, butting up against the batch computing view of supercomputing. I think things like this, and like Galaxy, and like the cryo-EM work with RELION, are producing these on-demand workflow managers and interactive notebook managers, and they're really making us think about how we provide compute to researchers in an on-demand manner.

So you're talking about going up to a thousand cores, which sounds fine for what a lot of people would need. In terms of the efficiency of those kinds of simulations, do you know how they compare on the cloud infrastructure to how they would run on a cluster or something like that?

So with the Oracle cloud, what they have is a bare-metal cloud, and all of the hypervisor work is actually done in hardware in the switch; they did a partnership with Mellanox. So you basically have InfiniBand-speed connections, I think it's about two times 40-gig, and the scaling is pretty good. We haven't run any benchmarks up to that size yet because, you know, that isn't as interesting as it looks. But they are very fast.
It's basically what you would expect from InfiniBand-connected, very fast servers, and each bare-metal server is about 56 cores, if I remember correctly. There is a dual-SSD disk in every single node, and they also have an HPC disk which is connected to all of the nodes over the same fast network, and they're very close to the disk as well. So it's supposed to be very, very fast.

Okay, thank you very much.

Thank you. Thank you, Adam. And we have time for just one more question, from Steen.

Hello, thank you for a very inspiring talk. I think it's exactly what we need as that kind of interface for new and advanced users, to get them to use large compute resources. I have one question, because you mentioned, in your answer to Alexandra, how you want people to start adding their own tools and so on. How easy is it to add your own tool if you have a tool that isn't currently covered but you want to fit it into the protocol? How are your packages wrapped, and then eventually, I guess, how would you get it into the marketplace, because you want it to be not just on your own installation but in the public one, so you can get those royalties and so on. So what do you think is the process for adding or, I guess, fixing those integration parts?

So I think the process is easier when you have a class of things that we've already got. For example, if you had an MD package which was not one of the ones we're wrapping, what you would do is look at the wrappers that we've already generated for AMBER, NAMD and GROMACS; then effectively here are the protocols you need to implement, and you work out how to turn each protocol into the input file for your package so as to replicate the spirit of that protocol, so it does what the user would expect for that type of protocol. So it's really just writing a small Python function that takes a protocol in and writes out an input file for your program.
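The "small Python function" Chris describes, one that takes a generic protocol and writes the input file your engine expects, might look like the following sketch. The "minimd" input format here is invented purely for illustration; a real wrapper would target your engine's actual input syntax.

```python
def write_minimd_input(protocol, path):
    """Translate a generic protocol dict into the input file of a fictional
    'minimd' package, preserving the spirit of the protocol (run type,
    step count, timestep)."""
    lines = [
        f"run_type {protocol['kind']}",
        f"nsteps   {protocol['steps']}",
        f"timestep {protocol.get('timestep_fs', 2.0)} fs",
    ]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return path
```

Implementing one such function per protocol type is essentially the whole wrapping job; the framework then handles launching the engine against the file it wrote.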
Once you have an input file for your program, ultimately what we're doing is literally just doing a subprocess run on that, but in a slightly cleverer way, to enable us to throw things onto the cloud, et cetera. But that's then just using the wrapper that you've written. In terms of packaging... packaging is a real pain, I have to say. We're currently fighting with Conda, trying to work out how we can get easy-to-install packages of GROMACS and AMBER and all of the other tools, and it is actually very, very difficult. What we're effectively aiming towards is something like a Linux distribution of molecular simulation tools. And so what would help us, if you wanted to add a tool, is if you have all of your installation instructions and all your dependencies, and make it as easy to install as possible, so that we can then build Docker containers that contain your application very cleanly. So it's kind of like: a Docker container of your application, plus how you can make a command-line input file from the protocol.

If we get the tool into a Docker container, then half the job is done for you, in a sense?

Yes. Effectively, with the serverless platform, these things are actually running Docker containers which describe the entire function.

What about file formats? Because often there could be mismatching file formats.

So this is why BioSimSpace has a lot of file format converters. We are continually adding in more file formats so that you don't have to worry about that. So if your tool takes in an AMBER NetCDF binary file as its input, we will convert anything you need into that file format for you. If you need things reordered in a particular way, we have things that can reorder the atoms, or ensure they have standard names, or things like that. So really the bulk of BioSimSpace is actually file format converters.
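A file-format conversion layer of the sort described, readers and writers registered per format, so any supported input can be turned into whatever format an engine needs, could be organised like this toy sketch. The formats and parsing here are deliberately trivial; real converters (as Chris notes) are the bulk of the work.

```python
# Registries mapping a format name to its reader or writer function.
READERS, WRITERS = {}, {}

def register(kind, fmt):
    """Decorator registering a reader ('read') or writer ('write') for a format."""
    def deco(fn):
        (READERS if kind == "read" else WRITERS)[fmt] = fn
        return fn
    return deco

@register("read", "pdb")
def read_pdb(text):
    # Toy internal representation: just the ATOM records of the file.
    return [line for line in text.splitlines() if line.startswith("ATOM")]

@register("write", "xyz")
def write_xyz(atoms):
    # A real writer would reformat names and coordinates; this one just
    # emits the record count followed by the records.
    return "\n".join([str(len(atoms))] + atoms)

def convert(text, src, dst):
    """Read `text` in format `src` and write it back out in format `dst`."""
    return WRITERS[dst](READERS[src](text))
```

New formats then plug in with one decorated function each, which is how a converter library can keep growing without touching the dispatch code.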
And I sometimes think we're very lucky to get funded to write file format converters, because it's not a glamorous or fun job, and I would say: please, please, people, don't create more file formats, because then we have to deal with them. But as long as your tool speaks one of the standard file formats and you haven't created your own, then we can convert the input files for you, so you don't have to worry about it.

Thank you.

Thank you, Steen. Thank you, Chris. Now we are over time, and I think this was a great discussion. I encourage all our listeners, and future listeners who are viewing this on the web, to get in touch with Chris and try out BioSimSpace. So thank you all for attending the webinar today, and thank you to the presenter. BioExcel will have a summer break in the webinar series, and we will continue again in autumn. So please follow our website for news, and please subscribe to our newsletter, which comes once a month with very interesting links to events, webinars and other noteworthy articles. Thank you all for today, and I wish you a good day. Bye.

Thank you, bye.