Hello everyone, and welcome to the next edition of the BioExcel webinar series. My name is Rossen Apostolov and I will be today's host. With the ongoing COVID-19 pandemic, BioExcel, like many other research organizations, is involved in developing tools for drug design, and we engage in many initiatives to help the global fight against the disease. One of the initiatives we are participating in is a collaboration with the Molecular Sciences Software Institute in the United States to develop a portal where users can share their data from modeling and simulations. In today's webinar, it's our pleasure to introduce you to the hub, show you how it works, and hopefully you will find it very useful and contribute to this community effort. With that, it's my pleasure to introduce Levi Naden, a software scientist at the Molecular Sciences Software Institute in the United States. He is the lead developer of the COVID-19 Molecular Structure and Therapeutics Hub project, which we are introducing today. He has a background in computational free energy algorithms, scientific software development and deployment, and software interoperability, and he holds a PhD in chemical engineering from the University of Virginia. All right. Thank you, Rossen, for the introduction, and thank you, everybody, for coming to the webinar today. As Rossen said, I am Levi Naden, a software scientist at the Molecular Sciences Software Institute, MolSSI, stationed at Virginia Tech in the United States. I am the lead software scientist responsible for MolSSI's response to the COVID-19 pandemic. Today I'm going to be talking about a joint initiative between MolSSI and BioExcel to tackle this problem from a different angle: not doing the research directly, but connecting as many of the people doing the research as possible. So first off, a good question is: what is MolSSI exactly?
We are an NSF-funded institute whose sole purpose is to serve and enhance the software development efforts of the broadly labeled field of computational molecular sciences. This includes quantum chemistry, computational materials science, biomolecular simulations, and basically anything that deals with molecular-level science, from the atomistic scale up toward the bulk phase. And this goes beyond just providing support in the sense of training efforts, even though that's partly what we do. We also provide software expertise and infrastructure resources. That includes supporting software and additional tool development: resources that don't necessarily exist, or wouldn't necessarily be convenient for individual labs or PIs to create alongside their own duties and responsibilities of carrying out actual cutting-edge science. We do provide education and training, for everyone from undergrads all the way up to company-level training on tools like Python, containerization with Docker, and good software development practices. And we are also charged with providing engagement and leadership in the community. This is both in the sense of emerging technologies, on the actual software and hardware side of things, and in the sense of providing guidance on what is the best way, or at least the best practices, to do things. How should you go about carrying out your research in the CMS field, without getting in the way of your specific research, but enabling more efficient, more effective tools? So that is what MolSSI is. And as part of that, when this whole pandemic started, and we all probably had substantially less hair, we were charged by our advisory board to do the following:
Look at this problem and see: can we, the people who are supposed to be engaging and leading the community, come up with a solution, something to help spur research and get people going? There were several interesting converging factors, external to MolSSI and BioExcel, that drove our response. The first is that we are in a very interesting state of research in our field. We all have a common enemy, and everybody can recognize it: this is the SARS-CoV-2 virion that everybody is researching to some degree. The good side of having a common enemy is that the entirety of the scientific community has come together to tackle this problem in a harmonious, cohesive way. Everyone is doing something. Everyone wants to contribute in some way. This is a headline from a New York Times article from about three and a half months ago now. But this is also leading to some interesting problems. There are groups who have been tracking the number of papers that have come out about the COVID-19 pandemic, and this particular snapshot of the recent trend is at 30,000 papers. Just for a reference point: sometime after I started working on this project, I went back to one of my earlier reference points, and the number of papers written has tripled since then, over the course of two or three months. This becomes a problem, because that's a lot of information to process. There's also another interesting problem that I'm sure everyone on this webinar has experienced: a lot of us are stuck at home right now, or stuck in isolation, unable to carry out our normal duties under normal circumstances. But we all have expertise in different fields and aspects which might be helpful in coming up with a solution to our pandemic problem. And we're all looking to help.
So we have a common enemy, a cohesive community trying to solve this problem, a complete overload of information, and a lot of people looking to contribute to that overload of information to hopefully tackle the enemy. There's also been a secondary effect of both the good and the bad here. With almost every single institute, lab, and company, even beyond CMS, trying to tackle the SARS-CoV-2 virus, there have been some interesting side effects. The first is that the editors of the journals all these papers are being published in are completely overwhelmed. This is something that, through our network of contacts at MolSSI, people who are editors at the journals have told us: we cannot keep up with the papers; we can't get reviewers fast enough. And then, with the deluge of information, there's a mixed bag of quality coming out. People are trying to churn models and simulations out so fast to try to help with the problem. It's not that they're doing this in bad faith; they're doing this in good faith. There's just not always time to take the due diligence to make sure everything is right. So for instance, I have a tweet here from Tristan Croll, who is a fantastic modeler in this field and the main developer of ISOLDE, a tool for doing exactly this kind of modeling. He looked at one of these structures and put out a number of statements about its weird biological features. People are saying this is a model of one of the different proteins of SARS-CoV-2, and clearly it doesn't make sense. In this case, he's pointing out what would technically be called a trisulfide, even though that doesn't make any biological sense. The other thing we've been hearing a lot, from collaborators, not necessarily editors, but the actual individual researchers, the labs, our contacts at companies, is that they have data.
They are generating data right now, in real time. We have it. We think it will be helpful to the community. We have no idea how to get people this data at the volume of requests we are receiving. And it's not feasible for every single group or PI to make a point of contact with every other group or PI and get their data to them on a one-on-one basis. That's not going to be efficient, especially with how much data are coming out. Then there's a second problem: even with this mixed bag of quality, even the really good data are nebulous. Nebulous in the sense that they make sense to the people who developed them, and to the people who typically work in those spaces. I've got a couple of different groups here who do fantastic work in structures and modeling for these proteins, and I imagine many people on this particular call actually understand what these sort of mini-languages mean: the PDB structures, the PDB codes, what they mean, what the quality of a given PDB structure is, and how much refinement has been done. But if you're somebody looking at the biological side of things, say the biological responses to drug binding at a more macro scale, this terminology gets confusing. And if you're not an expert in it, you're probably going to have to spend a lot of time dissecting this information to figure out what's actually useful to you. So it's tricky. We have this problem where all these different pockets of data, all the way from the small-scale drug design people up to the full epidemiologists, are all great at organizing their own data. But it's hard to cross the barrier from, say, the base biology, through modeling, through simulations, through drug discovery and screening, to actual possible drug design and then encapsulation and testing.
And so you get to this problem, which just seems to feed on itself, and it's hard to get this communication going. Then you get the question of what in this deluge of information is good and what isn't, and that's something we have to try to work on. And most of the data, unfortunately, only paints a very small picture of what's actually going on in the space. This will probably be review, or seem redundant, to many people on this call, but it's important to highlight the scale of the problem. For the virion that I showed at the beginning, in a lot of cases people are looking at the spike, the little knob here on the end. At the microscopic scale, this is what you see; but at the actual modeling and structure scale, it's a structure like this, made of many different proteins, chains, and residues that together make up the whole spike. Understanding all of that, and how it all fits together, is only part of the picture, because when the virus actually gets into the body and starts causing the infection, the spike binds to a human protein called ACE2. And even that has its own structures, and is in turn embedded in a membrane, which acts as the barrier the virus has to get through to your body, which is what starts causing all the negative effects in response. And I apologize to any people who do this sort of membrane-embedding modeling; this is a very crude, almost cartoonish representation, but it gives you a sense of the scale. And this is just one tiny part of the virus which could be targeted, and one process which could prevent infection or stop the spread. We need to figure out how to parse all of that information.
From this, MolSSI came up with the response we thought was needed: a place for people to gather all this information together, to share it with everybody, and to make sure it is easily conveyed and logically organized. The big hurdle is: how do we get such a diverse field of people to actually share data efficiently, in ways that make sense for what they need? We came up with a few basic principles. The first is that we want simple descriptors of the data. All the very complex, potentially confusing terminology, we tried to strip out and leave to the people with the particular domain expertise. We wanted people submitting data as quickly as possible, even if it's data that hasn't formally gone through a peer review process, and then have the community, through the hub, provide the review themselves, so that we would get a natural sorting of what's the better, higher-quality data this week, or what's been refined since last month. Some of the lower-quality things would be sorted down, not necessarily because they're not helpful, but just because the community is finding other items more beneficial. We also wanted the hub to serve as the centralized point where all the data, or pointers to the data, reside. We didn't necessarily want to have all the data concentrated in one place, especially the very large data files, but we wanted to make sure that we at least pointed to wherever they happen to be in a very cohesive and uniform way. So even if the data are on somebody's private server running out of a box in their house, or on some cloud resource, they're still linked to and referenced in the same way. And then we wanted to make this as accessible to experts in all the domains as we could, so we tried to make things flow logically within the sense of the biological processes involved.
So if you knew what you were looking for, or what you needed in order to do your research, you could get it, do your research, and feed the results back into the hub for the next person along the way. What happened is that we basically had two PIs spark what is now this hub of global cooperation between all the different companies and entities helping to contribute. It started with Rommie Amaro and Adrian Mulholland, who reached out to one of our directors at MolSSI, Teresa Head-Gordon. She came to us at MolSSI and said: this is how we want to tackle this problem; what can you come up with in a very short period of time? That's also when we got another board member, Cecilia Clementi, onto this particular project. With their leadership, a fair amount of work, and a rapid response, we came up with a prototype for the hub. That was three months ago, give or take a few days. And as the person who made the original hub, I can say it was a prototype, a proof of concept. It was not a good design, and I can say that with confidence, as I was the sole person who made it originally. But it still served as a good proving ground for what could be, and what we might be able to evolve from. It was at this point that we reached out to BioExcel, including people like Rossen and Erik Lindahl, who said: we want to join in on this. We were thinking of doing something similar, but since you already have this prototype, let's iterate on it. And so from this prototype, and from contacts through the director members, these other two PIs, and BioExcel, we were able to reach out to a few different individuals and groups who wanted to help contribute.
There was a signed letter that went into JCIM, in which over 100 different PIs and companies signed on to say: we will share our COVID-19 data publicly, and we will help contribute to the hub being developed. Then we had some additional early buy-in from people who were willing to contribute the data they already had. D. E. Shaw Research contributed their very long single trajectories of the main protease of the COVID-19 virus. RIKEN BDR submitted some simulations as well. And Folding@home, who have been running very, very long simulations through their network for COVID-19, agreed to help contribute and curate data and to help build out what the new hub should be, and what it should look like to be generally beneficial for everybody. About one month later, with a whole bunch of elbow grease, and several of the software scientists here at MolSSI actually getting things up and going, we finally managed to bring the hub to what it is now: contributions from several dozen individuals, dozens of research groups specifically plus all the individuals associated with them, a community that spans the US, the EU, Switzerland, Korea, and Japan, and upwards of a dozen organizations and institutes who aren't explicitly doing research at a lab level but want to contribute efforts, resources, and tools. To give you a loose example, here in the bottom corner is the relative access to the hub as tracked through our analytics. In total, this comes out to about 2,500 to 3,000 different points of contact over the course of the last couple of months.
So to show what this hub is, rather than just talking about it, let's take a look at what's actually there and how people interact with the data that everybody has contributed. Let's see here; let me make sure I don't minimize the wrong thing. This is the hub as it stands today. I'll go ahead and zoom in so people can see. It's mainly meant to answer one question: what's the fastest route to the information I need from the landing page? Very briefly, we've organized the data into a few primary categories. First, the basic biology of the whole virus: what are the proteins, what's the virion, and what are some of the different means of trying to disrupt the infection that have been tried or are being approached? Then we have the structural data: what do you need to actually start looking at the virus or the host, or to carry out a simulation? These are things like structures derived from X-ray crystallography. Then there are models: refinements of those structures, through things like loop closures, adding missing residues, and fixing missing information, to get simulation-ready or screening-ready models for whoever happens to need that type of information. Then the simulations, which are long molecular dynamics or Monte Carlo simulations, or even very large virtual screens if you wanted: basically, taking the information before this and creating things which can then be analyzed. And from there we also have the therapeutics: what drugs and drug-like substances do we know about that are being used or considered to treat this virus, what has been done with them, and how does all of that relate back to the structures? What structures and models are associated with them, and what biology and what targeting modalities are they all linked to?
That's really the primary thrust of the hub: I want the information I need as quickly as I can get it. So we tried to organize this in the most intuitive and most linked way possible. For instance, if we look at one of the very popular entries, the spike protein in the biology section: here's the spike with everything. Here are some very basic things you need to know about it. Here are some types of drugs currently being used to treat or approach it. Here's a sampling of some of the structures in the hub that we know reference the spike itself. Here are some of the models we happen to know about, the refined structures ready to go into simulation that people have been working on. And here are a few of the simulations we happen to know about for this particular protein that are referenced in the hub. From any of these, there are links that jump to the next section. So you can say: I've looked at the part I need; here's the next subset of data I need; let's jump there. And it takes you to where the next set of data are. From there you can look and see what's in the simulation, and some very basic physical parameters the simulation was run under. It doesn't go into the specialized detail you might see for some other simulation types, but that's not necessarily what's important for the hub, because we want to keep these descriptors simple. An expert who needs that information can parse it from the description.
That includes going and getting the simulation files, getting the trajectories themselves, and also back-referencing what proteins, structures, and models a simulation is associated with, such that all the data in the hub are connected in a way that makes it easy not only to get around the hub, but to see how all these things relate to one another and how I can get the data I need for my work. All of this is good, but there's a fair question of how we get people to actually contribute. Because nobody in our field, at least I don't think so, really wants to get into website design and writing raw HTML. So we're using a couple of existing tools built for making these sorts of websites, and really, when people contribute, they just need to do things on GitHub, something most of us who work in computational biology are probably very familiar with. I already have it open here. The whole site is generated from, and contributed to through, GitHub itself. For most people, it's a very easy process. You'll have to forgive me if it takes me a second to find things; I am not the most familiar with the new GitHub layout they rolled out yesterday or so. Ultimately, all the information we need, all the content people submit to the site, is contained here in this data directory. If we look at, say, the proteins, these are just simple YAML files, and YAML itself is a human-readable, plain-text key-value format. So if somebody wants to contribute a new protein they want to look at, they just match the schema we've provided, which has these simple descriptors, one of our goals at the onset, and fill in the information necessary.
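To make that concrete, here is a minimal sketch of what one of these YAML entries could look like. The field names here are illustrative, not the hub's actual schema, so check the repository's existing files for the real keys (the PDB codes 6VSB and 6VXX are real spike structures, used only as an example):

```yaml
# Hypothetical protein entry -- field names are illustrative,
# not the hub's actual schema.
name: Spike glycoprotein (S)
organism: SARS-CoV-2
description: >
  Trimeric surface protein that binds the human ACE2 receptor
  and mediates viral entry into the host cell.
pdb_ids:
  - 6VSB
  - 6VXX
contributors:
  - name: Example Contributor
    institution: Example University
```

The point of a flat key-value layout like this is exactly the "simple descriptors" goal above: a contributor copies an existing file, edits a handful of fields, and opens a pull request, with no HTML or site tooling involved.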
Then, when they're ready and want to say, "I have a model I want to contribute," let's pick a model out of the Feig lab for the open reading frame proteins. This is a model from the Feig lab which says: this is the description of the model; it follows the schema; here's exactly where you get it. If you want the plain PDB file, you can get it here. It doesn't necessarily have to have known PDB entries or existing X-ray crystallography structures, but it does reference this protein. Then there's extra information: is there any publication data? Who made this? Where, and to whom, can credit be given? This is part of the approach we wanted to take to make sure data can get to people as quickly as possible, without overloading them with so many possible technicalities and nuances that it becomes a complete overload and nobody contributes. Similarly, all of these data are automatically connected once you submit to the GitHub repository. If somebody came in and made an edit to this file, or added a new model entirely, then because of the way the website constructs itself, the site gets updated within a couple of minutes, and the new data will appear for anybody to start accessing immediately. That's really the big power of the hub we want to convey here. And if you happen to have any publication information, things like that are automatically marked for people to look at. So this is how the hub stands today, and how we want people to interact with it: come in, get the data they need, go do the important research they themselves are experts in, and then feed that back into the hub so the next person can look at what they've done and help continue the cycle. Overall, I'd say it's been fairly successful.
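The "match the schema" step described above lends itself to a simple automated check before an entry goes live. As a sketch, here is the kind of validation a site-build script could run on a contributed entry after parsing its YAML into a dictionary. The required keys and the example entry are hypothetical, not the hub's actual schema:

```python
# Minimal sketch of a schema check on a contributed entry.
# The required keys here are illustrative, not the hub's actual schema.

REQUIRED_KEYS = {"name", "description", "proteins", "contributors"}

def validate_entry(entry):
    """Return a list of problems with a contributed entry (empty list = OK)."""
    problems = []
    for key in sorted(REQUIRED_KEYS - entry.keys()):
        problems.append(f"missing required key: {key}")
    # Every model should reference at least one protein so the site
    # can cross-link it back to the biology section automatically.
    if not entry.get("proteins"):
        problems.append("entry must reference at least one protein")
    return problems

# A hypothetical entry, as it would look after parsing the YAML file.
example = {
    "name": "ORF8 refined model",
    "description": "Refined homology model ready for simulation.",
    "proteins": ["orf8"],
    "contributors": [{"name": "Example Contributor"}],
}

print(validate_entry(example))        # → []
print(validate_entry({"name": "x"}))  # lists the missing keys
```

A check like this keeps the contribution barrier low: a contributor gets a short, specific list of what to fix rather than a broken page, and the cross-linking described above stays intact.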
There have been some interesting hurdles and some interesting revelations as we've been working on the hub. Most importantly, we've also started to bring in external companies who have helped make the hub itself better. The first thing is that we are an official DOI issuer. So for anybody who contributes data to the hub, even if it's unpublished at the moment, we can issue DOIs on demand at no cost to you, because data itself should be citable. That is something I personally have a stake in and will advocate for, and something we should all be advocating for in general: your data should be something that can be cited in addition to the paper it might be associated with. I could spend an entire webinar speaking about that topic alone. The other interesting problem, which I'm sure many people have already figured out, is where we store all of this large information. Sure, some of the base PDB structure files can be quite small; we're talking a few megabytes. Some of the more refined models, say fully atomistic ones with all the waters associated with them, can be upwards of a couple hundred megabytes, depending on how big the model is. And the simulations get into very, very large sizes very quickly. But it turns out there are many companies willing to contribute that type of resource, and one company we have partnered with is Amazon Web Services, AWS, through their Open Data program. They have graciously allocated, at no cost, 100 terabytes of storage for us to upload data to, with the possibility of expansion. Through that, we can ensure the data are not only preserved but globally accessible, in a way that is, say, better than a box sitting in an office somewhere that we hope is well backed up.
As part of this, and of the longer-term goals we hope to accomplish with Amazon as a consequence of this particular hub, we want to look at options for how we store data in the computational molecular sciences field more generally, and correctly, so we don't keep running into the problem we've probably all encountered: you read a paper somewhere, you see somebody has generated fantastic data, and you want to look at those data yourself, or get the original underlying material, say the model somebody used to do AI training. And the best you can do is "contact author for information," and hope they still have it in a well-packaged way to get to you, and hope they don't have to physically mail you a flash drive or something. That's a longer-term goal we're hoping to accomplish with the lessons we've learned from the hub. The other thing is that we have a Zenodo community that has been created. It was set up by Rossen, who introduced me earlier, for everybody who wants to contribute data and link it to the hub. It is my understanding that Zenodo, as part of the COVID-19 research effort, is giving anybody who would like it a substantial amount of storage, on the order of several gigabytes. You will have to forgive any random noises in the background; as many of us working at home can attest, little kids will be little kids. Anyway, we have this Zenodo community set up for people to put their data on, which will also issue DOIs, and which the hub can point to, so you don't have to worry about maintaining the backups yourselves. And lastly, there are a few other potential improvements, not only to the hub but to the field, that we are exploring with a couple of different companies that I can't name yet.
The first is potentially connecting the data through a knowledge graph, and this has a secondary effect which I'll cover in a moment. The idea is that, even though we organized the hub in a way that we felt made intuitive sense to us, and, with the help of some of the researchers from Folding@home, made intuitive sense to other people in the pipeline, each of us still has a fairly narrow training scope. Me, for instance: I come from a background in biomolecular simulation, but I'm not an epidemiologist, and I don't necessarily handle things at that level. A knowledge graph could help connect the data across those scopes. The other thing we're looking at is potential data licensing schemes. As many people have probably seen, when you put data out into the world, you may or may not have thought to license that data. Sometimes the data just exists as a hyperlink, a URL to be downloaded. The question we should be asking here is not one of caution, or security, or sort of job security in this context: we want to make sure that the data we put out there are appropriately shared with whoever wishes to pick them up, and that it's very clear how you are allowed to share them. In a lot of cases it's not clear, and as we get more and more data into this hub, especially data that hasn't been published, we feel it's important to explore this option. So it is something we are exploring, and hopefully something will come of it in the future. One thing I did want to speak about here is some unexpected social insights that have developed over the past few months: things we didn't necessarily expect people to agree to or not agree to, and the way people responded to certain things, have been an interesting experiment we didn't expect we would be conducting.
The first is that these simple descriptors we provided, these YAML schemas, have received a very positive response from the people who have contributed to and accessed the data in the hub. They are very effective at getting people close enough to the exact thing they want that they can then find the specific details they need, using their own expertise, once they've navigated the simple side of things. The second is that, because of their simple key-value nature, it's easy to expand the schemas and iterate on them, and so long as a change isn't too drastic, most people's response was: yes, this makes perfect sense, or, I understood this quickly enough because you made such small changes along the way. So as far as organizing large sets of data goes, especially very large heterogeneous data across many domains, this approach seems to work very well and could be used in the future. The other thing we saw is that one of the original goals when we created the hub was to have people contribute data as quickly as possible, and people were very enthusiastic about contributing. What we saw happen is that people were reluctant to contribute their data, not because they didn't want it out there, or because they felt it would be harmful, or that somebody would pick it up and effectively use their data ahead of them. It was that they wanted to make sure they got it right before they uploaded it. Despite the early enthusiasm, they still wanted to carry out the due diligence to make sure their contribution to the scientific community was as impactful and as correct as it could be when they put it out there. And I say reluctant with an asterisk, because there were exceptions, but it's an interesting social experiment, if you will, and heartening to see that the people in our field are, on the whole, trying to do the correct due diligence to make sure the data we provide are good.
That's very encouraging, and it also provides some interesting insight into how we would do things differently going forward. The final point I wanted to make on the social side is that people really, really enjoyed having a central place to access the data. They did not care where the data were stored, or in what medium, so long as they could access them from a central, common point.

More importantly, can we quantify those accesses, perhaps by serving a middle layer, so that, for instance, clicking a link on the hub would increment some counter we could keep track of? I believe that's a very valid point; at the moment the hub does not do that, and with the technology the hub is built on, I do not know of a good general solution, especially with the heterogeneous storage options. One advantage of the AWS option, the centralized place where we can accept data, is that we can control and, if you will, meter that. I imagine this would be very helpful for many people, both for securing additional funding going forward and for seeing what research is actually being looked at out of this group, and therefore what we should continue to focus our efforts on.

And that brings us to the final question I wanted to ask about the hub: this particular virus, this pandemic, is not going to be the last one. What happens when the next virus comes along, and it's the next generation and it's angry? How do we prepare for that? It doesn't actually have to be a virus or a pandemic; it could be whatever the next big research problem is that brings the community together and says, we all want to tackle this one problem at the same time.
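The metering middle layer described above can be sketched very simply: the hub hands out its own links, and each resolution bumps a counter before redirecting to wherever the data actually live. The dataset identifier and URL below are hypothetical examples.

```python
from collections import Counter

# Sketch of a counting "middle layer" for access metrics: each click on a
# hub link increments a counter, then forwards to the real storage location.
# The dataset id and URL are illustrative, not real hub entries.

access_counts = Counter()

def resolve(dataset_id, registry):
    """Record one access, then return the real storage URL to redirect to."""
    access_counts[dataset_id] += 1
    return registry[dataset_id]

registry = {"spike-md-001": "https://example-bucket.s3.amazonaws.com/spike.tar"}
url = resolve("spike-md-001", registry)
```

A scheme like this only works when the hub sits in front of every download, which is exactly why a centralized acceptance point such as the AWS option makes metering feasible, while data linked directly on heterogeneous external storage cannot be counted.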
So here are a few lessons from the hub that I wanted to write down and convey, and to challenge the people on this call with: do you think these are good ideas? Could we improve on them? How would you consider implementing them for your own research when working with so many other people in the field?

The first is that we should encourage very easy contribution. It should be simple to say, "I have my data," provide just enough information to link them to the broader topics at hand, and then point to them, with all the additional details, metadata, or more detailed descriptions, wherever people can find them. With that, we want to keep the metadata people contribute to describe their work simple, and keep iterations on versions of it very small; but do iterate, that's the important thing. Don't let things stagnate, and as part of that, be receptive to those iterations, whether you're the person maintaining the schema or a person using it, in case somebody says, "I wish to add some things to this." Keep it as simple as possible, but no simpler.
The next one, an interesting thing that came up, is that instead of highlighting the top-rated or most recent contributions, we feel it would actually have been better to highlight and promote the gaps in the data. Many, many researchers were conducting work on the spike protein, and there were far fewer contributions for many of the other proteins, relatively speaking. Yes, there was plenty of research into the main protease and the papain-like protease, and plenty into binding to the host ACE2 receptor, but not much into, say, the open reading frames or some of the other particular proteins of the virus. If any of those can be disrupted, then depending on what it is, you can disrupt the viral infection rate; to what extent, we don't know. We feel that if we had highlighted the gaps instead of the popular things, it might have helped people who wanted to contribute their expertise see where information was missing, and so further broaden the scope and perspective of everything being done so far.
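Surfacing gaps rather than popular entries is mechanically simple once contributions are tagged by target. Here is a small sketch; the target labels and counts are made-up examples, not the hub's real index.

```python
from collections import Counter

# Sketch: surface the least-covered targets instead of the most popular.
# The target labels below are illustrative examples only.

contributions = [
    "spike", "spike", "spike", "spike", "main-protease",
    "main-protease", "papain-like-protease", "ACE2", "ORF8",
]

def coverage_gaps(entries, n=3):
    """Return the n targets with the fewest contributions."""
    counts = Counter(entries)
    return [target for target, _ in sorted(counts.items(), key=lambda kv: kv[1])[:n]]

gaps = coverage_gaps(contributions)
```

The interesting design question is not the counting but the framing: a front page listing `gaps` invites expertise toward under-studied targets, where the same data would otherwise just reinforce attention on the spike protein.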
The other thing I would consider doing is networking the data: relating all the data to each other, as I showed on the hub, but then, more importantly, connecting the people carrying out that research. This is where something like the knowledge graph approach we are in discussions about might come in handy. It's one thing to say these data relate to each other through some means, for instance because this simulation uses this model as its input. It's another to say, "I am a modeler, and people are doing simulation work with my models; can I talk to them to hear what was bad, what was good, what I can improve upon?" That communication line goes both ways. I think that if we can track who is contributing what, and then connect them to the people contributing related data, we could actually speed up and enhance the entire research process and make things better overall.

The other thing I would propose is replacing the concept of review teams, people who review the incoming data, with outreach teams, people who field the data and try to bring in more people, both to the hub and to gather information. Given the complete overload of information coming out of COVID-19 research, it would be good to get as many people as we can communicating through the same system, the hub, with their data, so that we can make sure we're not duplicating effort and, again, find the gaps in what we're doing. And because people have been doing their due diligence and acting in good faith, providing well-verified data and going through their peer review processes before publishing their data or putting them out there, we think review teams would be less beneficial; there's just not as much for them to do, and it would be more helpful overall to instead have outreach
teams.

A very easy one, easy in principle but much harder to actually do, is to secure these large storage places and provide metrics for the downloads. This isn't necessarily something we would want to make public, but it would basically allow researchers to ask who is using their data, either anonymously or otherwise; that might be too complicated or raise too many privacy concerns, so the better question is, "how frequently are my data being used?" That is important information, I feel. Even if its main benefit is simply helping you secure additional funding for your research, it's interesting information that is in high enough demand that we should definitely look into it.

The final item, one that would probably go missed by many, is to make sure you bring in user experience expertise. The hub was built by a few of us software scientists here at MolSSI with very limited user experience and design expertise: visual appearance, natural navigation, layout. There was a lot of trial and error in that context, and also some very limited prior experience doing things like this, but that can always be improved, and there are people outside the scientific fields who specialize in this. If you expect so many people to access the hub and the data within it, you want to make sure it is easy to navigate as well. Yes, we came up with something we feel is simple, gives just enough information, and gets you to the data you need, but there are always improvements to be made.

With that, I'd like to conclude and thank all the people who helped make this possible. First are Sam Ellis and Andrew Abi-Mansour, the other software scientists who have worked on the hub with me. The list in the center is a sampling of the people, companies, and researchers who have contributed their time, resources,
data, or consultation to the hub; all the companies who have spoken with us, provided data or resources, or agreed to provide support in whatever way they can; the NSF for funding MolSSI; the EU Horizon 2020 grants for funding BioExcel; and everyone for listening to this talk. Thank you for your time, and I'm happy to take any questions. If you need instructions on how to ask questions, I have the slide here for you; please use the GoToMeeting question field and we will answer them.

Thank you, Levi, for the very nice presentation, and also for showing the connection between the hub, the data, and the communities, and the impact on how we collaborate. We have several questions here. First, one from Navin: you have mentioned DataCite and Zenodo for data sharing. I know that Zenodo is free and we can host data and code for free, but what about DataCite? Is it free and open source?

I do not know about DataCite's actual physical storage capabilities or what the pricing for that is. We are using DataCite specifically as the authority through which we issue DOIs, and the data do not have to exist on DataCite; we can literally just point to wherever the data are, with sufficient metadata to issue the DOI, and that DOI works pointing anywhere, free of charge. You just have to request that we do the issuing: send us an email, we can chat, get the information we need, issue the DOI, and get it back to you, and it will last for as long as DOIs and DataCite do.

Next we have a question from Muhammad, who is offering to help with reviewing and curating the data. How can he do that?

Right, so there are two ways to contribute data or to help with review. You can either email us directly, and Rossen, I believe you have my
email; yes, if you could send my email to the people who want it after this talk, I'm happy to talk with them afterwards. You can also open an issue on the GitHub, which is probably one of the fastest ways to get in touch with not only me but everybody else who helps contribute to and maintain the hub in general, or you can send an email to the general MolSSI account, info@molssi.org, and we can share that with everybody afterwards. Ultimately you just have to get in touch with us, and we will talk about what you can contribute, how, and what we need, and we'll work something out.

By the way, Levi, your email address is also on the website where we publish the webinar, so users can find it there.

Yes, and that is a direct email to me; I'm happy to answer any questions or field any comments people might have.

Then we have a question from Suyash, more on the scientific side. Suyash is wondering whether the MD data being hosted can be used to understand the binding modes of the spike and ACE2 and to calculate binding free energies.

The short answer is yes, it depends. Some of the molecular dynamics simulations which have been uploaded cover both the spike and its binding modes. However, we, MolSSI and BioExcel, do not generate the data; they are generated by the research groups who contribute them and say, "these are the data we have for this." So it would be good to read through the descriptions. I know there are a few simulations that look at the spike folding modes and the overall spike motions, and in some cases you might be able to use those for binding mode calculations or some sort of docking, depending on how many frames you feel like extracting. But I do not know or remember off the top of my head whether there are any explicit simulations of the spike with, say, a drug-like molecule bound to it, so you'd have to review the
MD data that are there and assess whether there are any such simulations. However, there are many models you might be able to use to set up and run your own simulations for this. We also have a couple of drug libraries, compiled by some researchers and shared through the hub, of small drug-like molecules with a number of conformers and tautomers for each, so you might be able to combine those different data together to set up the simulations and hopefully answer the scientific questions you have.

Thank you, Levi, thank you very much for the presentation. We're looking forward to seeing the hub expand, getting even more data, and becoming even more useful for the communities. All of you who are on the call or watching the recording, please get in touch with us if you would like to contribute data or help with review and curation; we look forward to your contributions. Thank you, everyone, and have a nice day.

Thank you, Rossen, and thank you everyone for attending.