First of all, existing tools. Everyone knows that "available upon request" generally isn't. That's going to go under concerns and difficulties, perhaps. OK, so we've got a lot of failed attempts at this that we should learn from. BioSimGrid was an effort by Mark Sansom and Jonathan Essex to do a trajectory database; I remember him talking about it. How many people know of BioSimGrid? OK, yes. So there are a lot of lessons to be put under existing tools. There are tools like the torrents that were referred to earlier by Erik Lindahl; GeneTorrent is used extensively in sharing sequence data now, because you can get parallel file transfers. We have analysis tools, lots of great ones: MDTraj and MDAnalysis, CPPTRAJ, the GROMACS tools, CatDCD, I think that's another one. We have generic data analysis tools: pandas, SciPy, and friends. We have lots of visualization tools, like NGL Viewer, which we've heard about, PyMOL, VMD. We have existing repositories: people have used Figshare, GitHub, OSF, Zenodo, and now we've heard about GPCRmd. And finally, there are some great cheminformatics toolkits that help us perceive what's in different files — that's a problem I'll talk about in a moment: RDKit, the OpenEye toolkits, Open Babel, and CDK, for example.

All right, so those are the existing tools. Now, we didn't have anything on the left side, because we thought we weren't prioritizing things that weren't important, but that's up for debate, obviously. We thought there's the possibility of building either a coordinated or a distributed trajectory repository, and that they have different difficulties or complexities — I'm not sure if you'll agree. We think having a way to share trajectories is important, and we thought that a coordinated repository would probably be more difficult than a distributed one. But again, that's quite up for debate.
Can you describe a little bit more of what's meant here by distributed? — So if you built, like, BioSimGrid, one giant place where everybody stores all of their trajectory data, that would be a central, coordinated trajectory repository. A distributed one would mean maybe everybody runs a server at their site, on local storage that also backs up chunks of other people's data, and they could be networked together, a bit like The Pirate Bay.

So a streaming trajectory library, for getting a subset of trajectory data, would also be of high utility, we thought — a simple, likely Python thing where you could easily grab slices of trajectories from a remote site. Finding the trajectories you might want seems to be a really hard problem. So a trajectory search engine, like Google — because you don't even know what you'd be searching for, exactly. We heard a little bit from GPCRmd about some of the things you could search for, but in general the problem is quite difficult. If you want to compute specific properties, this is also somewhat related, in that you might want to identify trajectories that are suitable for computing things like order parameters or slow degrees of freedom. Visualization tools are, in general, pretty important, but not super difficult to put together. One very easy win that is not difficult but highly important: instead of sharing the trajectories, why don't we just share the initial conditions, so that you can generate the trajectory yourself? That would be something we could easily persuade journal editors to require, and it's sort of a prerequisite. Finally, if you want to figure out what the trajectory you've downloaded from somewhere actually contains, you have to know what biological components are inside it. That's actually not super difficult, but pretty highly important if you want to make sense of it.
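As a rough illustration of what such a streaming trajectory library could look like, here is a minimal Python sketch of fetching a slice of frames via HTTP range requests. The server URL and the fixed-size raw-frame layout are my own assumptions — real formats like XTC are compressed and would need a frame index — but the byte-range idea is the same.

```python
# Sketch: map a frame slice onto an HTTP byte range, assuming a
# hypothetical server exposing raw fixed-size frames (3 float32
# coordinates per atom per frame, optional header).

def frame_byte_range(frame_start, frame_stop, n_atoms, header_bytes=0):
    """Byte range covering frames [frame_start, frame_stop)."""
    frame_size = n_atoms * 3 * 4                      # bytes per frame
    begin = header_bytes + frame_start * frame_size
    end = header_bytes + frame_stop * frame_size - 1  # HTTP Range is inclusive
    return begin, end

def range_header(frame_start, frame_stop, n_atoms, header_bytes=0):
    """Build the HTTP header requesting just those frames."""
    begin, end = frame_byte_range(frame_start, frame_stop, n_atoms, header_bytes)
    return {"Range": f"bytes={begin}-{end}"}

# Usage (hypothetical URL):
# import urllib.request
# req = urllib.request.Request("https://example.org/traj.bin",
#                              headers=range_header(100, 110, n_atoms=5000))
# frames = urllib.request.urlopen(req).read()
```

The point is that the client never touches the parts of the trajectory it doesn't ask for, which is what makes remote slicing cheap.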
Analysis tools, again, also go somewhere around here. In general, it would be really great if we had some sort of wonderful hybrid of MDTraj and MDAnalysis — just one tool that also had streaming data I/O. Maybe that's of intermediate complexity. And finally, as we get more trajectory data, it becomes more and more important not to have to ship the trajectories to you to do any data analysis. It's quite complex to actually analyze the trajectory data in place, where it lives, using compute local to it, but that will probably be important in the future. OK, anyone want to speak to this? Yeah, I'll go ahead.

OK, so I think we have time to do things with the existing tools, so I'll just start from there. We have university infrastructures, very much underutilized at the moment. We also have Zenodo. We have Figshare, GitHub, now MDsrv. And we just got this paper — there's a feature in the tool where you can say: here are these five atoms in this configuration; find every PDB entry that has those five atoms in that configuration. This is the kind of querying that you'll probably want. Now, with these existing repositories, I'll actually highlight one of the problems. I don't know if you've ever tried to search for anything on Figshare or Zenodo — it's really super difficult to find. I actually tried to find my own data with all the keywords that I had put in; it only worked when I put in my full name. And we often don't know what we're actually looking for, right? So unless you know specifically what you're looking for, and the DOI of the dataset, it's a big data dump and you might never find something useful, even though it exists there. So I think searchability is a really big problem. It's important, and I'm not sure how difficult it is — not impossible, but not that easy either.
This searchability is directly linked to metadata, which is how we find things. We have to have a really good description of the data so we can find something. That's also important and not that difficult, but again we need to agree on data models, data descriptions, semantics — how we want to describe our data. That brings us to automation, because we don't want to describe every file by hand and spend an hour trying to characterize what's in each one. I would put it here because it's important, but we can also kind of work around it at the beginning. It's not impossible; it could just be a little bit boring and take some time to implement.

What is also really important is ease of use, if we want to achieve mass adoption. Because if it's really hard to use — if you have to spend hours to share your data — nobody will use it, because at the moment you don't really get rewarded for sharing your data. It's a huge time sink, you don't get rewarded, so nobody will share. Ease of use I don't really know where to put, because that's a complicated thing. I think it's important and maybe a little bit difficult, because it probably includes UX design.

Then the problems we have. One, I guess, is distribution of effort. Jana was telling us about GPCRmd; I also tried to build a data-sharing platform; BioSimGrid was also an attempt to build a data-sharing platform in the past. There were actually many more, but they all failed at some point, because I think the problem is that this is the usual way it works. We say: oh, wouldn't it be wonderful if we could share data? And we say: yes, let's write a grant. So we get a grant — I also got a grant to develop a tool with which you can share data. Then you write a paper, which I have done. And that's it. That's the end of the story.
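To make the "agree on a data model" point concrete, here is a strawman in Python: a flat metadata record with a handful of required fields and a trivial validator. The field names are invented for illustration, not any existing standard.

```python
# Strawman metadata record for a deposited trajectory. Field names are
# hypothetical; the real work is agreeing on this set and its semantics.
REQUIRED = {"system", "force_field", "engine", "temperature_K", "doi"}

def missing_fields(record):
    """Return the sorted list of required fields absent from a record."""
    return sorted(REQUIRED - record.keys())

record = {
    "system": "POPC bilayer",
    "force_field": "CHARMM36",
    "engine": "GROMACS",
    "temperature_K": 310,
}
# missing_fields(record) -> ["doi"], i.e. the deposit is incomplete
```

Even a check this simple gives depositors immediate feedback, which is most of what "ease of use" means in practice.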
You have some piece of software, and then the piece of software gets forgotten and becomes abandonware, and then in a few years' time the cycle repeats — someone else does it in some other country, and it all fails again. This distribution of effort, where everyone is trying to do the same thing and everyone thinks they're doing it better than the person who did it before, is a huge waste of time and money. But how to coordinate people — that's really hard. Technology, though, is not that hard. Technology is actually very important, but it's easy, because it's there. Whether we want centralized or decentralized, everything is out there. It's about putting it together, and putting it together in a way that we will maintain it and actually use it. All the components are there, so we can choose: we can fall back on one centralized repository, or we can have decentralized peer-to-peer. I know for a fact that the technology exists.

Personally, I'm for the centralized solution, because there's something people often forget about sharing data, and that's data ownership — and I'll put licensing in here too. That's also a very important thing, because universities actually own your data, not really you, and you shouldn't put it in Google Drive or Dropbox or anywhere else that's outside the legal jurisdiction of your country and your university. We all break these rules, because at the moment they're not really strictly enforced. But maybe in the future — especially now that we finally start to understand the value of data, and how much money some companies make just by having data — this will probably be a bit more policed. With a distributed solution — yeah? Should it go all the way down there? No, I don't know where to put it, actually, because it is important. Difficulty, I don't know. On the other side, all the funding agencies are pushing big time for open data. Yeah, I don't know where to put it.
Maybe it's a concern, just outside of this box — let's put it somewhere in the middle, like here. So it's important, but it's often neglected, because nobody really likes dealing with lawyers, and so on and so forth. But with a distributed system, essentially every university would have its own repository. The data is stored locally, but what we need is an interface that connects them all. And that's very possible, not very difficult. Then if you switch universities, you don't have to learn how to use someone else's repository — what the hierarchy is and how you should log in. You'd have one interface, and when you move to the next university your data is easy to transfer; at least the interface is the same, and you don't have to spend time learning a new tool that does exactly the same thing.

Which again brings us to interoperability, which is the favorite word. Now that is hard, because it includes standards, and we talked about that in the morning, and we can't agree on anything. The thing is, with interoperability we don't have to aim for one particular standard. We can have many file formats, or whatever we want, as long as they are able to talk to each other, or can be interconverted, or whatever we need — it's about information exchange and making it efficient. But it's hard. So again, going back to distributed things: torrents are an excellent idea. If you don't know it, there's also a really interesting tool called IPFS — the InterPlanetary File System — and the Dat project, which also supports versioning and torrent-like functions, so you get higher bandwidth the more people are sharing, for the most popular files. So that's another one — I would say this is not really difficult, but it could be important.
That's me making the case for distributed solutions. And I would say the hardest part of it all — building tools and infrastructure — is people. I really don't know how to get around it: how to get support for all the people, so that maybe not everyone tries to do their own thing, but we actually pick a few things and see them through, rather than starting 100 things and watching them fail one after another. How to achieve that, how to get funding, and how to make things sustainable and maintainable is a really big problem, and I think we should worry more about that than just saying "we should do this thing" — because enthusiasm is great, but it needs to last a little bit longer as well. Okay, I'm done. Oh yeah, I forgot: sustainability, hosting costs, and data storage providers. — We might have to re-normalize our chart at this rate, because we're crowding everything into one quadrant. Any takers for next up? Do you want to? All right, here you go.

Okay, let's figure it out here. First, the existing tools. We listed GPCRmd, which was just mentioned before, and MDsrv — those are very interesting tools. And then one thing which is quite recent is Google Dataset Search; I saw it a few months ago. It actually indexes our data from Zenodo quite well, so that's something we should keep an eye on. And then, of course, we have the NMRlipids databank. It's basically a database which indexes data that is almost all in Zenodo — so the raw data is in Zenodo, but the NMRlipids databank indexes it.
So it's basically an SQL database, and it indexes trajectories of lipid bilayers — there are roughly 300 trajectories there, and it's ongoing. I would call it an existing tool; it's at nmrlipids.fi if you want to go there. It's trying to tackle some of the searchability issues of Zenodo, and it's also an open collaboration which gives credit to the people who contribute. So it's really trying to solve that problem — I'll say something more about it tomorrow.

Then, related to these existing tools, we discussed what's written here as methods archive and findability, which we've marked as extremely important. Meaning that now we're starting to have databases — we have NMRlipids, we have GPCRmd — but they're both searchable already. They are accessible, but they are not findable, in the sense that people don't realize they are there. And they are not centralized; each is only in a single place. One way to solve this would be a methods archive which is linked to the data — somehow combine these. So that was one thing we thought should be done. This is related to aggregation and indexing: to make it actually searchable across different fields, we have to understand how we index them. Which kind of keywords are we going to use? Are they going to be molecule names, or something else, and which kind of names should we use? We thought we should just start doing this — make a prototype of this metadata database, which would, for example, combine NMRlipids and GPCRmd, and then we would learn how that works. So that's important. Then there's the problem of file size and compression — a way of accessing the database that would be convenient, regardless of access method and file compression.
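A toy version of that kind of SQL index can be sketched in a few lines: metadata rows pointing at trajectories hosted elsewhere. The table and column names, and the placeholder DOI, are invented for illustration — this is not the actual NMRlipids schema.

```python
# Minimal sketch of a metadata index: the database stores descriptions
# and pointers (DOIs), never the trajectory bytes themselves.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE trajectory (
        id            INTEGER PRIMARY KEY,
        doi           TEXT,   -- where the raw data lives (e.g. a Zenodo DOI)
        system        TEXT,   -- e.g. 'POPC bilayer'
        force_field   TEXT,
        temperature_K REAL,
        n_frames      INTEGER
    )""")
conn.execute(
    "INSERT INTO trajectory VALUES (1, '10.5281/zenodo.0000000', "
    "'POPC bilayer', 'CHARMM36', 310.0, 5000)")

# The querying this enables: find pointers to matching trajectories.
rows = conn.execute(
    "SELECT doi FROM trajectory "
    "WHERE system LIKE '%POPC%' AND temperature_K > 300"
).fetchall()
```

The design choice being illustrated: the index stays tiny and cheap to host, because resolving a query yields a DOI, and the download happens from wherever the data actually sits.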
And then we have automatic feature extraction. In NMRlipids we have this already, a little bit. I think it's quite important, but it's not actually that difficult, because we have tools like MDTraj; it's rather easy to write the calls which analyze any kind of data. And I think GPCRmd has that as well. Did you have something?

Yeah — we insisted in our table on the fact that we already have places where we can put the files; we already have databases that host files or list files. The problem is to find these databases, and to query these databases and these files. We have a bunch of trajectories on Zenodo, but only a very specific database annotated them. How do we query this database, and expand the search to the other databases that already exist? We also insisted on how easy it should be to push data there — and right now it's nowhere. That probably starts with automatic feature extraction: a lot of things are in the files already. We should not have to fill in 20 pages of forms if it's already in the TPR file. And also on how interactive it should be: we don't know yet what we need as keywords, but as we search, and as we get used to using that search outlet, we learn what we need to search for, and how to add that to the automatic annotation, or how to manually annotate things better. On the other side, if you spent three months running the simulation, spending one hour filling in a few forms is a small overhead. When people deposit in the PDB, they are forced to do it. — So we have generic automation right in the middle. Is that more like workflow automation, or...? — I guess it depends. It can be metadata automation: once you start a simulation, metadata is automatically extracted.
Maybe it's not a project description, because we don't want to necessarily write a project description for every simulation that's going on. Maybe workflows — I think that's rather general right now; everything else is early automation. Yeah, I'm just trying to organize this so that we have fewer categories. So this is more like metadata automation? — Yeah, I think that would be the first step, and easier to solve than workflows. We also have searchability, metadata, methods archive, and findability — I feel like those are all kind of the same thing. Which direction should we go with this? — I think, within those, there are different things. Findability and searchability are different things: I have a database which is very well searchable, but Eric's opinion was that it's not findable, because he doesn't know about it. Once you're there, you can find the trajectory. — And a database can be searchable — you can query it — but you still can't find useful information. — So perhaps we just take away "searchability", this too-generic term, and then metadata would go underneath this? Together, probably, with the automated extraction of the data. — Well, the fact that it's automatic matters less than the fact that it's there. — But the automation becomes extremely important when it starts to expand. We are expanding to the point where it takes way too much time without automation. If you really want to have a good database, it has to be automatic. — How many submissions per day do you expect? — That's how it goes: if you're prepared to spend a lot of time, you can offer a manual interface, but then you have a backlog.
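As an example of why metadata automation is the easy first step: a GROMACS .mdp file is a plain key = value format, so key run parameters can be pulled out mechanically instead of asking the depositor to retype them. A minimal sketch (which fields matter is of course up for debate):

```python
# Sketch of "metadata for free": parse a GROMACS .mdp file into a dict.
def parse_mdp(text):
    """Parse 'key = value' lines, ignoring ';' comments and blank lines."""
    params = {}
    for line in text.splitlines():
        line = line.split(";")[0].strip()   # strip trailing comments
        if "=" in line:
            key, _, value = line.partition("=")
            params[key.strip()] = value.strip()
    return params

mdp = """
; run control
integrator = md
dt = 0.002        ; 2 fs
nsteps = 50000000 ; 100 ns
ref_t = 310 310
"""
meta = parse_mdp(mdp)
# meta now holds integrator, timestep, step count, reference temperature —
# exactly the things nobody should have to enter into a web form by hand.
```

The same idea applies to TPR files and other engines' inputs; the parsing is harder there, but the principle — the file already knows — is identical.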
But it's also related to uptake, because if it's going to take me an hour to figure out how to do this thing, and no one is forcing me to do it, I'm not going to. — That has happened before: once journals decided that you need a deposition before you can publish, people were forced to spend hours depositing, and they did it. And now it's being streamlined in software, where as much as possible is prepared for you. — We get roughly one new trajectory every week on average, so if you have to handle that by hand, it's quite a lot of work.

Again, I'm playing devil's advocate slightly: the talk of forcing scientists to spend an hour, when we are all very, very busy, worries me a bit. Obviously I was involved in the BioSimGrid project, and that was only 10 years ago, but every single trajectory that would have been stored in BioSimGrid would now be discarded as equilibration. So, in the keep-it-simple-stupid way of working with this, I'd actually say just target the ability to put the input files in — because what is really difficult to generate today is going to be equilibration in five years' time. It's the input files — what was run — that are more important to save than the trajectories and all of those kinds of things.

Yeah, so we have a lot of overlap, but let me try to run through it — you might have to help me find stuff. We had all these — MDsrv, GPCRmd, Zenodo — I guess everybody's aware that they exist at this point. But there were a few initiatives that were, I don't know, more or less abandoned; I guess it's related to the fact that there were grants that at some point stopped. There's this thing called Dynameomics. — The one from Valerie Daggett, or is this a different one? — Yeah, but that's not...
It's basically a massive database that you can search and use and reanalyze. — So who puts their data there? — Valerie Daggett's group. — So should I just not put it under existing tools? It doesn't fit there? — I mean, it's still sharing. — Yeah, it's their own data that they share. Sharing, your own way. — Along the same lines, there's MoDEL, the Molecular Dynamics Extended Library, from Orozco's group. Nobody has mentioned that. — And Elsevier has Mendeley Data, together with a journal; that provides you a DOI. — Okay, so under this.

So those were the existing tools. In terms of features, I think almost everything has already been said, but we might have placed things in different places. We had searchability, which you guys said is too broad. I think we meant: I'm interested in this specific ligand, this specific protein, built with that specific technique — that sort of thing. So that was this, no? Oh, this one actually — like the metadata searchability. Exactly. Then intuitiveness, which I think is ease of use — is that right? Then sustainability. This I think we heard also; we'd put it under concerns. — Oh yeah, right. But it's important: having somebody who's going to keep running and developing this. — So it's not really under implementation difficulty?

Quality control is something that was also mentioned. I think this goes with contributing data in some ways, but it's really about how we trust that the data is of good enough quality for us to use, and that we don't accept just anything. — It's really hard to do quality control. — Yeah, that's very hard. It should be right at the top. But I can't even find "quality" on here — that's what we spent a month arguing about.
It's the kind of thing that gets checked during peer review. — Yeah, when publishing it. — Or you can ask reviewers to... — Are they really happy to do it? You can imagine someone looking at the quality of the data and saying: no, this is crap. — But that's not enough. We were trying to work out at what point — say, a 2.5 Ångström resolution — a crystal structure counts as good. I can publish a trajectory which would look "better". Should I trust that as much as something that has been run with all the bells and whistles turned on? And the quality mark of "it's published" just says that somebody else thought it was okay. Is there a whole quality scheme to identify all the really high-quality simulations of these GPCRs that I can trust? That's almost impossible to answer. — Exactly, that's why... — Actually, you know, with the PDB you have the resolution, and here you can use order parameters: because there are NMR order parameters, you can check how much the simulations actually deviate. — You could only accept trajectory data with replicates. — That's what I'm saying — it's an extremely difficult thing to do. — This is also where standardization comes in, like protocols: if you follow this protocol, then to some extent you can trust the result. I don't think you want people whose job is checking entries — that becomes the PDB, where you have annotators doing nothing but checking entries. But maybe there are things we can agree on, because the PDB has quality measures for structures. So maybe we can say: well, this used such-and-such a thermostat — I think you can trust it.

We had analysis tools. We divided these between advanced and basic, because we thought that basic analysis tools are important but not difficult, and advanced analysis tools are less important and difficult. And analysis tools are over here, so that's nice. Consistent metadata...
I'm not sure what that meant anymore — we've talked a lot about metadata. Is the "automatic" part in there? Visualization was in the middle, exactly here, so no question there. I don't think we thought analysis and visualization were the same, though. We also thought about licensing of the data, which I think goes to the same place — we thought it was important and not that difficult. — That's probably something different, though. Why? — I mean like CC BY: who can use and reuse your data. So, identifying a particular dataset. — But also attribution, right? Who has done this? Getting credit. — That's citations, not credit. It's just for citations. — Wow. — We had interoperability, which is already there. And as concerns: costs, and how decentralized it should be — we talked about that already. Something that hasn't been mentioned, which I was mentioning for GPCRmd, is the types of techniques covered. This is actually really crucial, and we haven't talked much about it: we're moving away from classical MD, more and more using enhanced-sampling schemes. I think that's very important. — How decentralized? Who said that's not a problem? Where did you put it? Technology, no? — Yeah, a little bit higher. It's not that easy — setting up IPFS is not trivial. — This was about our user base: depending on who you're targeting, this whole landscape potentially shifts, and that made it difficult for us to determine what is important and not important. Do you want this to be useful for me sharing with my colleague, or are you trying to bring in a student and share something that's really cool? That's the sort of question. Am I done? — What kind of data would be uploaded? How many replicates — just one representative, or all the simulations you made? Do you strip it down to only certain frames? And that was under concerns?
Exactly, because we would have to discuss all this — we don't have a consensus. And cost? It was mentioned that we shouldn't worry about it. — Well, it's certainly a concern.

So we have quite a few items with significant weight — a very complicated diagram here. What I find really interesting, though, is that this square is the one you usually focus on at the very beginning with these kinds of things, and realistically, basic analysis tools came out as both the simplest and the most important. A question, though: don't we already have basic analysis tools? So we can check this off — great, we have them; we just want at least one of them to keep being supported. — Rather, with the analysis tool, it's about linking things together. The analysis tool needs to be able to communicate to the trajectory file — the trajectory server — what data it needs. Say the analysis tool is going to do an RMSD of a protein backbone. At the moment, the analysis tool pulls the entire trajectory across to the tool, then goes through all the atoms, throws most of them away, and only then feeds the protein into the analysis. We could couple our basic analysis tools to trajectory file formats that can actually have things extracted from them, so you can get subsets of trajectories easily. Then, actually, I think we have a very easy quick win, which is one quick analysis tool we can all agree on: we just put all of our analyses into this one tool.

If you do something like this — and this is more of a question than anything else — how centralized is it going to be? A federated kind of thing, or a central repository? Getting a central repository of metadata is pretty easy. Getting a central repository of actual bytes gets much, much harder.
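The backbone-RMSD point above can be sketched in a few lines: the client sends a selection (here, plain atom indices) and the server returns only those coordinates, instead of shipping the whole frame and letting the client throw most of it away. The function names are illustrative, not an existing API.

```python
# Sketch of server-side atom selection: return only the requested atoms
# from a frame, so the client never downloads what it will discard.
def extract_selection(frame_coords, atom_indices):
    """frame_coords: list of (x, y, z) tuples, one per atom.
    Return only the coordinates of the selected atoms, in order."""
    return [frame_coords[i] for i in atom_indices]

# A toy 5-atom frame; pretend atoms 1 and 3 are the backbone atoms the
# RMSD calculation actually needs.
frame = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0),
         (3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
backbone = extract_selection(frame, [1, 3])
# The client receives 2 atoms instead of 5; for a real solvated system
# the saving is most of the file, since water dominates the atom count.
```

In a real protocol the selection would be expressed as a selection string (as MDAnalysis and MDTraj already do locally) and resolved to indices on the server side.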
So, you know, there's a strong argument for the distributed case. I'm arguing for that sort of idea: we each have a locked-down server that we can use for whatever means, and we share when a publication comes out. Then it's your responsibility — that's your data, and if it's good or bad, it falls on you. It also means that you can share it with the people you collaborate with. — But that also means you need one server where you can search across all the different files, so you need to link them together somewhere. — Yeah, and on that: metadata servers, as we noted here — you're searching the metadata, which is a lot lighter than the whole thing. — Sometimes you have to think about this particular audience compared to a generic audience. Can you imagine everyone spinning up their own server to post their own data? Unless this is mandated by funding agencies... — I would just like to add that every university has a repository. Every library has a repository; it's just that nobody uses them. — Well, they're not going to take your gigabytes of data. — It's changing now; we are trying to change their role as well. I think it could be possible to have servers that are hosted by the university, so you can use that as a solution. Or alternatively, you could just go and buy whatever service you want to host your data with — you should have the choice, I think. — I think the university libraries are always lagging behind what we are doing. — I would disagree. To be honest, who here actually communicates with librarians regularly? Two hands. I give them input when they are designing what they want to do for open data. There are OpenAIRE and things like that.
There are many library-driven, OpenAIRE-style projects for data sharing, but they are not ready to share this kind of data, and this amount of data. — We have a research data storage facility at our university where you can dump data; all researchers can do this, and it can be made publicly visible, because it's a data bucket. What we cannot do is run servers that punch through the university firewall running custom code doing complex searching. So the only way we can have our own "server" is effectively just a data dump. And I think what we are discussing here is that we want to be beyond that — beyond where a lot of the data currently sits — towards something actually searchable, with a proper search interface. So my position on everyone spinning up a server is crystal clear: we definitely wouldn't do that. The only way we'd be allowed to run one is in the cloud; we wouldn't be allowed to run it on university hardware, because it's a hole through the university firewall. And if you run it in the cloud, your data transfer costs are exorbitant, which exactly doesn't solve the problem. — What if there was a single, robust, agreed-upon, secure, well-penetration-tested piece of software that we could all run, under very specific circumstances that most of our security teams would be happy with? — It would cost a ridiculous amount of money to engineer that. Basically — I do secure software, and I would guess few people in this room actually know how to write secure software. It's really hard. And then doing the security analysis, and persuading every single university that that software is secure enough to sit on the firewall edge — that's a huge undertaking. — Someone said this morning that we probably should not care about where the bytes are written. Isn't the point here that we should not care about where the file is hosted?
What we need is a structure that can deal with the data regardless of where it is hosted. So you're saying we should encode it all as YouTube videos and spread them around? The point is that if we have a metadata server, it does not care where the trajectory lives: you have a URL or a DOI to the trajectory, and the hosting becomes a much smaller problem. And if it's agnostic enough about where the files are, they can sit at your university library, on Zenodo, or in the cloud; it does not matter much. I agree, except that your metadata can only do one of two things. Either it encodes just enough information to find the file, but then you can't actually search inside the trajectory, because otherwise you'd have to transfer the trajectory from the university system or Zenodo to the metadata server to do the analysis and find what the user asked for. Or it encodes enough information that you don't need the original file, but by definition that information has the same density and the same size, so once you've got enough information to do rigorous searching inside the file, you've effectively transferred the file to the metadata server. And if you have a pure metadata server, over five years about half your links will be broken.
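To make the hosting-agnostic idea concrete, here is a minimal sketch of what such a metadata record might look like: a resolvable pointer (URL or DOI), a checksum that survives link rot, and a searchable summary that does not require the trajectory itself. All field names, identifiers, and values are illustrative, not a proposed standard.

```python
# Minimal sketch of a hosting-agnostic trajectory metadata record.
# Every field name and value here is illustrative, not a standard.
import json

def make_record(doi, url, sha256, software, n_atoms, length_ns, components):
    """Describe a trajectory by what it is, not by where it lives."""
    return {
        "identifier": {"doi": doi, "url": url},   # resolvable pointers
        "checksum": {"sha256": sha256},           # re-find by content if links rot
        "provenance": {"software": software},     # how it was generated
        "summary": {                              # searchable without the file
            "n_atoms": n_atoms,
            "length_ns": length_ns,
            "components": components,
        },
    }

record = make_record(
    doi="10.5281/zenodo.0000000",               # placeholder DOI
    url="https://example.org/traj/run1.xtc",    # placeholder URL
    sha256="ab12...",                           # placeholder checksum
    software="GROMACS 2023",
    n_atoms=52710,
    length_ns=500.0,
    components=["protein", "water", "NA", "CL"],
)
print(json.dumps(record, indent=2))
```

The point of keeping the summary small is exactly the trade-off raised above: enough metadata to search over, but nowhere near the size of the trajectory itself.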
So it's possible again to have a service which you as a university pay for, spun up for the entire university. It doesn't sit inside the university, so it escapes that whole problem; it's for a special purpose and it can be maintained by the university. I don't think that's hard if you have your IT people taking care of it. That would definitely not get through IT, because anything that needs an IT person to look after it is expensive, so we do our best to get rid of services people run themselves. Buying a service, which I would actually describe as a centralized service, is much easier. It's much easier to just say: OK, let's set up a cloud service, and everyone who wants to deposit their data pays, say, 150 pounds to deposit it. That is so much easier to set up from an organizational point of view: a cloud service that every university could pay into for their own researchers and their own data, and that they don't have to manage themselves. In NMRlipids the databank is now set up so that we basically have a GitHub repository with indexed links to the hosting sites, and when you want to use one of the data sets, a script downloads the data onto the computer you're using. So the actual database is very light; I don't think it even needs a server, it's just a list of links and scripts, and then it doesn't really matter where the data sits. Of course we need some places to put it, and now we have ten of them, but it doesn't matter whether it's ten or more, and I don't think that kind of indexing database needs specific infrastructure. That sounds like a nice description, but also like something for the future. Thinking about actually setting this up: I guess almost everybody here has their own website, and putting links to your simulations there, connected somehow to some database, is not a big issue,
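The "list of links plus a script" databank described here can be sketched in a few lines: a plain index file plus a tiny fetch function. The index entries, URLs, and repository layout below are hypothetical, intended only to show how light such a databank can be.

```python
# Sketch of a "list of links plus a script" databank, in the spirit of the
# NMRlipids design described above. Index entries and URLs are hypothetical.
import json
import pathlib
import urllib.request

INDEX = json.loads("""
[
  {"id": "popc-303K-run1",
   "doi": "10.5281/zenodo.1111111",
   "files": {"topology": "https://example.org/popc/run1.gro",
             "trajectory": "https://example.org/popc/run1.xtc"}}
]
""")

def fetch(entry_id, dest="data"):
    """Download every file of one index entry into a local directory."""
    entry = next(e for e in INDEX if e["id"] == entry_id)
    out = pathlib.Path(dest) / entry_id
    out.mkdir(parents=True, exist_ok=True)
    for _name, url in entry["files"].items():
        urllib.request.urlretrieve(url, out / url.rsplit("/", 1)[-1])
    return out
```

The whole "database" is a version-controlled text file; the hosting sites can change without touching anything but the URLs in the index.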
that's not a big problem. And then you come back to every group having its own little database, at least, instead of waiting ten years for somebody to set up a global thing. It's like having little places where we can nucleate this, and once we have a large enough nucleus you can tie these together. And there's something nobody uses anymore, but: the sitemap. On my server I have these files, at this place, and they describe themselves: I know where these simulations are, this one was run with this software, it has these molecules, it ran this long. Then as a user you find that, and you do your own filtering; the filtering gets decentralized. This is something we could do rather quickly: it would be one file each, not a huge infrastructure. Is it easy for everybody to host files where you just put in a URL and get a tarball back? Because with some of these library listing systems you have to register, then click through, then click a download link, and there's no uniform interface. That's the problem: universities build their own repositories, because that's what they have money for, which they did last year too. I was at an e-research conference in Australia, and there was a really, really sad presentation from Monash University. They had this repository, and they showed statistics about users: in percentages it went up to 25%, but that was like 4 users, and I think 3 retired. And everyone uses Dropbox, which is completely not in line with the university's policies. The problem is that nobody really communicates with the libraries; they build these tools for researchers, who then don't really use them. And this is the story that repeats across all the fields and all the tools, where people think they're solving the problem
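The sitemap idea above amounts to each group hosting one static, self-describing manifest that anyone can harvest and filter client-side. A minimal sketch, with made-up field names and a hypothetical group:

```python
# Sketch of a sitemap-style manifest a group could host as one static file.
# Field names, the group, and the URLs are made up for illustration.
manifest = {
    "group": "Example Lab",
    "simulations": [
        {"url": "https://example.org/sims/run1.tar",  # one tarball per run
         "software": "GROMACS 2023",
         "molecules": ["POPC", "water"],
         "length_ns": 1000},
    ],
}

def matching(manifests, molecule):
    """Client-side, decentralized filtering over harvested manifests."""
    return [s for m in manifests for s in m["simulations"]
            if molecule in s["molecules"]]

hits = matching([manifest], "POPC")
print(len(hits))  # 1
```

The filtering logic lives entirely with the user, so no central search infrastructure is needed; the cost is that every client re-harvests and re-filters the manifests itself.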
but they don't. So, to counter that: the Bristol one has about 3,000 users. We charge users £750 per terabyte, which gives 25-year storage; we actually set the economics up that way. It's been running for 10 years; it was the first university data store in the UK, and we worked with the funding councils to make data stores a requirement. All the researchers use it, because it is the cheapest way to get a DOI and to have long-term, multi-terabyte storage. We host things like the ALSPAC children-of-the-90s study; all of that data goes on it. It's a massive infrastructure to run; it's a 5-6 petabyte data store now, and you can create shared projects. We have high security specifications, so you can run secure projects with NHS data. These things do get built, and universities do use them, but they are so difficult to build and have to be so secure that you can't put a custom server on top of them. It's just not possible, so it becomes a data bucket, and you have to go through complicated links to get to the actual data, because of all the additional infrastructure you have to build to make it work not just for easy data like ours but for human data and tissue data and all the rest. So maybe there would be a market for different types of servers, with really high privacy for things like patient data. But that's the cloud; you are just describing the cloud, and in the cloud you can pay for data stores. I mean, it doesn't really matter whether it's cloud or not; it's how you connect it, and it's the cost. If you're doing terabytes it's relatively easy, but it gets difficult when you have hundreds of terabytes or petabytes. A petabyte of data in the cloud is approximately £1,900 per month. That's a lot, and pulling it back off is quite expensive too. But I think storage is not something we're going to solve here; that's why each institution has to
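The two price points quoted in this exchange can be compared directly. This is just the arithmetic, taking the quoted figures at face value and ignoring everything else (egress charges, hardware refresh, staff):

```python
# Back-of-envelope comparison of the two storage prices quoted above.
inst_price_per_tb = 750.0          # £ per TB, one-off, 25-year institutional store
cloud_price_per_pb_month = 1900.0  # £ per PB per month in the cloud

years = 25
tb = 1000                          # 1 PB expressed in TB

inst_total = inst_price_per_tb * tb                  # one-off cost for 1 PB
cloud_total = cloud_price_per_pb_month * 12 * years  # 25 years of cloud rent for 1 PB

print(inst_total, cloud_total)  # 750000.0 570000.0
```

By these quoted figures the cloud is actually cheaper over 25 years, before egress; the "pulling it off is expensive" caveat raised above is exactly what the comparison leaves out.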
solve, to have its own solution; we're not going to solve that one, and we're not going to start running an MD database. This is why I come back to where the value in the data is: the value is in the analysis. So we need to make it much easier for analysis tools to be pointed at a trajectory file and to actually get analysis out of it, and ideally to strip data out of it live, so I can get just the protein, or just the water. But we also need to say that the trajectory data we generate today doesn't need to be kept for 10 years, because in 5 years' time it will be easy to regenerate, which again changes the cost model you're working with. And the most important things to share for a trajectory are the input files, as John said: how did you generate it, so we can regenerate it. Except that all trajectories are stochastic, so we don't even need to regenerate it perfectly; every time we run we'll get a different trajectory anyway. Having data stored somewhere and an analysis tool is not a problem; it may be an annoyance, but it's not a problem. Recently there was an MDAnalysis workshop, and to share the example files with the people in the workshop they started a very simple project that creates an MDAnalysis Universe from a data file hosted remotely. So at some point it's just having a URL. The difficulty is: how do you know where the file is? That's why I keep going back to IPFS; that's another possible solution, because they are working on content-based addressing. I think it's still very young technology, so it's not yet solving all the problems, but that way you are probably also reducing the problem of broken links, because if you have the hash of the data, you should be able to find the file wherever it is. You don't have to know where it is; you just have to know the hash of the data.
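The content-addressing idea mentioned here can be shown with nothing more than a hash function: identify a file by the digest of its bytes, so the identifier stays valid no matter where the file is mirrored. IPFS uses a more elaborate scheme (chunking, CIDs); this sketch shows only the core idea.

```python
# Sketch of content addressing: identify a blob by the hash of its bytes,
# so the identifier is independent of where the blob is hosted.
import hashlib

def content_id(data: bytes) -> str:
    """Location-independent identifier for a blob of trajectory data."""
    return hashlib.sha256(data).hexdigest()

# Two mirrors holding the same bytes yield the same identifier...
blob = b"fake trajectory bytes"
assert content_id(blob) == content_id(bytes(blob))

# ...so a resolver only has to map hash -> current locations (URLs made up):
locations = {content_id(blob): ["https://mirror-a.example/blob",
                                "https://mirror-b.example/blob"]}
print(content_id(blob)[:12])
```

A broken link then stops being fatal: any surviving mirror of the same bytes answers to the same identifier.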
There's usually a duration attached: the data can be taken down after the guarantee period, which might be 10 years, or for some repositories 5 years, and we've been talking about this for 5 years, so a whole guarantee period has passed since this first came up. And having a link is not enough. If I'm running a workshop and have a link for 20 students to download 20-megabyte files, that's totally fine to share; but if my trajectories are actually 5 terabytes each, then a link is really awful, and we shouldn't be downloading 20 times 5 terabytes. So no: what we should do is move the computing and the analysis to the data. We should move towards building services that let you host MD data such that the analysis tool actually sits with the data, because you want the students to move their analysis to the data, not to move terabytes of data back to the students. We have 6-millisecond data sets for individual proteins; it took a good chunk of our cluster to compute on that, so we can't offer that to everyone, and the cloud is the only place you can do that kind of computation. This is not easy; this is again building an infrastructure like the EBI, and that's not going to happen here without major investment, because now we are talking about trajectories that sit in all kinds of different university repositories, with no compute associated with those locations. Moving the compute to the data, or doing the analysis where the data is, is not realistic for the time being, unless there is a big project that aggregates everything close to some computer center, and that's not going to happen in the short term, I think. If the EBI or one of these big players says yes, we are going towards MD data, then we can build on it, but for now it's only feasible in the cloud. But maybe, as a more radical
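Between "download 5 TB" and "move the whole analysis to the data" there is a cheap middle ground already hinted at earlier in the session: fetch only a slice of the remote file. A minimal sketch using an HTTP Range request, with the URL and byte offsets as placeholders; it assumes the server supports byte-range requests.

```python
# Sketch of partial remote access via an HTTP Range request: fetch one slice
# of a large remote file instead of the whole thing. URL and offsets are
# placeholders; the server must answer with 206 Partial Content.
import urllib.request

def range_header(start: int, end: int) -> dict:
    """Header asking for bytes start..end (inclusive) of the resource."""
    return {"Range": f"bytes={start}-{end}"}

def fetch_slice(url: str, start: int, end: int) -> bytes:
    req = urllib.request.Request(url, headers=range_header(start, end))
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# e.g. grab only the first ~64 kB of a huge trajectory file:
# chunk = fetch_slice("https://example.org/run1.xtc", 0, 65535)
```

Mapping frame numbers to byte offsets still requires knowing the file format's layout, which is where a dedicated streaming trajectory library would earn its keep.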
short-term suggestion: what we would like to have had from the beginning of this day is an ontology of how we did a simulation and why we wanted to do it, and to make that searchable, so that at least I can know that Chris did this simulation on this and that protein, and I can send him an email, or he can send me what he has done. That might be a much more feasible step forward, but it comes back to what we discussed this morning about ontology: how did you do this simulation. OK, we have five minutes left in this session, so I was thinking we could move to two other priorities that are a bit more tractable. The first one is a topology that has both biological and chemical awareness. John, was that you? Again, if you want to know what's in the system: the PDB can tell you what a small molecule might be, or what a covalent adduct might be, and there's also the chemical component dictionary, but these are just a random collection of things. If you just look at the PDB file, or whatever is in the trajectory, you may not know that there is an aromatic ring here, or that this corresponds to a particular ibuprofen molecule; you can't just match it by the elements and the connectivity, because those might be assigned a different topology. So we want something that says: the biopolymer is isoform 1a of this protein, residues 27 through 429; these are buffer molecules, which can be described by SMILES strings, with a mapping onto the atoms. We've been trying to come up with a standard like this for the Open Force Field project, because you need that information to be able to apply force field parameters, especially if you want to match anything in the small-molecule universe. You can at least say: here's what I simulated, and here are the atom indices of the things that are chemically distinct. Does anyone know of something that fits the bill already? One of
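The kind of chemically aware annotation described here can be sketched as a small data structure in which each component carries an atom-mapped SMILES, so atom indices in the file can be tied back to real chemistry. The structure, field names, and the SMILES fragment below are illustrative, not any existing standard.

```python
# Sketch of a chemically aware component annotation. Each small molecule
# carries an atom-mapped SMILES; the :N tags tie SMILES atoms to file
# atom indices. All names and values are illustrative, not a standard.
import re

system = {
    "components": [
        {"kind": "biopolymer",
         "name": "example protein, isoform 1a",   # placeholder identity
         "residues": [27, 429]},                  # residue range in the file
        {"kind": "small_molecule",
         "name": "ibuprofen",
         # mapped SMILES fragment (illustrative): map numbers are atom tags
         "smiles": "[CH3:1][CH:2]([CH3:3])[CH2:4]c1ccccc1",
         "atom_offset": 5000},                    # where its atoms start in the file
    ],
}

def mapped_indices(smiles: str) -> list:
    """Recover the atom-map numbers from a mapped SMILES string."""
    return [int(n) for n in re.findall(r":(\d+)\]", smiles)]

print(mapped_indices(system["components"][1]["smiles"]))  # [1, 2, 3, 4]
```

This is also the SMILES/SMARTS trick mentioned next in the discussion: the mapped string can be rematched against the molecule to recover which atom indices are which, independently of how the file orders them.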
those chemical ontologies developed by Peter Murray-Rust? Peter Murray-Rust, yes, that's CML, but it's chemistry only, so it's not going to cover that. And CML is effectively dead at this point. QCSchema is great for quantum chemistry, but it's not really ontology-based either; actually CML was never ontology-based in that sense. If you want a true ontology, you need to talk to the chemical semantics people in Florida; I don't remember the name, let me do a search. But there are some standards people use, like SMILES: there's a specification for it, you can tag the atom numbers, and that becomes a SMARTS string that you can then rematch to recover the association of which atom indices are which in your file. We're using those things at the moment. Isn't that dependent on the flavor of the software? Going from the molecule to the SMILES is not unique; we're working on a canonical way to do that as well, with Daniel. But the other way around, where you have a description of what's in there and you have the tagged atoms, that is unique. Doesn't it really give you different SMILES for each ordering of the atoms? Yes, because those are not entirely canonical. The other thing that we haven't talked about, also in the highly-important-and-not-so-difficult quadrant, is automated feature extraction; that seems like a goal as well, automated feature and metadata extraction, at least as something to begin with, and whether it can be a tractable problem. Can we just use AI for it? Yes... no. Any last-minute questions? I think we'll have a 15-minute break, and after the break we'll come back and do the next session on streamlining molecular system setup. Alright, thank you everyone.