Okay, so thank you for coming back for the second session of the day. We're going to talk about tools for trajectory file sharing. Based off this morning's events, plus another discussion, I actually rewrote all of my slides in fairly short order, so if they are slightly rough, please bear with me. But I hope the change in slides and ideas will ultimately be worthwhile.

Of course, if you have not heard of MolSSI, I'm going to take 30 seconds to talk about it real quick. We are an institute designed to enhance the computational molecular sciences community, so very closely related to this community and its aims in many ways. Probably one of the big differences is that we're not just molecular dynamics: we cover quantum mechanics, molecular dynamics, coarse-grained methods, et cetera. So the range that we cover is really quite broad compared to other institutes, and I think you're going to see a lot more of these coming out of the U.S. There is this Software Infrastructure for Sustained Innovation family of grants, which has individual-PI, multi-PI, and institute-level grants within it. In fact, we are one of the first two institute-level grants. The other one is the Science Gateways institute, which is more about how you connect a bench chemist, or education, to something like running a trajectory, so they're much more on the community side. And there are others starting up: there's the SSI in the UK, there's a physics one, and more on the way. So I think you'll see a lot more of these in the future.

We're about two and a half years old at this point, so in some ways we're getting up there in age. We're also a huge collaborative effort across eight different universities. We are physically located in Blacksburg, Virginia, but our board is made up from those universities. There are about 12 of us in total at the central location.
If you're in the United States, we also have about 24 software fellowships for undergraduate and graduate students to look into as well. I'm happy to talk about any of this much more, but I won't take any more time on it right now.

One thing I want to connect back to is this whole file-format discussion, and to try to illuminate a little bit more of what we're talking about with data models. Of course, one of the things that we did was the quantum chemistry schema: how can we do file sharing for quantum chemistry? In comparison, quantum chemistry is brutally easy. Here is an entire input file which is exactly reproducible with any given program and will produce effectively the same thing, so it is about as simple as it gets. But the ideas within it, I hope, we can illuminate.

One thing I really want to drive home is that this is not only JSON. It's presented as JSON, but, for example, within here I might need a 50-megabyte array, and a 50-megabyte array in JSON is basically the worst thing you can ever do in your life. So instead, what you can do is dump this to HDF5. If I define a dictionary like this with, say, a NumPy array in it, I can just say "to HDF5" in my code and it's just there, and I can read it back into C++ or anything else. These kinds of ideas are extremely powerful. And of course a lot of people want to use YAML for these files; just be careful with YAML, as there are things in YAML that are not representable in these other formats.

Just to put it out there: provenance is incredibly powerful. Provenance has a number of tiers. What we found is that simply having the name of the program and the version involved gets you extremely far. It doesn't tell you a ton of data, but it actually gets you 90% there. And of course you can go further down from that, and it depends on the creator.
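To make the point about large arrays concrete, here is a minimal standard-library sketch (my own illustration, not any MolSSI code) comparing the size of a numeric array serialized as JSON text versus raw binary, which is essentially how HDF5 stores arrays:

```python
# Minimal sketch: why multi-megabyte numeric arrays don't belong in JSON.
# A float64 is 8 bytes in binary, but serializes to roughly 17-20 characters
# of decimal text in JSON, before you even pay the text-parsing cost.
import json
import random
import struct

values = [random.random() for _ in range(10_000)]

json_bytes = len(json.dumps(values).encode("utf-8"))         # decimal text
binary_bytes = len(struct.pack(f"{len(values)}d", *values))  # raw float64

print(f"JSON: {json_bytes} bytes, binary: {binary_bytes} bytes")
```

With a library such as h5py, the "just say to HDF5" idea is roughly a one-liner per array, e.g. `f.create_dataset("gradient", data=arr)`, and the resulting file can be read back from C++ or Fortran just as easily.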
But you can also record what kind of hardware the job was run on, how many cores, how much memory, et cetera. Because, again, these are key-value dictionaries, this can be the minimal set, and you can expand it as far as you wish.

[Question: Can I just ask who came up with this? Who agreed on using it?] So, as to this particular effort, we took a group of about 30. We had members from 10 different QM programs, added some visualizers and a couple of MD people, put them into a room over about a two-day span, and said: do this.

The other thing is that you create different schemas, or different building blocks, for the molecule, for the input and output, and also for the trajectory. This goes back to the data blocks and data systems that we're talking about. To illuminate this: if I have a molecule, that's a single logical block. I can compose that into an output result, where I have basically my input and output, which is reproducible, and I have this single block inside of a larger block. Then I can compose it up the tree, so a trajectory now becomes multiple of these results, with the little blocks within them. In fact, in an optimization trajectory you have all the different results, but then you also have the initial and final molecule outside of them; so in the full picture, I'd actually have two more green blocks out here. Thinking about them this way, you get this composable hierarchy without worrying about transcribing it into a file format. If you have small data, that's trivial to do; if you have larger data, there may be translation layers involved. But going from a data model like this down to a file on disk shouldn't be the obstacle, at least from our point of view. So, hopefully, that's a quick illumination of what we're talking about with data models, and hopefully, Chris, I'm not misrepresenting your ideas at all here.
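The composable-block idea can be sketched as plain dictionaries. The field names below are hypothetical, only loosely modeled on the QCSchema style described here, not the actual schema:

```python
# Hypothetical illustration of composable data blocks: a molecule is one
# logical block, a result wraps it with input/output data, and a trajectory
# is just a list of results plus the initial and final molecules.
molecule = {
    "symbols": ["O", "H", "H"],
    "geometry": [[0.0, 0.0, 0.0], [0.0, 0.76, 0.59], [0.0, -0.76, 0.59]],
}

def make_result(mol, energy):
    # Each result reuses the molecule block rather than redefining it, and
    # carries minimal provenance (program name + version gets you ~90% there).
    return {
        "molecule": mol,
        "energy": energy,
        "provenance": {"creator": "myqc", "version": "1.0"},  # hypothetical program
    }

trajectory = {
    "initial_molecule": molecule,
    "frames": [make_result(molecule, -76.02), make_result(molecule, -76.03)],
    "final_molecule": molecule,
}

print(len(trajectory["frames"]))  # 2
```

The point of the composition is that each level only adds its own fields; a tool that understands the molecule block can consume it whether it arrives alone, inside a result, or inside a trajectory.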
So, a very simple diagrammatic overview of how this would actually work. And you can think about it the same way in terms of trajectories and parameter data, and compose it all the way up into a single block.

Okay, so a little more to the point of tools for MD file sharing. I want to say a little about my own work on quantum chemistry sharing. What we did was go out and ask about 10, maybe 12, different database providers (the Materials Project, NOMAD, you name it): hey, how did you start this? What was the actual way that you progressed? Their first comment was that everyone focuses on what kind of data is shared or stored, but no one ever really talks about what you are actually going to do with that data, what the purpose of that data is. And they said flipping it around is how you should always start these sharing tools.

From that, what we stated for quantum chemistry (again, with my biases as a quantum chemist) is that we want to store hundreds of millions of hours of computing, which means hundreds of millions or billions of these tiny kilobyte-scale fragments, because quantum chemistry results can be very small, and we want to do things like force field construction. So we went to the Open Force Field effort and said: what do you need for your force fields? Then we said: what about physical property prediction? So we went to the experimentalists and said: what do you need to extrapolate out? And we went all the way down the line, of course, covering things like machine learning and methodology assessment. From each of these groups, we were able to come up with a set of requirements. A lot of the time we talk about what tools are involved and what's out there right now, but the real question is: what do we actually need? Because you see a lot of these projects out there, but they might have maybe 50 or 100 users.
And that makes sustainability extremely hard, because these things are complicated and expensive. So what are the commonalities that get you 1,000 users or 10,000 users in this particular space? Those are really the things that...

This leads me on to something called communities of practice. This is something I actually stole from Neil at the SSI and extrapolated out a little bit. It's a slide I use in my education work, where I talk about best practices in education, but looking at it, I think it's applicable to this kind of project as well. You always need to scope these things out. If I want to create something like trajectory sharing, that's a huge project just because of the data involved. And if it's a huge project because of the data involved, my target audience size has to be big, the size of my project has to go up, and the activity level has to keep pace with all of those ideas. Which is why I think, when it comes to MD trajectory sharing and file sharing, trying to figure out the commonalities that bring a very large community into it is probably the most important point.

The other takeaway is that there are always going to be multiple communities. You can create a large community, but there's always going to be someone else. For example, if I have some kind of trajectory file sharing, I'm not going to try to subsume the PDB, or vice versa; that seems like a very rough idea. But what are the commonalities between these two projects, and what can we learn from each other? That is going to be very important. So, looking out at the world and trying to figure out, for example, what the existing databases are: well, there are tons of materials databases involved, and there are really great ideas to pull from things like Kaggle. If you haven't seen Kaggle, I really recommend it; it's machine learning data.
It does an extremely good job for a particular use case. Things like DrugBank and more are going to be really important as well. And if I look at all of these, they don't necessarily contain the same data, but what they do is cross-reference each other. What this means is that if I went to, for example, Citrine, and I actually had materials data for a drug, it would link back to DrugBank or similar. This is the very beginning of interoperability between all these different programs: as soon as you have one single repository for this data that serves your specific use case, you reach out and look at how these things connect to other data in your field. So I would encourage you not to try to solve everything at the same time, but to solve a problem large enough to attract a large community while still being possible to do, and then figure out how you can link to all these other projects. Because those linkages are going to be really powerful.

Okay. So, a couple of focused recommendations for this discussion. I'm really trying to avoid the technical challenges at the moment; today there are very few technical challenges we cannot solve, I would say, it just takes money and time. But if you can figure out the ideas and the requirements, and how to build a big community, the rest should come quite easily. I think you should enumerate the tools that are out there; there are a lot of them published already. So that's something to think about. And for each of these, I would not think about them just as solutions; I would ask: what were their goals? What did they set out to solve? Why did they solve it? What kind of community did they build from that? That's probably the most important question you can answer for each one of these. Again, concentrating on the social considerations is, I think, probably the more important thing.
Do something that other people are interested in. How do I do something that's not just useful for myself, but useful for a big enough community that I can actually get funding for it? Personally, the way I usually approach this is to get the 90% case. If you get the 90% case, you've made those people happy. And then pay at least a little attention to whether you can extend out to the 100% case: what are the technical challenges involved? Maybe you cannot solve them today, but can you at least be flexible enough to solve them in the future?

A couple of observations, again: databases cross-reference each other quite frequently, so I wouldn't try to come up with a single solution that does everything. Come up with a solution that does something well, and then cross-reference the people who already do the rest well. I'd also assume that trajectory format transformation is already possible; this is by and large true, I think, with the tools involved, including MDAnalysis, MDTraj, and the others, so it is at least somewhat possible. And I would also assume that hosting is possible: if you come up with a good idea, I think there are certainly ways we can find to actually host the data involved.

So, with these ideas in mind, I hope you will have a very fruitful discussion. And with that, I'll take any questions, if there are any.