Good to see you all again this morning, and welcome to the first session. We're going to spend it looking at how to grapple with the question of standardizing file formats: how we, as a community of simulation users, simulation tool designers, and simulation tool implementers, can make sure that our users can do the science they need to do. Because, fundamentally, we are here to make sure that we can do good science. My name is Mark Lakeria, and I'll be chairing the session today. I come to you wearing one of my hats from BioExcel, which is the primary sponsor of this project. I lead the work package of BioExcel that deals primarily with software development for a small number of high-impact European applications, chiefly GROMACS and HADDOCK. I've been fortunate to work with the two other contributors to this session, who respectively lead the HADDOCK docking code development effort and the GROMACS development effort. They also wear a number of other hats in this community, and I'll be privileged to learn from their long experience: what they see as the leading issues, and how they think we can best grapple with some of them. So, having heard a little from me to set the stage, we'll throw it to them to share their ideas and expertise with us. Then we'll move into group discussions again, as we did yesterday; you'll see our lovely sailboats here, which we'll use to model the groups. I will pose some ideas, we'll break up into those groups, and then we can have some fruitful discussions bringing together all the disparate ideas and experience that we have here.
Just as a reminder of what we are all here to discuss: as Matthew put it to us last night, we're here to surface the ideas we each carry from our own individual past experiences. We want to find the kernels of best practice that we already have. Sometimes we need to identify what we should create as best practice, because our existing practices are legacy and we need to move as the state of the art moves. We need to highlight the issues that do exist and find new solutions: we currently have things that were perhaps good solutions ten years ago but no longer fit, and we need, as a community, to be prepared to move on from them. And finally, once we have produced these ideas and had a suitably well-rounded discussion, we need to produce a community white paper that we can use, moving forward, to propose funding for projects that actually implement some of the tools or formats we might discuss. Producing that concrete output means our community gets something lasting from our discussion, rather than just a bunch of ideas. My take on trying to evolve standard file formats is that we first need to understand what our community really does need. Many of us already have that kind of knowledge in the back of our heads somewhere; many of us have grown up from being simulation users into tool developers or research PIs in the field. But some of that experience starts getting a bit dated, so we need to remember to go back to the people who are actually living at the coalface and talk to them about their problems, not just the ones we remember from three years ago.
So we need to take the ideas that come from the expertise gathered here and go back out to the other folks in the community, because there are many other stakeholders whose views we need to represent and interact with at a later stage. Successful file formats, as with all kinds of standards, have to grapple with the n+1 standards problem. I'm sure we're all familiar with the classic XKCD: people say, "There's a problem: there are n competing file formats. What we should do is design one true file format that covers everybody's use cases." Soon there are n+1 standards, and two years later the same discussion happens again. One of the challenges is that when you try to design a file format, you can only ever see part of the set of use cases. So one of my chief proposals today is that we need to focus on things like modularity and extensibility of the ways in which we store data, not necessarily on file formats as such, so that we have the ability to cater to future needs. While we have some good file formats out there that have been useful in the past, they clearly do not cover all the possible use cases that we will dream up in the future. We are in the business of doing research; we should not be targeting only "the perfect file format for what I need to consume and do today", but something that also has the possibility of being usable in five to ten years' time. Otherwise we'll be in some other European capital in five to ten years for yet another workshop on how we keep up with the state of the art. I want us to search for solutions that can be extended, and that can be implemented by the tool providers in a way that lets the users just say, "I have this file, this object storage model"; the details of how it lives on disk or in the cloud or elsewhere should not be at the forefront of our users' minds.
We want them thinking about: how do I design and execute a simulation, how do I analyze my results, how do I advance society's understanding of the kinds of properties we're trying to simulate? In building a standard, you need to build a community around that standard. In my day job as a developer of GROMACS, I've been able to observe from afar the evolution of standards, in C++, but particularly OpenMP and MPI. Each of those evolved from a group of people who met together because they realized they had a need. Nobody appointed them as a standards body; they got together and recognized that this would be the thing for their community. And, as Matthew said yesterday, the legitimacy comes from actually doing the thing. If you gather together a group of people and you do something credible, you won't get everything right; none of those standards bodies ever got everything right, and we won't get everything right either. We move forward, recognizing that the perfect is the enemy of the good: we don't have to achieve perfection in our first step. So, as we move towards some sort of standardization process, we need to try to build a community of people who represent the different stakeholders. We need some people who are users; some people who are research PIs; some people who are responsible for maintaining long-term data storage; some toolmakers; some analysis-tool developers. They all need to be at the table for the standard to look credible. That's something we should have in the back of our minds over the next day or two: to get to credible standard file formats, we need a community who are prepared to do the work. We should also make sure that we have a reference implementation in mind.
It does us no good to spend hours here theory-crafting something that would never actually be implemented by the people who are in a position to implement it. We need a design that is valid for storing the data we have now, as well as for how we see things developing. Once we had a sample of the proposed standard available, we would need to make some standard files in that format available to the community. Sometimes we would need to do some up-front work to actually say: hey, community, here is how this part of our work would look architecturally. Would you be able to implement it in your tool? Would your package be able to handle this? Would your grad students be able to use this file format? Does it meet your needs? Does what we're doing need changing before we get to the point of standardization? I would encourage all of us to start by assuming that the file formats we are each most familiar with are unsuitable; we're very likely to have to move on from the current state of the art. And remember that each of us holds only a small part of the picture: you're holding the elephant's leg, or the elephant's trunk, but you don't see the whole elephant. Next, a quick word about what we're not trying to do today. Here are some things that came up in my searching around this topic which we're not talking about, either in this workshop or in this session. We're not worried about how datasets will be citable. We're not worried about how we're going to piece together simulations into larger workflows. We're not worried so much about simulation management: what is the process of designing a simulation, how do I visualize my results, how do I understand what is going on with the complex machinery that is running the simulation, how do I analyze my data?
All of these are things we need to have on our horizon, but we're not going to talk about them directly today, I feel. People may have different ideas, and those can come through in the group discussions; but as session chair, these are the sorts of things I don't feel we should have on the table today, simply to keep the problem manageable. And of course, we are far too small and far too select a group to really cover all of it, and we certainly don't have the time today to make progress in every direction. What we can do is set up the groups and let them proceed as they see fit.

So now I would particularly invite contributions, so that we have some boundary conditions for the subsequent group discussion. Critical to any discussion of file formats: what do we think people's needs actually are? I come from a strong mathematics background, with a little other computational science experience as well, so there may be things that you perceive as needs that are not on my radar. If you see things up here that you think don't belong, or things that are missing, I would very much like to hear from you now, so that we can have this discussion all together as a community. I use this often to describe the calculation I want to do; that comes in many different forms. Do people feel anything is missing here? Yes, Joe.

There's no description of what it is: the chemical matter, the physical thing that you're simulating. You have the model physics, which can be divorced entirely from the actual chemical species being modeled.

Yeah. It's difficult to get back to interpretability if you're missing that information, and there's no standard we have for describing that kind of thing. Yes. In my academic simulations that's often implicit: I have some coordinates and a topology, but it's very non-discoverable what this thing actually is.
So in a classic topology you lose things like formal charges, because you've lost the chemical information about atom types and bonding, about what the formal charges on the system are. Trying to recapitulate that kind of information is in fact extremely difficult, because you can't just take points with charges, and points connected by bonds, and work out what chemical species the source originally was. So when we say there are two aspects we're trying to talk about, it runs all the way from full charge models through to which ionization state the system is in. Yes, it's the full gamut of information that is currently missing.

And the model physics includes the parameters there?

Yes. And we have this question: why are the Coulomb potential and the finite-range cutoff in the same place? Those are two key aspects that are currently bound together in all the packages, but they don't have to be together. One of the classic examples: for the same thing, some packages put things like the shape parameters in the file format, some put them in the control input files, and some put them in both, in great quantities. Again, there's a distinction between what is the thing we're modeling and how we're modeling it, and those are two different things. If you bundle what we're modeling together with how we're modeling it in one place, it becomes very difficult to work with, because the "how" ends up capturing trajectory data, checkpoint data, restarting simulations, all the data about how I'm actually going to do stuff. What you actually want is data which says: this is the molecular system, the chemical system, the "what". Then I'll take those molecules and I'll do docking with them, and then I'll do kinetics,
and the information you need for these different tasks should live with the file that describes the actual chemistry of the molecules.

Well, once you have the trajectory, if you have just the trajectory without how it was obtained, you lose a lot of the interpretability. It's difficult to interpret a trajectory if you do not know what's behind it and how it was produced. You could have two files, one for the "how" and one for the "what", but that has its own challenges: you have to keep those two files together and in sync, be sure you can get from one to the other, and be sure they will not become spread and dispersed.

If I have one file that says "this is what it is", I can have a million trajectories all referring to that one file. But "this is how I did it" is such a huge space; there are a million different ways I could have done it. So trying to merge the trajectory information with "how I did it" effectively jams different things together just because you can. That doesn't mean it belongs in the file format.

So it sounds as if there is not at all a consensus over whether we should have just the data, or also the workflow, in the file format. It sounds like a topic for discussion. Yeah, that would be something we can usefully take up in our groups afterwards. Thank you for the contributions so far. So, yes, there are a number of other things I've commented on up there; is there anything else people want to raise? I put experimental data up there because we really need to reach a regime where we do not just use the force field and the whole description as the way we describe a simulation.

You also have multiple scales. One of the limitations of current file formats is that they don't really cope at all when you have different parts of the system modeled at different
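One way to keep the "what" and the "how" in sync without merging them, as debated above, is to let each "how" document reference the system description by a content hash, so that many trajectories can share one system description. This is only a minimal sketch: the field names, the force-field label, and the document layout are all hypothetical, not any package's actual format.

```python
import hashlib
import json

def canonical_hash(document: dict) -> str:
    """Content hash of a JSON document with keys sorted, so the same
    system description always yields the same identifier."""
    blob = json.dumps(document, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

# The "what": the chemical system, independent of any simulation method.
system = {
    "name": "alanine dipeptide in water",
    "atoms": [{"element": "C", "formal_charge": 0}],  # truncated for brevity
    "bonds": [[0, 1]],
}

# The "how": model physics and protocol, in a separate document that
# references the system by content hash rather than duplicating it.
method = {
    "system_id": canonical_hash(system),
    "force_field": "example-ff-v1",   # hypothetical name
    "coulomb": {"type": "PME"},
    "cutoff_nm": 1.0,
}

# A million trajectories can point at one system document; each records
# which method produced it, so interpretability is never lost.
assert method["system_id"] == canonical_hash(system)
```

The hash makes the link verifiable: if the system document is edited, every method document that referenced the old version becomes detectably stale rather than silently wrong.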
scales, so there isn't a single trajectory format which can handle an atomistic region interfacing with a coarse-grained region interfacing with a mesoscale or CFD system, and people are learning how to create really weird combinations.

Do you want to stay on trajectories? One strong comment: this is something we must have, not just something we could have. We've been through this ten times, and what happens is that everybody agrees it's a great idea, and then: who will provide the funding for this, and the funding to maintain it indefinitely?

So I think it comes down to: what is the file format for? Is this actually for molecular dynamics trajectory data, atomistic? Is this a scope question?

Just to keep the session in shape: this is a very nice and interesting discussion, so what we could do is keep this slide for the group discussion, make a discussion topic out of it, come up with ideas in the groups, put everything together, and keep that as support for the main discussion.

Moving from inputs to outputs: these are critical things not only for the simulation architects but for everyone who ends up with the output. Often, when you're generating thousands of trajectories, you might compute properties on the fly, so that you amortize the cost rather than having to reprocess lots of trajectories. So the ability to store arbitrary observables, or mechanical properties of some sort, is often extremely useful. I'm viewing that as post-processing analysis; whether they go into the same volume or not is something a little different, so I think there's a need to introduce those. Okay. The language you're using here: can you basically bind this to any time-dependent state? Perhaps. Any time-dependent state is cool; sometimes lots of people want forces, for example, and it's very difficult to pin this down without everybody weighing in. So, I've forecast some of my
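The idea of binding the output to "any time-dependent state", with arbitrary per-frame observables such as energies or forces stored alongside the coordinates, could look like this minimal sketch. The class and field names are hypothetical, not taken from any existing trajectory library.

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    time_ps: float
    positions: list                                   # per-atom coordinates
    observables: dict = field(default_factory=dict)   # arbitrary named time-dependent state

@dataclass
class Trajectory:
    frames: list = field(default_factory=list)

    def append(self, frame: Frame) -> None:
        self.frames.append(frame)

    def series(self, name: str) -> list:
        """Pull one observable out as a time series, skipping frames
        that did not record it."""
        return [(f.time_ps, f.observables[name])
                for f in self.frames if name in f.observables]

traj = Trajectory()
traj.append(Frame(0.0, [[0.0, 0.0, 0.0]], {"potential_kJ": -501.2}))
traj.append(Frame(2.0, [[0.1, 0.0, 0.0]]))   # no observables stored this frame
traj.append(Frame(4.0, [[0.2, 0.1, 0.0]], {"potential_kJ": -498.7,
                                           "forces": [[0.0, 1.5, 0.0]]}))
assert traj.series("potential_kJ") == [(0.0, -501.2), (4.0, -498.7)]
```

Because observables are an open dictionary rather than a fixed schema, a format built this way can later carry forces, virials, or on-the-fly analysis results without a revision of the format itself.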
take already. I think we very much need to move towards a container-style approach. Many of our legacy formats were developed over the preceding decades very much in the spirit of: I'm implementing this tool, I need to get some data onto the disk, and there's no existing community or standard, so I just need to do something. We've learned lessons from what they did well and what they did badly, so we should be better informed this time, with a view to not continuing to have these kinds of discussions. We need to work on approaches that are increasingly extensible: we shouldn't need to meet again in five years' time to redefine what our simulations mean; the formats should instead be extensible by then while maintaining compatibility. Models like those in the audio-visual community, where there might be a file format with multiple codecs that describe how to interpret the contents of the individual containers within the file or object store, are the direction I think we need to move in. Even within the GROMACS project, about ten years ago, we were talking about the TNG file format, which is both an object store with numerous kinds of containers and also a codec achieving compression in both time and space, which is extremely efficient. There's no reason why we can't take those codecs and use them in other file formats. We should be thinking very much in these terms, because people are going to come up with different domain-specific compression algorithms if we try to design one true file format without thinking ahead. We also need to think about both human and machine readability for some of our files. Humans obviously aren't going to sit and read a whole molecular dynamics trajectory, but they absolutely need to be able to read the metadata we extract from the file. And we need to make sure we can annotate our files after we have written them, without disturbing the integrity of the
simulation data contained within them. Often we run a bunch of simulations and later find that some are more interesting or valuable than others; we should be able to annotate those afterwards while still preserving things like internal checksums of particular buckets of data. Then, when we come to finish up the project and move our working set of simulation data to the long-term store, we have the ability to verify that we are still storing our original trajectory data where we intend to, and storing the post-processed material where we intend to. We should also be able to describe something about the input that produced it: as Joe was pointing out, what is the simulation description, and how big is the workflow description? That is an open question. We will likely need different solutions for different needs; we are not going to be able to design the one true simulation file format. We have needs for input, needs for output, needs for large amounts of compressed trajectory data, needs for human-readable topologies that you can open in your editor and browse around. So we will need combinations of things that meet our multiple needs. We should also bear in mind that there are many other communities out there that have dealt with similar kinds of challenges. The quantum chemistry people also have to write out large amounts of data, in their case typically things like molecular orbitals and so forth. Astronomy, too, has very long time-scale simulations; we should be checking what data formats they use and how they solve their problems. Bioinformatics likewise has to take in large amounts of input data, and we should learn from them, and from materials science and so on, and of course from the audio-visual community, as we've already said. So let's now throw to our experts: perhaps Eric first, and then we'll go from there.
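To illustrate the annotation idea raised in this session: a container that records a codec tag and a checksum for every block can accept new annotation blocks later without touching the original data blocks at all. This is a toy sketch with hypothetical names, not a proposal for an actual on-disk layout.

```python
import hashlib
import zlib

class Container:
    """Toy container: named blocks, each with a codec tag and a checksum
    recorded at write time, so later annotation cannot silently corrupt data."""

    def __init__(self):
        self.blocks = {}   # name -> (codec, payload, sha256 hex digest)

    def write(self, name: str, payload: bytes, codec: str = "raw") -> None:
        data = zlib.compress(payload) if codec == "zlib" else payload
        self.blocks[name] = (codec, data, hashlib.sha256(data).hexdigest())

    def read(self, name: str) -> bytes:
        codec, data, digest = self.blocks[name]
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError(f"block {name!r} failed its checksum")
        return zlib.decompress(data) if codec == "zlib" else data

    def annotate(self, name: str, note: str) -> None:
        # Annotations live in their own blocks; existing data blocks
        # and their checksums are untouched.
        self.write(f"annotation/{name}", note.encode("utf-8"))

c = Container()
c.write("trajectory/0", b"frame data " * 1000, codec="zlib")
c.annotate("trajectory/0", "interesting: unfolding event near t=40 ns")
assert c.read("trajectory/0") == b"frame data " * 1000
```

The codec tag plays the role of the audio-visual analogy above: the container only names how a block is encoded, so a domain-specific compressor (such as a TNG-style trajectory codec) could be slotted in without changing the container itself.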