So Erik is the long-time leader of the GROMACS molecular simulation project. He has a number of other strings to his bow as well, so we'll see what we may hear about. Let's see what he wants to share with us now, and then we can have a chat. HDMI or BT? HDMI — I think they're doing HDMI. Thank you, Mark.

Yes, let's see. I'll try to be brief so we have time for discussion. If you have any short questions, go right ahead and interrupt me, but if we get into long discussions, I think it's better to take them afterwards, for everybody's sake.

The given title of this session was standardization of file formats, and the first thing is that we're not going to standardize file formats, because we made this mistake in BioExcel — you have no idea what standardization involves. We might have some common file formats. Pushing things through a standardization body takes at least a decade, and it's exceptionally difficult. And we actually had some of the reviewers nailing us for that: you haven't standardized file formats. So don't expect to push this through ISO or ECMA; it's not going to be a formal standard, and I don't think we should try. But we might get some de facto standards.

I actually realized that we wrote one of these Ten Simple Rules papers based on some frustration of Arne and me this summer, and I think it will be out in a few weeks or so. I won't repeat it here, but either when it comes out, or in the break, do come and ask me and I can share some of it with you. But since Mark told me we should think of needs, I will adapt some of my points here to the needs we saw.

Most of the needs were based on the frustration that for most of the papers we review or edit in PLOS, it's really difficult — or it's going to be really difficult — for anybody to reproduce the results. The other thing that came up: there are a couple of vendors nowadays that, for competitive reasons — it's a tough business, right — are starting to ship, for instance, force fields with closed, encrypted parameters. And while many of these force fields are great for developing new drugs, the scientist in me worries that we're increasingly starting to close our parameters and everything; at some point this is going to start to impede scientific progress. So we also had a fairly extensive discussion about how we want to handle this in PLOS.

I'm not going to go through all the rules here, but most of them are really about the need, and the ability, for you to use and check my results — because there will be bugs in my results. The questions are how severe those bugs are and whether we can find them sooner rather than later. There is also the whole idea of standing on the shoulders of giants: you should be able to use my results to advance your research and vice versa, and this requires a reasonable amount of sharing. Then of course you have to balance this, because I'm not religious about it; I'm not out to kill commercial companies. It's perfectly fine to make money from software — even though we don't, in a way we do, right? Most of us get grant funding from software. And a large part of this is really reproducibility, reproducibility, reproducibility — that's probably what comes back most in these rules — and then a part is FAIR data repositories.
It's not enough to share your data; it also has to be findable, accessible, interoperable and reusable. If I just dump a pile of data on you, it may be theoretically possible to get the information out, but in most cases you will not be able to do it easily. And at the end: be nice — that's way more important than you think. But rather than spending more time on those rules, I'll elaborate briefly on these slides; most of them mutated over the weekend or last night, based on our discussions.

If we take a step back: why do we even need common file formats? It's very easy to start with the assumption that that is what we need, that that is the goal, and I think it's good to question it. Because if we're going to have joint file formats, it will be difficult to fund them. We can certainly fund the development in one or a few labs, right? But most funding cycles are two, three, maybe up to five years, and if these formats are really going to be standards, that's a 20-year commitment for all of us — and a 20-year commitment, by definition, is not going to be funded. So what are we willing to do, assuming we don't have funding for it?

In my case, egotism comes first. Encouraging reuse of existing data is certainly nice — promoting science, blah, blah, blah. But for me: if I suspect there is a bug, either in my program or some other program, and I can tell the students to take that file and run it through the other program, oh my God, we would save so much time. Or, for whatever reason, there's a new piece of hardware — say I get access to a huge machine, either with OpenCL GPUs or a platform where I have to run on GPUs only. If I could take the same file, once it's well specified, and run it through Amber instead of GROMACS, that would simplify things. Or I need to do docking, or there is some feature my code doesn't support — that might happen — and the question is whether I can just use the other program instead. That would save me time and effort as a user rather than as a developer. There are certainly many of us working with these sorts of high-throughput pipelines. As long as it's a single program, it's usually fine, right? But if I have a pipeline of five, six different steps and suddenly I want to introduce a seventh step... The mere fact that there are easy ways to exchange data would make life simpler for me and my students.

And then we have all these important points about reproducibility. For me, reproducibility is a great way of finding bugs, and it's certainly great if I have to go back to a simulation that we did. I think this is important for all of us, but I'm not entirely sure what the strongest driving force is here.

So, all in all: I'm increasingly convinced we need well-defined formats. We need interoperable formats. It's great if we can start to share these formats — not so much to have the universal standard, but to share the load of maintaining them. But I would argue that a formal standard is not what we should aim for. And in terms of the PLOS discussion: good file formats that include everything might mildly force users to store all the important metadata that just came up in the discussion, simply because you can't write the format without having it. Then, as Mark already mentioned, I would largely separate things into two halves.
I'm an MD person, so I'll use some MD terminology here. When I say trajectory data, I mean large amounts of coordinate data — amounts so large that the storage requirements start to matter. We're not going to store them as clear text, because that would be insane. And in this case we have some new formats; I will share a little of that with you.

I think having some data provenance is important. It's not the same thing as FAIR, but: what program generated this data? Was it Erik Lindahl in 2001? Oh my God, you should be careful — don't trust that student. But also: what specific build generated it? Was that the buggy version of the program? What was the previous step? If I'm simulating this particular ion channel, that's all fine and good, but what PDB file did you start from? Do you really need to know what PDB file you started from? Well, maybe we could do it the opposite way: if I'm browsing around in the PDB, could I find everybody who has ever done docking or simulation based on this PDB structure? It's starting to be pretty important to be able to find everybody who has studied a structure with computational methods.

The compromise here is that efficiency beats readability. This data has to be very highly compressed, and we need to allow for future, even better compression algorithms — compare with video, where new compression algorithms come along all the time as processors get faster. We certainly need lossless compression, but we also need lossy compression to save space. We need very efficient parallel I/O, as we're getting larger and larger machines. And I would argue we need some sort of hashes and digital signatures, partly to make sure that nobody can manipulate the data, but also because, as the amount of storage grows, there will be random errors in the data. It's sad if a random error corrupts your data, but it's much better to know that something is wrong with this file because the hash does not match.

Historically we have focused on storage of full trajectories. But thinking of the large assemblies from the discussions yesterday, for some large systems even single coordinate frames, or conformations, are getting so large that we might need an efficient way of storing those too. That becomes a balance, though. In the old days we had formats that were just raw coordinates, and the problem is that they are literally just raw coordinates: you have no idea what each atom is, and you need to remember to pair the file with a topology. We don't do that in our new formats, because the header describing the system is so small compared to thousands of frames of storage. If you're only storing one frame, that balance is less obvious.
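To make that corruption-detection point concrete, here is a minimal sketch of per-block hashing — this is not the actual TNG or GROMACS scheme, just the idea: store a digest next to every block of frame data so silent bit rot is caught the moment the file is read back.

```python
# Minimal sketch of per-block integrity checking for a trajectory file.
# Hypothetical layout: data blocks in one file, a JSON index with offsets,
# sizes, and SHA-256 digests alongside it.

import hashlib
import json

def write_blocks(path, frame_blocks):
    """Write each block of already-serialized frame data with its SHA-256."""
    index = []
    with open(path, "wb") as f:
        for block in frame_blocks:
            digest = hashlib.sha256(block).hexdigest()
            index.append({"offset": f.tell(), "size": len(block), "sha256": digest})
            f.write(block)
    with open(path + ".index.json", "w") as f:
        json.dump(index, f)

def verify_blocks(path):
    """Re-hash every block; return the indices of blocks that no longer match."""
    with open(path + ".index.json") as f:
        index = json.load(f)
    bad = []
    with open(path, "rb") as f:
        for i, entry in enumerate(index):
            f.seek(entry["offset"])
            block = f.read(entry["size"])
            if hashlib.sha256(block).hexdigest() != entry["sha256"]:
                bad.append(i)
    return bad  # an empty list means every block is intact
```

A single flipped bit anywhere in a block changes its digest. To also guard against deliberate manipulation, rather than just random errors, you would additionally sign the index with a private key (for example Ed25519), which is the digital-signature half of the point above.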
The other half has to do with force fields, options, metadata — everything that is not the raw coordinates. And here I would strongly prefer doing the exact opposite: verbose, detailed formats, with all the units and all choices documented. Together with the coordinate input, this should fully describe the simulation we're doing. Ideally, I would like to be able to combine all of this in a single file that I can transmit over the wire or in an email or something. And if you run that file through two different programs, then apart from statistical fluctuations — random numbers, compilers and everything — this should produce, within the statistical estimate, the same results in two or three different simulation codes.

The challenge here is that you need a very high-performance text parser. The good thing is that people have already solved that problem; any time we try to do this ourselves, we un-solve the problem. So don't write this ourselves — there are a handful of good parsing libraries we can use. On the other hand, this type of data is likely more difficult to write, because it's not just plain n-dimensional arrays of floats. So for this to work, not only do we need good frameworks, I think we will also need to jointly take on creating some sort of reference implementations that are completely free, so any code can just incorporate them, no strings attached.

The problem is that whenever we have these discussions — I think we've had them four times before — the typical pattern is that you have meetings like this, or mailing lists. We had one about 15 years ago on molecular file formats. Everybody was so enthusiastic, and we came up with things like QM/MM formats: how are we going to describe multiple extensions of force fields? How are we going to enable the user to do anything, or handle short versus long formats and everything? And how beautiful it all would look — after four months everybody finally agreed, and it was marvelous. Great. So who is going to do the implementation? And then the entire list went quiet. Let's not make that mistake again. We need to focus on what we must do, not what we can do. What is the minimal thing we can start from? If that is successful, it is going to be easy to build momentum. But don't go for the grand, all-encompassing design right away.

I have a couple of proposals here. When it comes to trajectory data, I think we should think in terms of Matroska containers or QuickTime or something: a framework container that enables you to access, say, just the protein coordinates of frame 445 at whatever time step, while the actual compression is a black box inside it. Just as with QuickTime, right? QuickTime now uses H.265, but you open the files with the same programs and everything. As a user I don't have to care deeply; if I just update my software I can read H.265 frames with it. So there's a container around it, and the internal components can differ. You need lots of portability and everything, but that's fine — it just works for movies, magically. There are fancy modern multi-frame compression algorithms, just as with movies, that we would like to use, and we certainly want fast random access.

In terms of focusing on what we must have, I would advocate focusing on the state. That means coordinates and velocities. Possibly — even if we don't implement it right away — we should allow the number of atoms to change between time steps, for things like dynamic protonation. But everything else can, in theory at least, be recomputed from the state, and if it can be recomputed, I would say leave it out. Actually, the way we defined this, we can store it, but it's not necessarily a must. So — yes, I do have a little bit of time left.
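To illustrate the container idea, here is a toy sketch in the spirit of QuickTime or Matroska — a hypothetical layout of my own, not TNG or any real format: an index maps frame numbers to byte offsets, sizes, and a codec label, so a reader can seek straight to frame 445, while the per-frame payload stays a black box.

```python
# Toy trajectory container: frames are opaque compressed payloads; an index
# at the end of the file gives (offset, size, codec) per frame, and an 8-byte
# trailer records where that index starts. All names here are hypothetical.

import json
import struct
import zlib

MAGIC = b"TRJC"  # made-up 4-byte magic number

def write_container(path, frames, codec="zlib"):
    """frames: list of raw per-frame byte payloads."""
    index = []
    with open(path, "wb") as f:
        f.write(MAGIC)
        for payload in frames:
            data = zlib.compress(payload) if codec == "zlib" else payload
            index.append([f.tell(), len(data), codec])
            f.write(data)
        index_pos = f.tell()
        f.write(json.dumps(index).encode())
        f.write(struct.pack("<Q", index_pos))  # trailer: index offset

def read_frame(path, n):
    """Random access: decode only frame n, ignoring everything else."""
    with open(path, "rb") as f:
        file_size = f.seek(0, 2)
        f.seek(-8, 2)
        (index_pos,) = struct.unpack("<Q", f.read(8))
        f.seek(index_pos)
        index = json.loads(f.read(file_size - 8 - index_pos))
        offset, size, codec = index[n]
        f.seek(offset)
        data = f.read(size)
        return zlib.decompress(data) if codec == "zlib" else data

write_container("traj.trjc", [b"frame-0 coords...", b"frame-1 coords..."])
print(read_frame("traj.trjc", 1))
```

A new compression scheme then just means a new codec label in the index: updated readers decode the new frames, and the container machinery itself never changes — which is essentially how video players pick up new codecs inside the same file format.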
So, I'll show you two or three slides from one project, TNG. We have it in GROMACS, but we're not really pushing it that hard yet.

What we define here is, first, the entire file format in terms of blocks. At the start of a trajectory you can have general information about the entire file. But when we start reading, I need some block describing what the molecule is. For now, at least, it doesn't describe all the parameters and everything, but it should be enough to visualize the molecule without also requiring a separate coordinate file: you want the connectivity, you want the atom names, some basics, maybe charges or so. We can have some constant data: free energy lambdas, etc. And then we have these frame sets. A frame set in this example could be a batch of, say, ten frames, and if I can compress those frames together, I might get better compression than compressing one at a time — just as in video. But then we also need some sort of particle maps to allow the number of particles to change. Again, the point here is not the details but the hierarchical structure. The frames in here carry a label saying what type of compression is used within a particular frame set. And with arbitrary data blocks, you can also store any type of derived or measured data that you want.

What we did with TNG — and again, I'm not advocating that this is the solution, but there are some things that keep coming up. One of the reasons we wanted something compressed is that the formats we use today are pretty much always plain-text formats or fat binary formats, and they become insanely large. TRR is an uncompressed format, just like the CHARMM file formats. NetCDF, which is what Amber uses, is also fairly bad; you can compress it a bit, but the gain is maybe a factor of 30 percent or so. XTC is horrible in some ways, but the reason we love XTC in GROMACS is that it's fairly small. TNG, for normal data used the same way we use XTC, is some 50 percent smaller, and if you store velocity data the advantage over XTC is even bigger. There was even somebody who got in touch with us online a few years ago who came up with an even better format; I think on a test case we should be able to get this down to 40 percent of XTC or so.

Someone asked whether I can comment on the difference between NetCDF and TNG, since NetCDF supports compression. Yes, but it's not enabled by default — at the time we wrote this paper, the default NetCDF-based format used in Amber did not support compression. But can I say something about the merits of a hand-rolled format like TNG versus a very general, well-supported infrastructure like NetCDF? Sure. We actually looked a lot into both NetCDF and HDF5. One of the problems with these is the size of the codebase: HDF5 is pretty much the size of GROMACS itself — a gigantic codebase. And there are problems on new machines, such as the K computer; originally it didn't even build on the K computer. TNG, by comparison, is only around a couple of hundred kilobytes of code, and it's fairly advanced code. Again, I'm not arguing that we should not use HDF5 — in principle I love the idea that it's community-supported. But if you search around, there are quite a few users who are very unhappy with the performance of parallel HDF5 in particular. I'm not saying HDF5 itself is the problem, but HDF5 essentially re-creates a file system within the format, and if you use it the wrong way, you can very easily end up with things that perform poorly. All of these things are trade-offs: you can have either something compact that's optimized for exactly what we need, or generality. Having both might be difficult.
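To show why frame sets help, here is a toy sketch — my own illustration, not TNG's actual algorithm: quantize coordinates to fixed precision (the lossy step, similar in spirit to XTC's scaled integers), delta-encode each frame in the set against the previous one, and hand the result to a generic lossless coder. Consecutive MD frames are highly correlated, so the deltas are small and compress much better. All numbers below are made up.

```python
# Toy illustration of frame-set (multi-frame) compression versus compressing
# each frame independently. Quantization to 0.001 nm is the lossy step;
# delta coding plus zlib is the lossless step.

import zlib
import numpy as np

PRECISION = 1000.0  # store coordinates as integers in units of 0.001 nm

def compress_frameset(frames):
    """frames: list of (n_atoms, 3) float arrays; returns compressed bytes."""
    quantized = [np.round(f * PRECISION).astype(np.int32) for f in frames]
    # First frame stored as-is; the rest as deltas against the previous frame.
    deltas = [quantized[0]] + [b - a for a, b in zip(quantized, quantized[1:])]
    return zlib.compress(np.concatenate(deltas).tobytes(), 9)

def compress_independently(frames):
    """Baseline: quantize and compress each frame on its own."""
    return sum(
        len(zlib.compress(np.round(f * PRECISION).astype(np.int32).tobytes(), 9))
        for f in frames
    )

# Correlated random walk standing in for ten frames of a 10,000-atom system.
rng = np.random.default_rng(0)
frames = [rng.uniform(0.0, 10.0, (10_000, 3))]
for _ in range(9):
    frames.append(frames[-1] + rng.normal(0.0, 0.002, (10_000, 3)))

print("frame set :", len(compress_frameset(frames)), "bytes")
print("per frame :", compress_independently(frames), "bytes")
```

On this synthetic walk the batched version comes out several times smaller than per-frame compression; real trajectories are less friendly, but the same frame-to-frame correlation is exactly what multi-frame codecs exploit.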
When it comes to the parameter data — I need to wrap this up — I would argue for simply using key-value trees. One reason is that, as we've seen even over the last decade, formats keep evolving: there will be a new format ten years from now that nobody has any idea of today. But if your data is a simple key-value tree, that's something you can encode in any format. I think a lot of us have dabbled more or less in XML, and the problem with XML is that it's also one of these standards-by-committee, right? It sounds great, with attributes and complex objects and everything, but that also makes it non-trivial to translate XML into other formats. I would propose going with JSON right now. That is up for debate, but there are very small and very fast implementations.

Metadata for everything would be awesome, but again: focus on what we must do, not what we can do. In principle, I think it is reasonable to have a single JSON file containing an entire simulation starting state, so that, again, I can redo the simulation. The problem is that if that file is going to be everything, it would also have to include the entire force field. Do we really want to encode the force field 10,000 times over? Of course you can have an external link to the force field instead, but then the file is no longer self-contained. The more I think about it: yes, it's better to reproduce the force field 10,000 times, because these files are not going to be gigantic anyway, and you can compress them — most of us don't fill our hard drives with JSON files. Having a clear specification is far more important than this kind of premature optimization.

Sorry, there was one more thing. One argument for why I'm increasingly negative about XML: if you just look at Google Trends, XML keeps going down while JSON keeps going up. So I think the JSON tooling is in general better. But, again, there will be something new in a few years, and that is the other reason why I like JSON: a simple key-value tree is trivial to re-encode in whatever comes next.
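Here is a minimal sketch of what such a self-contained key-value state file might look like. Every key and value below is hypothetical — the point is the shape, not the schema: plain nested dicts and lists, explicit units, and the force field inlined rather than referenced, so the file plus the coordinates fully describes the simulation.

```python
# Hypothetical single-file simulation state as a JSON key-value tree.
# No real code's input format is implied; field names are invented.

import json

state = {
    "format_version": "0.1",
    "provenance": {"program": "example-md", "version": "1.2.3"},
    "integrator": {"type": "md", "timestep": {"value": 2.0, "unit": "fs"}},
    "thermostat": {"type": "v-rescale", "temperature": {"value": 300, "unit": "K"}},
    "force_field": {  # inlined, not an external link, so the file is self-contained
        "name": "toy-ff",
        "atom_types": [
            {"name": "OW", "mass": 15.9994, "charge": -0.834},
            {"name": "HW", "mass": 1.008, "charge": 0.417},
        ],
        "bond_types": [
            {"atoms": ["OW", "HW"], "k": 462750.4,
             "r0": {"value": 0.09572, "unit": "nm"}},
        ],
    },
    "system": {
        "n_atoms": 3,
        "box": {"vectors_nm": [[2.5, 0, 0], [0, 2.5, 0], [0, 0, 2.5]]},
    },
}

with open("simulation.json", "w") as f:
    json.dump(state, f, indent=2)  # human-readable and trivially diffable

# Any language with a JSON parser reads this back into the same tree:
with open("simulation.json") as f:
    assert json.load(f)["force_field"]["name"] == "toy-ff"
```

Note that gzip shrinks this kind of highly repetitive text dramatically, which is why inlining the force field thousands of times is less costly than it first sounds.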
Maybe we should not even think so much about file formats, but about how we are going to share — think more about standards for how I get Joe's data. If I can get Joe's data, I really don't care that much whether it's in Amber format or some other format, because I can likely write a converter, no problem. So imagine an easy way where, with a single command, I could pull all the simulation data behind a DOI down to my local hard drive. And our advantage is that we control, directly or indirectly, a large part of the community; if all the major codes and communities start pushing this, it would happen. The point — we talked a little about this today — is: are there any such standards? It has to be extremely resilient; it should not go down because a server, or a few servers, disappear, so it should be distributed in some way.

I think we should perhaps focus on centrally storing only the metadata, and we have such a technology, developed in Sweden — not by me, but by other people — and it's called The Pirate Bay. And you're laughing. But this is a system that the MPAA and tons of lawyers have not been able to take down; I would argue it is far more resilient than any archive. Legality aside — and I'm not suggesting that we should distribute our force fields through The Pirate Bay, right — it says something about resilience. It is a technique that works: you download from multiple sources, and the more popular a piece of data is, the greater the bandwidth, because it's available in many places. It's trivial for me to pull down Joe's data and use it, and in principle anybody who has a copy can seed it too. So maybe we should think in terms of having a metadata tracker, where the groups' data repositories would also be the sharing peers: if I have Joe's files, suddenly the network knows those are available in Europe too, you can get them from Stockholm, and that adds bandwidth. This is a bit tongue-in-cheek, but arguably, if you look at BitTorrent, there are tons of different video formats in there; the programs are able to read many of the movie formats, but the sharing happens because it's easy to share, not because everybody agreed on one common format for all movies.

I'm done there, but some points to keep the discussion going. As I said, no standard will come from a single lab; we need buy-in from several labs — in particular, from you. What can each of us contribute concretely, and what would we be willing to commit to, not just next month but in the long run? And I think it's important to ask what each of us will get from it. We might have to do a whole lot of this work for free, and if there is no tangible benefit two years from now, we might not be quite as enthusiastic. I want to avoid the situation where we're having yet another workshop in five years, thinking it would be great if we had some common file formats, because we've been there before. I would rather think in terms of compromising to make it feasible than aiming for a perfect file format, and not repeat the mistake we made 15 years ago. Literally: perfect is the enemy of good. And with that, I think we should take our break first and then continue with the discussion.