So, I will talk about decentralized management of digital objects for science, and although I'm giving the talk, what I will present is not solely my own work: you'll be seeing essentially the work of five core developers and a dedicated documentation project team. And given that we've heard how important funding and networking are, I'm happy to say that we're funded and networked, so that should give you an indication of the likelihood of survival for the immediate future.

I'll be talking about a tool called DataLad, and it's for the joint management of data and code through their entire life cycle. That includes version control, getting data from A to B, and so on. It's free and open source software written in Python, MIT licensed, and it has a Python API and a command line interface. Instead of giving a lengthy explanation of what this really means and what the details are, given that this is a developer meeting, I'll anticipate the obvious question: this sounds like it, so why not just use Git? And I can tell you that a DataLad dataset, which is the core data structure that the DataLad software uses to handle everything, actually is a Git repository. So everything you're seeing, feature-wise, sits on top of a very mature, let's call it industry-grade, base of tooling.

There are a few principles we follow in the development of DataLad to make sure that we don't ruin the features and possibilities that a tool like Git gives us. For example, DataLad only recognizes two entities in the world: files, which we all know, and datasets, which are collections of files. There is no other domain-specific specialization in there. We try to minimize the custom procedures and routines that are necessary: if Git provides a feature that solves a given problem, then we use that feature, with the aim that if DataLad, being a somewhat academic development project, happens to die, its users will not unnecessarily suffer from that death but can continue with the mature base still intact. And I think we can agree that if Git ever vanishes, something will be developed that allows us to transition away from it. We also try very hard not to compromise the complete decentralization that Git allows, so there are no mandatory services.

If you've worked with Git and have tried to put lots of data into it, you'll know that Git doesn't handle large files very well. Quick question to the room: who knows about or has heard of git-annex? So, quite a few people. git-annex is the tool that we use in DataLad to manage large files: instead of putting the content itself under version control, we only put information about its location and identity into Git, which makes a lot of things much easier. And git-annex, for those who don't know it but know Git LFS (Large File Storage), is from our perspective, and given the principles I mentioned, the superior alternative, because it's a completely decentralized solution. Everything that Git lets you do with code, git-annex plus Git lets you do with large files, which is very nice. And part of DataLad's purpose here is, for those who know git-annex, to make its actual use transparent to a certain degree.
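Just to make the git-annex idea concrete, here is a minimal sketch of what that looks like at the command line (the file name and path are hypothetical stand-ins):

    git init demo && cd demo
    git annex init "laptop"              # enable git-annex in this repository
    cp /somewhere/brain_scan.nii.gz .    # some large data file
    git annex add brain_scan.nii.gz      # content moves to .git/annex/objects/...
    git commit -m "Add scan"             # Git itself only records a symlink to a content key
    git annex whereis brain_scan.nii.gz  # location tracking: which remotes hold the content

The committed symlink encodes the file's identity as a checksum-based key, while git-annex separately records on which remotes the actual content is available.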
So when I say that we are basically using Git and git-annex, and that everything we want to do can be done with them, why is there a need for dedicated software sitting on top of them? There is a single most important reason — there are others, but this one matters most — and that is that a single repository is typically not enough for many of the workflows we need, particularly in science. There are technical reasons for that: for example, if you try to put a million files into a Git repository, or try to do 100,000 commits, you will quickly hit the limits of the system. And there are application-level issues: for example, if you have different bits of information that target different audiences with different access permissions, it gets very complicated, even on a purely technical level, to make that happen if you're stuck with a single repository. There are other reasons; I could go on about this for quite a while.

So in science we essentially always face what I would call modular data management problems, in particular in collaborative environments. If we look at a typical science workflow, we're talking about combining individual pieces that have been developed in the past into something that produces novel results, which will then be published in one way or another. In more abstract terms, we're talking about the aggregation of works across time and across different collaborators or contributors. Mapping these onto dedicated repositories feels natural: you develop a library, you have a source code repository for it; you have a data catalog, you have some sort of dataset for that. These modular environments are natural to what we do.

What I want to show you now is a lot of code that just runs through and gives you a more or less visceral demonstration of how it feels to do this with Git and git-annex natively. I'll just click the play button; it's not necessary that you read all the lines. What's happening here is basically assembling, in Git usage, the picture we've just seen. There's a student who created a repository for code. There's another student who collected some data and put it into another repository. And now there's a postdoc who creates a third repository that uses Git's submodule mechanism to combine these repositories, and that will track analysis results derived from the combination of the data, or from the application of some novel analysis algorithm to these data. So that could be a situation that has happened in the past in some lab.

Now the PI comes along after a while and writes up a paper about this. How does it look if we have another Git repository that contains the manuscript and comprehensively tracks all the inputs? We start with the repository getting created, and then we put in the entire study repository as a git-annex-infused submodule. The PI needs to remember that this one has to be initialized, so they look at which branches are available and then init it. At this point we are ready, so we can use Git again to assemble the entire working tree, which clones all the repositories. We end up with a nice project-based directory that contains the data, the code, everything that's necessary — really nice for reproducibility, excellent.
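For reference, the assembly we just watched, done with raw Git, looks roughly like this (the repository URLs are hypothetical stand-ins):

    # the students' repositories already exist somewhere
    git init study && cd study
    git submodule add https://example.org/lab/code.git code
    git submodule add https://example.org/lab/data.git data
    git commit -m "Combine code and data for analysis"
    # later, the PI's manuscript repository wraps the study the same way
    cd .. && git init paper && cd paper
    git submodule add https://example.org/lab/study.git study
    git submodule update --init --recursive   # must be remembered for every nesting level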
But the PI finds a bug in the code. It happens, right? So they try to apply the fix in the code repository themselves, because the postdoc is not there to do it for them, and go right in there — it's a decentralized system, it should be no problem to just do it in any random clone. Then they try to just git add the code, and the problems start. Git does not make the use of the submodule mechanism particularly easy. We have to remember that if we want to commit something to the code repository, we have to find the code repository, go into it, and commit there. But because it's a submodule, we discover that Git uses what's called a detached HEAD, and if we commit onto that, we will only find out later that we cannot simply publish the commit back to the original repository. It all becomes a nightmare. In the end we have to go up the hierarchy and commit all the changes at every level in order to have an orderly, clean repository at the very end. Of course this can all be done, but it's very complicated, and it tends to be used as an argument for why we wouldn't want this kind of machinery in typical science workflows.

And then there are little bits and pieces. We can use git-annex to obtain the data; that's what it's written for. But again, git-annex is focused on a single repository, so it won't give us the data we need unless we go down into the repository that actually contains the data, which knows where to get the content from and can fulfill the file handle. Then we have the data, the situation we wanted; everything is good. And at the end of the project the problems are still not over, because we can't just use a system call to remove the project: git-annex uses write protection to guard against data loss, which turns out to require the knowledge to give ourselves the permissions back first. You can imagine how that feels for a person who doesn't do this full time. All this information is lost all the time and needs to be rediscovered all the time; it's a big hassle and causes complications.

So for the rest of the demonstration we'll just do the PI's part again, using DataLad, so you can see how it's different. We use the exact same pieces produced by the students and the postdoc as the building blocks for the paper, but now with DataLad commands only. We can create a dataset — which is to say, create a Git repository — simply by using the create command; that's all you need to remember. We can clone any other dataset into any place in any other dataset and make it a subdataset; that uses the submodule mechanism, but we never see it, because all the necessary setup is done internally. We can ask for any piece in this super-dataset simply by using the get command, which automatically obtains all the intermediate repositories that might be necessary to fulfill a request for a particular data or code file. It all happens automatically; there is no need for specific re-initialization. And the great thing is that when we actually work with these hierarchies, DataLad makes this hierarchy of nested repositories feel like a single monorepo. We can do things like status requests, and it will tell us not only that the top-level subdataset or submodule is modified, but give an actual indication of the situation all the way down. And because it can do that, it can also apply modifications really quickly: we can just say, save me this entire thing, and it will make sure that each change is committed to the repository that contains the change.
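The equivalent of the PI's part of the demo, expressed as DataLad commands, might look like this (the URL and file paths are hypothetical stand-ins):

    datalad create paper && cd paper
    datalad clone -d . https://example.org/lab/study.git study   # registered as a subdataset
    datalad get study/data/samples.tsv    # installs intermediate subdatasets, fetches content
    datalad status -r                     # one status report across the whole hierarchy
    datalad save -r -m "Fix bug in analysis code"   # commit lands in the repository that changed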
And all the repositories further up the hierarchy know that there was a change caused by this, and the commit messages are applied all the way up, so we have a clean hierarchy at the very top. That is my idea of a demonstration of why DataLad is particularly suited — more suited than Git and git-annex on their own — for these modular workflows.

If we continue this line of thought, the situation is that we can map essentially all of those modular problems — where data comes from different entities, is maintained at a different pace, evolves at a different pace, has different access permissions, et cetera, and yet we still want to combine it to produce scientific results in an identifiable and accountable way — onto these technological pieces. And you can use the same idea in many contexts; it doesn't have to be science. It could also be, for example, a continuous integration system, where you have a repository of data — maybe the output of a scientific study — that you want to use as a dependency of a software package under continuous integration. The subdataset doesn't need to be polluted with the knowledge that it's now being used inside a continuous integration system; you can flexibly reuse components of data or code in whatever system you're trying to build.

And the point I want to make is that this model actually scales quite far. If you go to a certain page on GitHub, you can actually datalad-clone a repository that tracks 50 million files in 80 terabytes of data, where the actual subdatasets — four and a half thousand of them — don't even live on GitHub, nor does the data. For DataLad it just feels like: clone this thing, get me that file, and it figures everything out internally, given that you have the access permissions; the README tells you how to get those. So it can extend quite far into the complexity that we often face in scientific scenarios.
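In terms of commands, consuming such a large nested dataset is no different from the small example above; a sketch, with a placeholder URL since the repository page is not named here:

    datalad clone https://github.com/<org>/<superdataset> big && cd big
    datalad get -n some/deeply/nested/subdataset        # install subdatasets, no data yet
    datalad get some/deeply/nested/subdataset/file.dat  # fetch just this one file's content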
Another key bit that is missing in most version control systems is that we typically don't know what the cause of a change was. If we modify a file in some code base, we have to be disciplined and amend the change with an appropriate commit message in order to transmit its cause to posterity. When we run tools in scientific data processing, that step is typically not captured: we have input data, then some script is applied with some parameters to produce output data, but how exactly the script was called, and with which parameterization, is not necessarily recorded. That is a huge problem for the reproducibility of scientific results, because at the point where you want to reproduce something, the person who did it is usually no longer accessible and everything is a big question mark. In DataLad we can use the machinery I've just shown you — the ability to analyze and commit arbitrarily complex trees of nested modular units — to very simply capture pretty much arbitrary modifications of any dataset.

The only thing we need to do — DataLad provides a command called run — is wrap the execution of an arbitrary command line with DataLad. DataLad does nothing but check that the working tree, however complex, is clean; run the tool; check what the modifications are; and then record those modifications in a commit whose message carries the information about which tool was run and in which way — just like we would do for a manual edit in a code base.

Now I hear some of you saying: if you can compute it on one system, that doesn't mean you can compute it on another system, so lots of information is still missing. But we all know there is technology like containers, and lots of scientific institutions are capable of running things like Singularity. A Singularity container is just a file, and this system is built for tracking files of arbitrary size. So you just put your container in there, it's tracked like any other piece that you use, and DataLad provides a dedicated extension to make use of those containers. You run the same kind of execution and DataLad performs the exact same steps — make sure the tree is clean, run, record the change, amend the commit message — but the execution actually runs inside the container that the dataset also contains. At that point you have pretty much achieved comprehensive provenance tracking. You still don't know what exactly happened inside that execution, but you know which execution it was, and you could do forensics later on, which is much more than is normally possible.
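A sketch of both commands (the script name, paths, and container image URL are hypothetical stand-ins; containers-run comes from the datalad-container extension):

    datalad run -m "Compute summary statistics" \
        --input data/raw/samples.tsv --output results/summary.tsv \
        "python code/summarize.py data/raw/samples.tsv results/summary.tsv"
    datalad rerun        # re-execute the command recorded in the last run commit

    # same execution, but inside a container image tracked by the dataset
    datalad containers-add analysis-env --url https://example.org/images/analysis-env.sif
    datalad containers-run -n analysis-env -m "Summary stats, containerized" \
        --input data/raw/samples.tsv --output results/summary.tsv \
        "python code/summarize.py data/raw/samples.tsv results/summary.tsv"

The run record lands in the commit message, so a later reader (or a rerun) can see exactly how the outputs were produced.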
The last point I want to make in the remaining minutes: we've heard repeatedly about the necessity of metadata, and especially as we move to heterogeneous, large data collections — stuff that we cannot humanly process — how we deal with metadata becomes an issue. How to describe data has been an active field of research for decades, and it has the problem of not really delivering results great enough to justify all the effort that has gone into it. That is because the standards change all the time, and we need to automate these matters somehow, so that we can track the developments coming out of metadata-description research and apply them to the vast amounts of data being processed and generated. In DataLad — just to register the thought — there is the idea that metadata is programmatically extracted from data, and that multiple representations of the same information can live in parallel, next to each other. So we can use a metadata standard that exists today and transition to a new one once it's available, with that transition being machine-aided, so it can be done with justifiable effort. DataLad's purpose here is not to be a comprehensive description engine; description is the duty of the specific domains of application, which know how their data needs to be described. DataLad's purpose is that you can provide a little script that tells it, here is the metadata I learned about this dataset, and it takes care of handling it: storing it in a standardized format, and being able to detach it from a dataset so it can be put into databases to enable queries, search, and so on.

To summarize, DataLad is a tool that makes the combination of Git and git-annex for decentralized data management much more convenient and simpler for many scientific workflows, where we are not necessarily interested in making every scientist a software developer and cannot afford to. It aids provenance capture, and discoverability through its support for metadata. And it can do much more than I was able to present today: if you want to find out more, go to handbook.datalad.org for a fairly comprehensive view of what it can do, from the basics of why version control is important for science and other applications, to specialized use cases that shed light on problems like how to construct a scalable data store or how to write a reproducible paper using this type of tooling. And with that, I thank you. We're also hiring, not just in Germany but also in the US, so whatever you prefer, please get in touch. Thank you.

Thank you. That was not a question, but I would like to repeat it: he has talked to the core developers for years but, until this talk, never understood what it all meant. Question one was whether we employ the W3C PROV standard for provenance capture, and the answer has two stages. One: DataLad doesn't care what you feed it; it just accepts a structured report of whatever structure. If that report is crappy, the result will be crappy, but it will be managed; if it's good — like W3C PROV, which DataLad's own metadata extractors use internally — then it will be more usable. But there is no requirement that your first attempt at metadata be perfect, standardized, or signed off by some entity; you can do whatever you think is useful. The second question was whether we push containers to some kind of archive or catalog using a standardized mechanism. There's nothing in DataLad that does that, but again, for us DataLad knows files and datasets, and a container is a file. DataLad has an extension mechanism that would make this very simple — there's a template you can clone from GitHub — for adding subcommands that would, for example, list all the containers known in a dataset, give them unique identifiers, and push them somewhere. But there's no built-in support in DataLad core for that.
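As a side note, the listing half of that is, if I recall correctly, already shipped by the datalad-container extension mentioned earlier; the pushing half is exactly the kind of thing a custom extension built from the template could add:

    datalad containers-list    # enumerate container images registered in this dataset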
A very fundamental question. The question was — I'll paraphrase it, if you'll excuse me — whether Git's way of identifying information is sufficient for scientific data, and whether we should switch to something based on a different theory. Is that fair? I can tell you, I've never thought about that, because this whole project is about practicality. We are sitting on top of Git not because we like its theoretical implications, but because it's a widely adopted platform whose limitations, for the work we've been doing, we have yet to discover. And likewise for git-annex: it interfaces with the storage infrastructure that the planet has right now. That's why we're doing it. If this turns out to be a problem, I'm sure it turns out to be a problem not within the scope of DataLad, but within the scope of tracking or identifying information in general, and then I'm sure we'll jump on the train wherever it goes, when it goes. Thank you.

In DataLad, we don't even make that decision: it's git-annex that is responsible for data transport. git-annex, for example, can use torrents for data transport; it can use IPFS; it can use all kinds of stuff. If your data is on some Amazon drive, you don't care — it can use that too. So you are not locked into a decision for one particular technology, because in our experience there is no single technology that serves all use cases. That is git-annex's domain, and it does it really well. Others make different choices there and commit to a more stringent set of technologies, for reasons that are valid; you can do similar things with DataLad plus its foundations, so there is no direct conflict of any sort. Okay, sorry.
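To make the transport point concrete, here is a minimal sketch of registering an S3 bucket as a git-annex special remote (the bucket and file names are hypothetical stand-ins; credentials are expected in the environment):

    export AWS_ACCESS_KEY_ID=...        # git-annex reads S3 credentials from here
    export AWS_SECRET_ACCESS_KEY=...
    git annex initremote lab-s3 type=S3 encryption=none bucket=my-lab-data
    git annex copy brain_scan.nii.gz --to lab-s3    # transport content to that remote
    git annex get brain_scan.nii.gz --from lab-s3   # or retrieve it from there

Swapping this for a torrent, IPFS, or any other special remote changes nothing about how the repository itself is used.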