So hello everyone, Zack, Stefano Zacchiroli here. We are here to present the Software Heritage project, which was the subject of our talk last year, shortly after its announcement. In this talk, we'll first give an overview of the project for those of you who might not know about it, and then go through a rather in-depth review of some of the technical and technological choices we made during the early phase of the project, and what worked well and what didn't. It might be of interest to any of you doing big data storage or big data analysis in general.

So I'll start with some of the motivation that made us start this project a bit more than one year ago. This room, luckily, does not need any explanation of why free software is important, or of the distinction between source code and binary form, so I'll skip all those parts that are usually part of our presentations about the project. But I want to make a specific point that, even among free software people, is sometimes not really spelled out: source code is actually a rather interesting knowledge representation. In free software, you usually think of source code as just a way to deliver software freedom. Without source code, you don't know what a program is doing on your computer or with your data; and without source code, you certainly cannot modify the program that does your computing. That is, if you want, a bit of a utilitarian view of source code, as a means to make the machine do what you want. But source code per se is also very valuable: source code embodies knowledge. A specific snippet of source code might contain an explanation of a rather clever way of rendering 3D frames, or of queuing network packets, or of doing something that has never been done before. If you have that piece of source code, properly commented, knowing its context, maybe knowing its development history as stored in a version control system, that piece of source code might actually tell you a story that is as relevant as, say, a scientific publication. So every time we publish a single bit of source code and release it freely under a free software license, we are potentially publishing knowledge that has never been published before. Of course, not all source code is that interesting, but some bits of source code are that important. So in a sense, when we, as free software developers, write, release, and publish source code, we are each contributing our little stone to the building of free software knowledge.

There is a notion, outside the specific realm of computing, called the commons. The commons are those resources which are not controlled by any specific individual, but are held in common and accessible to everyone at little or no cost. The term is used to describe things like air, water, or habitable land. More and more, in the digital age, people are starting to talk about digital commons, and specifically there is a thing called the software commons: the whole body of software that is out there, available to everyone at little or no cost, which can be freely used and modified by anyone who has the knowledge to do so.
And of course, you are free software people, so you immediately see the resemblance between this definition of the software commons and the definition of free software, right? So every time we collectively release a line of source code under a free license, we are contributing something to this software commons, a commons that ideally we want to protect and carry on to future generations. This strikes me as something very important. So it is legitimate to ask, similarly to the political discussions we have about the other, non-software commons, whether we are properly taking care of this software commons. And to be honest, there are reasons to be concerned.

Of course, as with all digital information, we might lose software, and specifically software source code. It might happen, for instance, that you just didn't have a backup of a software project, you lose your data, and your piece of source code is gone. But there are plenty of other ways, as we have seen in the past months and years, in which we risk losing forever material that was previously available as free software. For instance, there have been very high-profile hacks or cracks. There was a code hosting site called Code Spaces that was completely hosted on a public cloud; an attacker took control of their dashboard and started a ransom scheme: either you pay me a given, large amount of money, or I will delete your repositories, one by one. They didn't pay, and the attacker started destroying the repositories one by one, and so on and so forth. In the end everything was destroyed and the company went bust. You might ask: didn't they have backups? Sure, they had backups, in the same public cloud dashboard environment. So the company disappeared completely, and the source code that was hosted there, unless people had backups of it elsewhere, has been lost forever.

And you know, we have all seen Gitorious come and disappear. We've seen Google Code going away. More recently, we've seen CodePlex, the Microsoft code hosting site, announce that it is going away. Every time one of those sites disappears, I personally worry that maybe the source code that was there, the free software that was there, might be lost forever. And of course there are more mundane reasons for potentially losing pieces of source code: there are, literally, tapes in the basement of the Free Software Foundation with source code that will, at some point, literally rot, and we probably need to do something about that unless we want to take the risk of losing the source code that is on them. So essentially: we know we have the Internet Archive if some website disappears, but where do you go if a specific repo on GitHub, or on gitlab.com for that matter, goes away? Sure, it was a Git repo, so whoever has a copy of it still has the entire history; but good luck finding that person, or that copy of the repo, when you need it.
Then there is a second motivation that is important for me as a researcher, which is my main job: I'm kind of envious of our friends in physics, who have these amazing infrastructures where they federate resources around the world. When they want to do important research that is really expensive to do, like particle physics or astronomy, they just pool resources together and create things like the LHC, or the Very Large Telescope in the Atacama Desert. Those are facilities created by pooling money together, and scientists who want to use them essentially rent the facility for a given amount of time, be it minutes, hours, or days; they do their research, they go away, and the infrastructure remains there for others to use. Well, in computer science we don't have any of that. People who want to study the entire body of software that has ever been published in source code form have no such facility. They need to go and do their own scraping themselves, and people do exactly that: there are empirical software engineering communities, there are mining software repositories communities, that scrape some number of repos from GitHub and try to extrapolate from that to the practices of free software as a whole. But that, of course, is not as good as being able to go to a single place where all the source code that has ever been published is archived, run an experiment there, and check whether a hypothesis actually holds over the whole corpus. So essentially: if you study the stars, you go to Atacama; if you want to study the entire body of source code that we have ever published as humans, where do you go?

So this is why, a bit more than one year ago, we created the Software Heritage project. The goal of the project is the rather ambitious one of collecting, preserving, and sharing the entire body of software that has ever been published in source code form. It is very specifically focused on source code, not binaries, okay? The idea is that we try to find source code where it is, ideally on the internet; we try to preserve it for the very long term; and we give access to it to anybody who needs it, be it a researcher, or any random user who just lost access to a specific tarball or a specific repo and wants to find it again.

The project is meant to serve several different use cases, which I will not detail in depth, but let me just give you some points about each of them. Of course, we want to serve the cultural heritage angle of all of this. We believe that software is knowledge, and that software knowledge can only be found in source code, so we want to make sure that the knowledge embedded in software is preserved for future generations. Then there are interesting use cases for industry. As we know, industry uses more and more free software, but it doesn't have the equivalent of the part numbers that exist in other, non-software parts of industry. So keeping track of the equivalent of a bill of materials for the free software components in the products you ship is kind of hard.
So we really need a place that can help people do that kind of tracking of the specific versions of software they embedded in specific products, be they purely software products or other products that embed software themselves. There are use cases relevant for research, as I mentioned. Another one, important for research, that I care deeply about is the scientific reproducibility of experiments that use software, which is something that happens in all branches of science these days. It is very common to publish papers where you have used software to do something; but if you want to rerun the same experiment, good luck finding that software. That's not because people are trying to cheat or anything; it's just that we don't have good practices and good places to store the source code that has been used to run a specific scientific experiment. And there are use cases related to education: imagine being able to do historiographic research on the evolution of the implementation of a specific algorithm over time, or having the ultimate source book where you can point students to real-life implementations of the algorithms that they only see in pseudocode in textbooks, and mix and match between the two. So this is what we are meant to serve. But we don't want to do all of that ourselves: the only thing we want to focus on is the archival part, so that others can serve those use cases using the Software Heritage archive.

It's an ambitious thing to do, of course, so we are trying to make decisions in the design of the project itself that maximize the chances of success. There are some decisions about how we develop this thing; you know us, we are free software people, so of course we decided from day one that all the software developed to run the project is going to be free software. We believe this is an important mission, and we believe we need to be accountable to the public in the way we are doing things. We want people to be able to look at our code and say: hey, in your archiver or scraper or whatever component here, you made a mistake, and I want to help you fix it. This is why we want to publish everything we do, and we are already doing that, in fact, as free software. Second, we believe that to maximize the chances of this thing being here in the very long run, it should not be a for-profit endeavor. Not because companies are bad or anything, but because companies may change their priorities much more quickly than the time horizon of this project. So we believe it should be run by the non-profit organization that we are building. And of course, to maximize the chances of surviving any sort of disaster, we believe that replication should be built into the project from day one: the archive will be mirrored and replicated around the world, as soon as others have enough resources to run mirrors of it.

So this is the general idea of what we're doing and the principles we are using to run it. Let me give you some more specific technical details about how we are doing this, where we are right now, and what's changed since the last time Software Heritage was presented here at DebConf. So what do we archive, in practice? Essentially, we go after places that are meant to distribute source code on the internet.
Today, this essentially means forges like GitHub or GitLab instances, or SourceForge, or any other place where you can find source code. When we find one of those places, what we want to archive is, first, the actual file contents: the content of the files, the real source code. But not only that: also the entire development history of the project, which means revisions, or commits, with all the associated metadata; it means releases, that is, which version has been tagged as which release of the software in question; and it also means, of course, crawling information: where and when we found any single software artifact like the ones mentioned above. And we want to store all this in a model which is agnostic of, and as independent as possible from, the specific technology used to distribute the software: a model which is independent of whether your software was in Git or in SVN, or whether it was distributed as a Debian source package, as a source RPM, or as a tarball. The reason we want this is that the technologies used to distribute and develop software evolve over time; we don't want to have to re-download and re-store everything when the next major version control system comes along and takes Git by storm.

What we do not archive, on the other hand, is all the rest of the information around software development. We don't archive websites, we don't archive wikis, we don't archive mailing lists, and we don't archive the code reviews or logs that you might have in any free software project. Not because this stuff is not important, it really is, but we want to avoid scope creep; we want to avoid stretching ourselves too thin. We think that merely archiving source code with its development history is already a huge task, so we are trying to focus on that. At the same time, we want to make it easy to point, from other archives, to our archive and to the specific artifacts that have been archived in Software Heritage. For instance, if you have a mailing list archived on Gmane, you might easily say: this mailing list corresponds to that project, whose source code has been archived on Software Heritage; or the same thing with the Internet Archive. Ideally, we want to play our role in an upcoming Wikipedia of software, or semantic wikipedia of software, that can be used for all sorts of purposes. We are not necessarily going to build that ourselves, but we want to make it easy for people to point to us from this sort of meta-archive that might exist about software.

So this is what we archive. The architecture, the data flow if you want, of the archival process is a rather straightforward crawling approach in which we have essentially two tiers. The first tier is that of listing the available software distribution places. Essentially, we want to be able to go to something like GitHub and list all the public repositories available there; or go to a specific GitLab instance and list all the repositories available there; or go to Debian, a specific release or suite of Debian, and list all the packages available there; or NPM, or PyPI, and so on and so forth. For each of those places, we have a software component capable of listing, by either scraping or using public APIs where they exist, all the specific repositories or packages available, which we call software origins.
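To make the lister idea concrete, here is a minimal, hypothetical sketch in Python. The forge URL, the Link-header pagination, and the clone_url field are all assumptions modeled on a generic GitHub-like API; this is not Software Heritage's actual lister code:

```python
import requests

def list_origins(api_url="https://api.example-forge.org/repositories"):
    """Enumerate all public repositories ("software origins") of a
    GitHub-like forge by following its paginated listing API.

    The endpoint and response shape are hypothetical; a real lister
    must also handle rate limiting, authentication, and incremental
    re-listing of origins it has already seen.
    """
    origins = []
    url = api_url
    while url:
        response = requests.get(url)
        response.raise_for_status()
        for repo in response.json():
            # Record the clone URL: this is the "software origin"
            # that loaders will later fetch from, again and again.
            origins.append(repo["clone_url"])
        # GitHub-style pagination: the next page is advertised in the
        # Link header; stop when there is no next page.
        url = response.links.get("next", {}).get("url")
    return origins

if __name__ == "__main__":
    for origin in list_origins():
        print(origin)
```

A real lister has more moving parts, but the core job is exactly this: turn a distribution place into a flat list of origins to be crawled.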
So essentially, by periodically running those listers, we create a set of software origins, which might be quite big: on the order of millions of software origins that periodically need to be consulted to retrieve software artifacts. The second part, retrieving the source code from those places, is what loaders do. We have one loader specific to each version control system, or each kind of source package, essentially: a Git loader, a Subversion loader, a Mercurial loader, a Debian source package loader. What they do is go and fetch the material available at a specific origin, and add to the archive any artifact that has never been seen before. So essentially, loaders de-duplicate by design: if we have archived a file from a Debian source package and we encounter the same file in a Git repo, we store it only once; if we have archived a commit in a specific Git repo and we find the same commit in a Subversion repo, we store it only once, okay? Everything is de-duplicated by default. The reason we do that is, again, that we don't want to re-archive everything once people decide that GitHub is no longer cool and migrate to the next hot forge or whatnot. We really want to be economical in terms of storage, to maximize the chances of making the cost of archival sustainable in the long run.

So what's the data model? What does this Software Heritage archive look like? I believe everyone in the room is familiar with what a Merkle DAG is: a data structure in which everything is content-addressable. Each node contains some content and points to other nodes, and the identifier of a node is a cryptographic checksum of the node itself, meaning either the checksum of the content stored in the node, or something like the checksum of the content plus the identifiers of the nodes pointed to by that node. It's a very popular construction in cryptography, because it has very good properties, for instance for comparing two different structures very efficiently. It's used natively in technologies that are really cool today, like Git, or IPFS, or any sort of blockchain out there. And it essentially has de-duplication built in, because when you try to add something, it is added only if it was not there before. So any kind of artifact we store, a blob, a revision, a directory, a release, is stored as a different kind of object. Here is an example of a commit, made by Nicolas here, by the way, and it's essentially a manifest: you have a pointer to a specific directory containing source code, plus a bunch of metadata like the commit message, the author, the timestamp. We take a checksum of all this, and that is the identifier of the node; the sketch below shows the construction. And to anticipate a frequently asked question here: no, we are not relying solely on SHA-1. We are well aware that SHA-1 is broken, and we are being conscientious about avoiding hash collisions; we'll get back to that later.
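As an illustration, here is a minimal sketch of how such identifiers can be computed, following Git's object format, with which these objects are compatible. The tree id, author, and message here are of course made up:

```python
import hashlib

def object_id(obj_type: bytes, payload: bytes) -> str:
    """Git-style identifier: SHA-1 of a typed, length-prefixed payload.
    Identical content always yields the same id, which is what gives
    the archive its de-duplication "for free"."""
    header = obj_type + b" " + str(len(payload)).encode() + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()

# A blob is identified by a hash of its content alone...
blob_id = object_id(b"blob", b"print('hello, world')\n")

# ...while a commit is identified by a manifest that embeds the id of
# the directory tree it points to, plus metadata: change any bit
# anywhere below, and every identifier above it changes too.
manifest = (
    b"tree 9ae202a15a8dd41b9a1e060e1b19e1b19e0e7e5a\n"
    b"author Jane Hacker <jane@example.org> 1467717403 +0200\n"
    b"committer Jane Hacker <jane@example.org> 1467717403 +0200\n"
    b"\n"
    b"Fix off-by-one error in packet queue\n"
)
commit_id = object_id(b"commit", manifest)
print(blob_id, commit_id)
```

Because the commit id covers the tree id, which in turn covers every blob below it, two artifacts are identical exactly when their identifiers are equal; that is where the de-duplication by design comes from.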
So the archive itself is essentially one huge Merkle DAG. On the right of this picture you have the blobs; any blob is stored only once, and you have directories pointing to blobs. Directories are structured just like on any Unix-like file system: you essentially have inodes pointing to either directories or blobs. And you work your way up this diagram: above directories you have commits, each commit pointing to a directory as it evolved over time. On the left, you have snapshots, which are essentially the pictures of a repo we take every time we visit it. So even if something in a repo disappears later on, for instance if somebody does a git push --force, we will still keep the material that was seen before, because all objects that are now gone, but were available before, are still reachable from the previous pictures of that repo that we have archived.

And this is not just theory: it exists today and works in production. In the Software Heritage archive, we have already archived an entire mirror of GitHub itself; of course, of the public repos of GitHub, not the private ones, because we don't have access to them. The archive is being kept up to date; we are lagging a little bit due to some resource issues, but it is maintained up to date with some time delta with respect to GitHub. We have archived all of snapshot.debian.org as of 2015, as a one-shot experiment. Around the same time, we archived all the releases of GNU projects that were available at the time. We have retrieved full dumps of Gitorious and of Google Code, thanks to a collaboration with the Internet Archive for Gitorious and with Google itself for Google Code, and we are in the process of adding them to our archive; it is still in progress because we don't yet have all the loaders that are needed. We have a loader for Git, so we can basically process all of Gitorious and the Git part of Google Code, but we don't yet have production loaders for the other version control systems that were supported by those hosting platforms at the time. And we are working on Bitbucket: I don't know if Avi is in the room, no he's not, but he has helped us develop a Bitbucket lister, and there too we will soon put the mirroring of Bitbucket itself in production.

Here are some numbers. They are kind of old, I didn't take fresh screenshots, but you can go on the website and see today's numbers. We have already archived more than three billion unique source files; all these numbers are unique, so there are no duplicates in there. More than 800 million commits, coming from something like 60 million different projects; here "60 million projects" should be interpreted as 60 million repositories, essentially. So it's a pretty big thing. The blobs themselves take something like 150 terabytes, compressed, and all the relationships between the different kinds of objects are in a database, a Postgres DB, which is something like six terabytes today. As a graph, it's an interesting beast: something like seven billion nodes and 60 billion edges, which, as Nicolas will discuss later, is close to, actually past, the state of the art of graph DBs. We believe, and we have claimed this many times without anybody contesting it so far, that this is the largest source code archive ever built, and it keeps growing, because every day our listers and loaders go on and on and retrieve new material that is added to the archive.

So what can you do with it? The most advanced feature we have is a web API, a REST, JSON-based API that you can access at archive.softwareheritage.org/api/, and essentially there you can navigate through this Merkle DAG.
Let me give you an example of where you can start from. You can say: hey, I'm interested in a project that was hosted at github.com/hylang/hy, which is Paul Tagliamonte's implementation of a Lisp dialect on top of Python. And you can ask: do you have that? Sure, we have that, and we give you a new URL with all the visits, all the times we have taken a picture of that repo. You go there and you get the list of the snapshots we took of that GitHub repo, and you will find that among those pictures there is one we took in September 2016, with a pointer to a URL with more information. You go there, and it tells you what every single branch or tag was pointing to at the time; so you can see all the releases and all the branches, as on GitHub, and each of them points to an object of some kind, usually a revision or a release. You go on and ask for more information about a specific object; here, for instance, I am asking for information about a specific commit, and it gives you all the metadata about that commit: the author, the date, and, in the end, which directory the commit was pointing to. So you can go on, navigate through that directory structure, and at some point you will reach blobs. Once you arrive at a blob, you get all the information we have about it, notably different kinds of checksums: the SHA-1, the salted SHA-1, Git style, the SHA-256. We also give you some additional URLs, populated on a best-effort basis, with information we have detected about that blob, for instance the file type and the license, which we try to detect with license detection and file type detection tools. And we give you a URL, ending in raw, to actually download the content of the file. Downloading content is not yet open for all the objects we have; it's open essentially only for the files we have detected as being text-like, but the idea is of course to open it up more and more; we are trying to focus our resources on making downloadable what is most useful right now. Also, we don't have huge resources to run this API right now, so there is a pretty heavy rate limit in place, I think right now 120 requests per minute. So you cannot really use it for massive scraping, but you can actually play with it and navigate through what is possibly the largest Merkle DAG about source code in existence.
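The whole walk, from origin down to raw file content, could look something like the following sketch. The endpoint paths and JSON field names here are abbreviations and assumptions for illustration only; check the documentation at archive.softwareheritage.org for the exact API, and mind the rate limit:

```python
import requests

API = "https://archive.softwareheritage.org/api/1"

def get(url):
    """Fetch one API resource as JSON."""
    response = requests.get(url)
    response.raise_for_status()
    return response.json()

# Hypothetical walk, mirroring the steps described above; the exact
# endpoint paths and field names may differ from the deployed API.
origin = get(f"{API}/origin/git/url/https://github.com/hylang/hy/")
visits = get(origin["origin_visits_url"])        # all our "pictures"
snapshot = get(visits[0]["snapshot_url"])        # branches and tags
branch = snapshot["branches"]["refs/heads/master"]
revision = get(branch["target_url"])             # commit metadata
directory = get(revision["directory_url"])       # the source tree
entry = directory[0]                             # first tree entry
if entry["type"] == "file":
    blob = get(entry["target_url"])              # checksums, mimetype...
    raw = requests.get(blob["data_url"])         # the actual bytes
    print(raw.text[:200])
```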
So where do we want to go from here? Essentially, we already have a sort of Wayback Machine, available only as a web API, on top of what we have archived. We want to have the same thing in the form of a proper web UI that you can navigate in a browser, without having to use the API. And of course, we want to offer a way to wget or git clone directly from the archive; this is something we hope to roll out after the summer. The idea is that you will say: hey, I want the tarball that has this checksum, or I want the commit of which I have an ID, and an asynchronous background process will cook something up and tell you when it's available, so you can download it and import it into Git or whatever other version control system you are using. And from there, the sky is the limit, right? We want provenance queries: you tell us, hey, I have this file, for instance the GPLv3 license text, and I want to see all the places where you have ever seen this file; that might be a lot of places. And of course we want full text search, because once you have this archive, it would be cool to be able to find the lines of code you care about, for instance lines matching some regex used to find patterns of bugs.

So, this is it for the general architecture part. Now, some more gory details about the technical, low-level implementation of all this. Let's start with a problem statement: how would you store and query a graph with 10 billion nodes and 60 billion edges? How would you store the contents of more than three billion files, which is around 300 terabytes of raw data, with only 100,000 euros? We have made some choices that can be debated. For hardware, we went with two hypervisors with 512 gigabytes of RAM each and 20 terabytes of SSD, and into that we plugged a huge disk array of 60 times six terabytes of spinning rust, so pretty slow storage. And we have one backup server with some RAM and another of those storage arrays. For our software stack, we decided to go with things we were comfortable with. We didn't really know what the scale of the thing we were building would be, so we played it safe. We started with a relational database management system, Postgres, for the storage of the graph. For the actual file contents, we were uncomfortable with putting 300 terabytes inside Postgres, so we used plain file systems. So: two separate components. One stores the metadata, and is just a thin Python API on top of a pile of PostgreSQL functions, so that we could enforce the relational integrity of the data at the lowest layer of storage. The other stores the blobs, the objects: again, a very thin object storage abstraction layer on top of regular storage technologies, basically a directory, a sharded directory, or any cloud-based storage system. And then, on top of that, a separate layer handles asynchronous replication and integrity management, the idea being that we don't want to impose a technology on future mirrors: we only ask you to provide us with a simple key-value store, and we do the replication and integrity management on top of that.

This object storage was actually implemented on XFS. I think we made some wrong assumptions about the possibilities we had with XFS, and basically what drove our file system choice was the number of inodes that we could store. We made 16 sharded XFS file systems, using a nested directory structure three levels deep and 256 directories wide: 16 million directories, which is about a million per partition. We also made a prototype storing all the blob data on Azure, thanks to one of our sponsors: 16 storage containers, with all the objects stored in a flat structure; just the SHA-1 as the key, the gzipped data as the value, and that's it.
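As a rough illustration of that object storage layer, here is a minimal sketch of a content-addressed key-value store sharded over nested directories, in the spirit of the layout just described. It is simplified: the real implementation spreads shards over multiple file systems and adds replication and integrity checking on top:

```python
import gzip
import hashlib
from pathlib import Path

class ShardedObjStorage:
    """Minimal sketch of a content-addressed object store on a plain
    file system, sharded into nested directories (256 entries per
    level) so that no single directory grows unmanageably large."""

    def __init__(self, root: str):
        self.root = Path(root)

    def _path(self, hexhash: str) -> Path:
        # e.g. 8624bcdae5... -> <root>/86/24/bc/8624bcdae5...
        return (self.root / hexhash[0:2] / hexhash[2:4]
                / hexhash[4:6] / hexhash)

    def add(self, content: bytes) -> str:
        hexhash = hashlib.sha1(content).hexdigest()
        path = self._path(hexhash)
        if not path.exists():  # de-duplication by design
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(gzip.compress(content))
        return hexhash

    def get(self, hexhash: str) -> bytes:
        return gzip.decompress(self._path(hexhash).read_bytes())
```

The appeal of the scheme is that any backend able to store and retrieve a compressed blob under its hash, a plain directory tree as here, or a cloud storage container, can serve as a mirror backend.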
So, what we've noticed: the generic model that we use for storage is fine. It's a very, very thin abstraction layer, it's very simple, and implementing all the things we wanted on top of it, like replication and integrity management, was really easy. The backend that we used underneath, the file systems, not that good. Spinning up 60 disks takes a while, and we don't have much RAM for it, because the RAM is shared between the database server, to which we gave most of the RAM we could, and the workers that do all the crawling; so the storage server got a limited amount of RAM. When you get 16 million directory entries in your cache, you basically don't have RAM for anything else, so you get very, very, very bad performance on file system accesses: to access a single object, you may have to wait a quarter of a second. So it's kind of bad. The question is: would more memory help? Probably; we would be able to keep more things in cache, and the directory entries don't change much, we only append some more data at the end. But the thing is, all accesses use hashes that are uniformly distributed, so we never have any locality in our data accesses. We are storing three billion blobs with a median size of, I think, three kilobytes: really, really tiny files. So for every single access, you have to seek to a random place on disk. Using faster storage like SSDs would help by some measure, but I think what's needed, and what we noticed when doing the prototype on Azure, is that only parallelism can help: having the data on a lot of tiny, really cheap servers, but a lot of them, to parallelize accesses, is, I think, what we'll do moving forward.

For the metadata storage, we deployed PostgreSQL with a primary and a replica, using pglogical. The idea was to split the indexes between the primary server, which is tuned for writes, and the replicas, which are tuned for reads. We've done most of our logic in SQL, either raw SQL or PL/pgSQL, the PostgreSQL procedural language, and we added a really thin Python API over those SQL functions. The idea behind that was to do proper handling of the relationships between all the object types at the lowest level, the database level. And we had a dream of doing somewhat fast recursive queries on the graph, for instance finding the provenance of a single content by walking up the whole graph in one single query, as in the sketch below. We implemented that, and PostgreSQL works really well, until your indexes don't fit in RAM. We have a six terabyte database; I think the smallest table we have is 70 million entries, which is not much, but the content table, for instance, the one that stores all the content identifiers, is 3.5 billion rows long and uses 600 gigabytes just for the data; then you add about the same amount for the indexes. Furthermore, recursive queries jump between object types all the time, between contents, directories, lists of directory entries, et cetera, et cetera, and all of those jumps go through evenly distributed hashes: no data locality, so no caching whatsoever works for this data. Finally, our massive de-duplication means that we get very efficient storage, but it also means that, for recursive queries, the width of each level grows exponentially: one content, for instance the GPL, is in a million directories, which are themselves in two million revisions, et cetera, et cetera. So it explodes really fast.
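For illustration, the kind of provenance walk we had dreamed of could be phrased as a recursive CTE, here sketched from Python over an invented, much simplified two-table schema; the production schema is considerably more involved:

```python
import psycopg2

# Simplified, hypothetical schema for this sketch:
#   directory_entry(dir_id text, target_id text)  -- directory -> entry
#   revision(id text, dir_id text)                -- commit -> root dir
# Starting from one file content, walk "upwards" through every
# directory that transitively contains it, then report the revisions
# pointing at those directories. Elegant on paper; at our scale, each
# level fans out over evenly distributed hashes, defeating all caching.
PROVENANCE_QUERY = """
WITH RECURSIVE ancestors(id) AS (
    SELECT dir_id FROM directory_entry WHERE target_id = %s
  UNION
    SELECT de.dir_id
    FROM directory_entry de
    JOIN ancestors a ON de.target_id = a.id
)
SELECT r.id FROM revision r JOIN ancestors a ON r.dir_id = a.id
"""

def revisions_containing(conn, content_id):
    """Return ids of all revisions whose tree contains content_id."""
    with conn.cursor() as cur:
        cur.execute(PROVENANCE_QUERY, (content_id,))
        return [row[0] for row in cur.fetchall()]

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=softwareheritage")  # assumed DSN
    revs = revisions_containing(
        conn, "8624bcdae55baeef00cd11d5dfcfa60f68710a02")  # sample hash
    print(f"{len(revs)} revisions contain this content")
```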
Finally, referential integrity. Well, sure: if you publish a directory, you want to have the contents of all the files in that directory. But the real world, the real repositories that we download in the wild, are corrupted in all kinds of ways, and we want to keep that data. So we have some dangling links pointing to objects we don't know about, and that cannot be downloaded from the source either, because the repository is broken, or for policy reasons, for instance because something was taken down over copyright issues. In the end, we had to relax all the relational constraints we had in our database. This gave us a slight boost in performance, because there is nothing left to check, but then you have to add another layer of data integrity checking to make sure you don't put crap in your archive, which we have done for some small parts.

So: we are quite satisfied with the way we've built the abstraction for object storage. The prototype we've put on Azure works really well. Plain file systems on spinning rust: no, that doesn't work, but we knew that before starting. So we need to investigate other storage technologies, like Ceph or any other object store that is completely scaled out, for the main copy of the archive, which we will be able to do as our budget ramps up, of course. The metadata storage is a bit less satisfying. We wanted referential integrity everywhere, we wanted recursive queries, and we noticed that neither of those really works in the real world. So we are considering migrating to plain object storages for all the object kinds, with another layer, which we have already started to implement, to regularly check the integrity of the metadata that we store.

So, to conclude, I want to tell you how you can help. Of course, there is code: as I told you, everything we do is free software, so all our code is available from our forge, and there is a page pointing to the other community resources such as IRC channels and mailing lists; you know the drill. Our development priorities right now are mostly about coverage. We need people to help us create new listers, and Avi has created an amazing API that makes it really easy to write the scraping code for listing the contents of any forge out there. And of course loaders, which are a bit more complicated and involved, because they require some knowledge of the archive's structure at a deeper level; we will need loaders for every VCS out there and every kind of source package out there. You can also consider joining us: we have been hiring recently and we will be hiring a bit more in the future, so there are opportunities for people who want to work at Software Heritage, or do an internship at Software Heritage.

I also want to share with you the fact that, luckily, we are not alone in what we're doing. We are lucky enough that a number of different organizations have expressed support for what we're doing. In this slide, you see a number of free software organizations you know very well, from being part of this community as well; a bunch of academic institutions that are really looking at the research potential of this; and a bunch of institutional partners, essentially from governments, that are looking forward to the potential that Software Heritage can deliver. We also have sponsors, of course: the project has luckily been fully supported since day one by INRIA, the French research institution with a very strong background in free software.
INRIA is still essentially incubating the project, until it can migrate to a completely separate structure, a separate foundation, which we are working on. We also have sponsors among companies that have decided to join us in this mission, and we are actively looking for other sponsors, of course. Also, we have been very proud to be part of a landmark agreement with UNESCO about the importance of preserving source code. In this picture, which was amazing to be part of, you can see the former president of the French Republic, the head of UNESCO, and the president of INRIA making this agreement something real. The next step in this partnership with UNESCO will be a conference in September about the preservation of source code, with every other party interested in helping us and, in general, in working on the mission of preserving source code, even in different organizations. So this is basically it; some pointers here. If you want to know more about the architectural details, there is a paper that we published recently at a digital preservation conference, and a preprint is available at the URL you can find in the slides. And with that, I think we still have a few minutes for questions.

So, I'm wondering how Software Heritage might be useful for Debian packages. In particular, one of the things you mentioned very briefly was that part of what you do in your ingestion process is some analysis of the code, and of particular interest is the license data. We would all love it if all free software came with a machine-readable license, like SPDX or something; in fact, we would probably be satisfied if all free software came with an explicit license or COPYING file, but we know it doesn't. As packagers, could we use the analysis that you've done, that meta-analysis, to help guide us in packaging?

So yes. It wasn't the main focus of this presentation, so I didn't go into the details of that, but for instance in this year's LibrePlanet presentation of the project, I focused very much on the vision that Software Heritage can be the open data archive about software provenance and licensing. Of course, the real license of something can only be determined by looking at who wrote the specific lines of code, and under which agreement with their employer, so that is really complicated. But we would love to be the place where people automatically check for licenses, detect licenses, and store the results on Software Heritage as factual information. Factual information means: today I ran this version of FOSSology on this specific commit of this specific project, and this is the result I got. It's not necessarily true information, but it is a fact that FOSSology returned that result. And once it's there, you can access it with the sort of machine API that anyone can use. Essentially, the databases you have today live in compliance companies, locked-in databases you can use only by being a client of those companies; my dream, our dream, is that this information can be stored in Software Heritage for everyone to use.
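As a sketch of what one such factual record could look like (all field names and values here are invented for illustration, not Software Heritage's actual metadata schema):

```python
# Hypothetical shape of one licensing "fact": it does not assert what
# the true license of the project is, only that a given tool, at a
# given version, run over a given archived object on a given date,
# reported a given result. Facts accumulate; they are never overwritten.
license_fact = {
    "object": "1367dcf187b0e9faeb2df8a01dbd9e6d5a99c2bd",  # made-up commit id
    "tool": {"name": "fossology-nomos", "version": "3.1.0"},  # assumed tool/version
    "result": {"licenses": ["GPL-2.0+"]},
    "date": "2017-08-06T14:00:00Z",
}
```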
Have you considered monetizing the archive? I'm asking because there were logos of GitLab, for example, and at work we obviously have an internal GitLab with a lot of code in there that we don't necessarily publish. But since you're interested in all types of software, including some that might not be freely licensed, it could be interesting if there were a way we could even pay for a service where we push our code every day, and the day our system crashes, we can get any version from the past back.

So essentially this is a question of how you make such a project sustainable, and there are all sorts of models to...

Well, it's more a question of how I can use you, but...

Right. So yes, we have thought about that. For now, we are very much trying to make the project sustainable purely on the basis of sponsorships with no strings attached, and the hope is that we will have enough of those to offer every service for free to everyone. The only limits we would like to put on the archive are for privacy reasons: we have a DB in which we already have a table with a huge mapping of developers' names to emails, and that table is not something you can put on the internet, right? So there is a trade-off to be made there; but aside from privacy concerns, we would really like to offer this thing as open data for anyone to use. Maybe that will not work, and then we will consider different models, but it will always remain a non-profit endeavor, of course.

Is there one tiny last question? Yeah. Tiny, tiny.

Okay, so it's a completely unreasonable request: could you possibly understand all the different build systems, such that Software Heritage could identify the dependencies in a project for us?

So, no. But in the vision of the Wikipedia of software, ideally people would be able to contribute something that actually parses debian/rules or debian/control files and says: hey, this thing should be buildable with that version of that other package, archived elsewhere or in Software Heritage; and someone else can contribute the same for a different package management system. Once you have that, you can try to automatically rebuild the software that we have in our archive. It's a huge problem per se, sorry.

And to complete what was just said: we are building the infrastructure for other people to read a feed of the new revisions that arrive in Software Heritage, so that they can download them and run their own analysis tools on them. So we already have the basis to make this work.

Yeah, we have a break, but go ahead. Sure.

Do you have any kind of metric of where you stand on the whole indexing of free software, where you are right now? In terms of how much remains to archive, I mean. One percent, 50 percent?

Of course, it's very hard to tell. Someone working with us, who has been part of some of the analyses of the corpus of free software out there, has made very simple, back-of-the-envelope estimations suggesting that we are about halfway. There is a lot of material that we have not crawled from its origin yet, but whose contents are already stored, because by archiving Debian, for instance, you don't get everything that is out there, but you do get the content that is massively popular. So the back-of-the-envelope estimation is that we probably need about twice this much to have everything that is out there. Plus there is the fact that, of course, more keeps being published, so the growth rate is a different kind of problem.

Thank you. Thanks.