Hi, thank you. I'm Nicolas Dandrimont, and I will indeed be talking to you about Software Heritage. I'm a software engineer on this project, and I've been working on it for three years now. We'll see what this thing is all about. I guess the batteries are out, so let's try that again.

We've all been doing free software for a while, and we know that software source code is something special. Why is that? As Hal Abelson said in SICP, his textbook on programming, programs must be written for people to read, and only incidentally for machines to execute. So basically, what software source code gives us is a way inside the mind of the designer of the program. For instance, you can get insights into very clever algorithms, like the fast inverse square root used for 3D rendering in the Quake III source code. You can also get insights into the algorithms underpinning the internet, for instance by reading the network queueing code in the Linux kernel.

What we're building as the free software community is the free software commons. The commons is all the cultural, social and natural resources that we share and that everyone has access to. More specifically, the software commons is what we're building with software that is open and available for all to use, modify, execute and distribute. We know these commons are a really critical resource. But who's taking care of them?

Software is fragile. Like all digital information, you can lose software. People can decide to shut down hosting spaces because of business decisions. People can hack into hosting platforms and remove code, maliciously or just inadvertently. And for the obsolete stuff, there's rot: if you don't care for the data, it decays and you lose it. So where is the archive we go to when something is lost? When GitLab goes away, when GitHub goes away, where do we go?

Finally, one last thing we noticed is that lots and lots of teams do research on software, and there's no real big infrastructure for research on code. There are tons of critical issues around code: safety, security, verification, proofs. Nobody is doing this at a very large scale. If you want to see the stars, you go to the Atacama Desert and you point a telescope at the sky. Where is the telescope for source code? That's what Software Heritage wants to be.

What we do is collect, preserve and share all the software that is publicly available. Why do we do that? To preserve the past, enhance the present and prepare for the future. What we're building is a base infrastructure that can be used for cultural heritage, for industry, for research and for education.

How do we do it? We do it with an open approach: every single line of code that we write is free software. We do it transparently: everything we do happens in the open, be it on a mailing list or on our issue tracker. And we strive to do it for the very long haul, so we do it with replication in mind, so that no single entity has full control over the data we collect, and in a non-profit fashion, so that we avoid business-driven decisions impacting the project.

So what do we do, concretely? We archive version control systems. What does that mean? It means we archive file contents, so source code files. We archive revisions, which means all the metadata of the history of the projects.
We try to download it and we put it inside a common data model that is shared across the whole archive. We archive releases of the software: releases that have been tagged in a version control system, as well as releases that we find as tarballs, because sometimes both views of the source code differ. And of course, we archive where and when we've seen the data we've collected. All of this goes inside a canonical, VCS-agnostic data model. So whether you have a Debian package with its history, a Git repository, a Subversion repository or a Mercurial repository, it all looks the same, and you can work on it with the same tools.

What we don't do is archive what's around the software, for instance the bug tracking systems, the homepages, the wikis or the mailing lists. There are some projects that work in this space; for instance, the Internet Archive does a lot of very good work around archiving the web. Our goal is not to replace them but to work with them, and to be able to do linking across all the archives that exist. For the mailing lists, for instance, there's the Gmane project, which does a lot of archiving of free software mailing lists. So our long-term vision is to play a part in a semantic Wikipedia of software, a Wikidata of software, where we can hyperlink all the archives that exist and do interesting things in that area.

Now, a quick tour of our infrastructure. All the way to the right is our archive. The archive consists of a huge graph of all the metadata about the files, the directories, the revisions, the commits and the releases, with all the projects sitting on top of the graph. We separate the file contents out into an object storage, because of the size discrepancy: we have lots and lots of file contents to store, so we do that outside of the database that stores the graph.

What we archive is a set of software origins: Git repositories, Mercurial repositories, et cetera, et cetera. All those origins are loaded on a regular schedule: a very active software origin will be archived more often than stale things that don't get a lot of updates. To get the list of software origins to archive, we have a bunch of listers that can scroll through the list of repositories on GitHub or other hosting platforms. We have code that can read Debian archive metadata and make a list of the packages inside that archive. Et cetera, et cetera. All of this is done on a regular basis.

We are currently working on some kind of push mechanism, so that people or other systems can notify us of updates. Our goal is not to do real-time archiving; we're really in it for the long run. But we still want to be able to prioritize stuff that people tell us is important to archive. The Internet Archive has a Save Page Now button, and we want to implement something along those lines as well, so that if we know some software project is in danger for one reason or another, we can prioritize archiving it.
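To make the lister idea concrete, here is a minimal sketch of what such a component could look like, written against GitHub's public repository-listing endpoint. This is illustrative code, not the project's actual lister; unauthenticated requests to this API are heavily rate-limited, and the page count is an arbitrary choice for the example.

```python
import requests

def list_github_origins(since=0, pages=3):
    """Page through GitHub's public repository listing and yield clone
    URLs -- the kind of origin a lister would hand to the scheduler."""
    for _ in range(pages):
        resp = requests.get("https://api.github.com/repositories",
                            params={"since": since}, timeout=30)
        resp.raise_for_status()
        repos = resp.json()
        if not repos:  # no more repositories to list
            break
        for repo in repos:
            yield repo["html_url"] + ".git"  # origin URL to archive
        since = repos[-1]["id"]  # resume after the last repository seen
```

In the real system, each discovered origin would be recorded in a scheduler database with its own update cadence, which is how active repositories end up visited more often than stale ones.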
So this is the basic structure of a revision in the Software Heritage archive. You'll see that it's very similar to a Git commit: the format of the metadata is pretty much what you'll find in a Git commit, with some extensions that you don't see here, because this one comes from a Git commit. Basically, we take the identifier of the directory that the revision points to, we take the identifier of the parent of the revision, so we can keep track of the history, and then we add some metadata, authorship and committership information and a revision message, and we take a hash of all this. That makes an identifier that's probably unique. Very, very probably unique. Using those identifiers, we can retrace all the origins and all the history of development of a project, and we can deduplicate across the whole archive. All the identifiers are intrinsic, which means we compute them from the contents of the things we are archiving, and that means we can deduplicate very efficiently across, well, all the data that we archive.

And how much data do we archive? A bit. We passed the one billion revision mark a few weeks ago. This graph is a bit old, but anyway, there's a live graph on our website. That's more than four and a half billion unique source code files. We don't actually discriminate between what we would consider source code and what upstream developers consider source code: everything that's in a Git repository, we consider source code if it's below a size threshold. So: a billion revisions, across 80 million projects.

What do we archive? We archive GitHub, and we archive Debian. For Debian, we run the archival process every day, so every day we get the new packages that have been uploaded into the archive. For GitHub, we try to keep up; we are currently working on some performance and scalability improvements to make sure that we can keep up with the development activity on GitHub. Among other things, we have archived the former contents of Gitorious and Google Code, two prominent code hosting spaces that closed recently, and we've been working on archiving the contents of Bitbucket, which is kind of a challenge because the API is a bit buggy and Atlassian isn't too interested in fixing it.

In concrete storage terms, we have 175 terabytes of blobs: the files take 175 terabytes. And we have a kind of big database, six terabytes, which only contains the graph, the metadata of the archive. That's basically a graph with 8 billion nodes and 70 billion edges, and of course it's growing daily. We are pretty sure this is the richest public source code archive available now, and it keeps growing.

So what kind of stack do we use to store all this? We use Debian, of course. All our deployment recipes are in Puppet, in public repositories. We've started using Ceph for the blob storage. We use PostgreSQL for the metadata storage, with some of the standard tools that live around PostgreSQL for backups and replication. We use a standard Python stack for the scheduling of jobs and for the web interface: psycopg2 for the low-level stuff, Django for the web stuff and Celery for the job scheduling. In-house, we've written an ad hoc object storage system with a bunch of backends you can use: we're agnostic between a Unix filesystem, Azure, Ceph and tons of other things. It's a really simple object storage system where you can just put an object, get an object, put a bunch of objects, get a bunch of objects; we've also implemented removal, but we don't really use it yet. The data model implementation, all the listers, the loaders, the schedulers, everything has been written by us. It's a pile of Python code: basically 20 Python packages, and around 30 Puppet modules to deploy all that.
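To make the intrinsic identifiers described above concrete, here is a minimal sketch of how a Git-style revision identifier is derived from a directory hash, parent hashes and the commit metadata. The names and hashes in the example are made up, but the hashing scheme shown is the one Git itself uses for commit objects.

```python
import hashlib

def revision_id(tree_hex, parent_hexes, author, committer, message):
    """Compute a Git-style intrinsic identifier for a revision."""
    # Assemble the manifest: tree, parents, author/committer lines,
    # a blank line, then the revision message.
    lines = [f"tree {tree_hex}"]
    lines += [f"parent {p}" for p in parent_hexes]
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = "\n".join(lines).encode()
    # Git hashes a "commit <length>\0" header followed by the manifest;
    # the SHA-1 of those bytes is the identifier. Same input, same id,
    # which is what makes deduplication across the archive cheap.
    return hashlib.sha1(b"commit %d\x00" % len(body) + body).hexdigest()

print(revision_id(
    "9bde9ba3b2d7f0e28c40086e0e0d05b69f3fa69c",  # directory (tree) id
    [],                                          # no parents: a root revision
    "Ada Lovelace <ada@example.org> 1514764800 +0000",
    "Ada Lovelace <ada@example.org> 1514764800 +0000",
    "Initial import\n",
))
```

Change any byte of the directory, the parents or the metadata and the identifier changes too, which is also what makes the archive tamper-evident, as comes up in the Q&A later.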
We've done everything under copyleft licenses: GPLv3 for the back end and AGPLv3 for the front end. So even if people stand up their own Software Heritage using our code, they have to publish their changes.

Hardware-wise, for now we run everything on a few hypervisors in-house, and our main storage is currently still a very high-density, very slow, very bulky storage array, but we've started to migrate off this thing onto a Ceph storage cluster, which we're going to grow as needed in the next few months. Microsoft has also granted us some in-kind sponsorship for their cloud services, so we've started putting mirrors of everything on their infrastructure as well: a full object storage mirror on Azure, so 170 terabytes of stuff, as well as a database mirror for the graph. And we're now doing all the content indexing, and everything that needs to scale out, on Azure too. Finally, at the University of Bologna, we have a back-end storage for downloads: our main storage is quite slow, so if you want to download a bundle of things we've archived, we keep a cache of what we've already assembled, so that it doesn't take a million years to download.

We do our development in the classic free and open source software way: we talk on a mailing list, on IRC, on a forge; everything is in English, everything is in public. There's more information on our website if you want to have a look at what we do.

So, that's all very interesting, but how do you actually look into the archive? One of the ways you can browse and use the archive is the REST API. This API allows you to do point-wise browsing of the archive: you can go and follow the links in the graph, which is very slow, but gives you pretty much full access to the data. There's an index of the API endpoints that you can look at.

That's not really convenient, though, so we also have a web user interface. It's in preview right now; we're going to do a full launch in the month of June. If you go to archive.softwareheritage.org/browse with the given credentials, you can have a look and see what's going on. Basically, the web interface lets you see which origins we have downloaded and when we downloaded them, with a graph view of how often we've visited an origin and a calendar view of when we visited it. And then, inside the visits, you can browse the contents we've archived. For instance, this is the Python repository as of May 2017: you get the list of files and you can drill down; it should be pretty intuitive. If you look at the history of a project, you can see the differences between two revisions. I don't know what's going on with the syntax highlighting there, but anyway, the diffs arrive right after. So yeah, pretty cool stuff.
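Before the live demo, here's what that point-wise browsing could look like in practice: a small sketch that walks a revision's ancestry through the public REST API, one HTTP round-trip per node. The endpoint shape is the one the API documents today and may differ from what existed at the time of the talk; the starting identifier is whatever revision you want to explore.

```python
import requests

API = "https://archive.softwareheritage.org/api/1"

def walk_history(revision_id, steps=5):
    """Follow parent links through the revision graph, one call per node."""
    for _ in range(steps):
        rev = requests.get(f"{API}/revision/{revision_id}/", timeout=30).json()
        message = (rev.get("message") or "").splitlines()
        print(rev["id"][:10], message[0] if message else "(no message)")
        if not rev.get("parents"):
            break  # reached a root revision
        revision_id = rev["parents"][0]["id"]  # follow the first parent
```

Slow, as noted above, since every edge of the graph costs a round-trip, but nothing in the archive is out of reach this way.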
I should be able to do a demo as well. It should work; I'm going to try. Let me zoom in. So this is the main archive page. You can see some statistics about the objects we've downloaded. When you zoom in, you get some kind of overflow, because, yeah, why would you do that? If we want to browse, we can try to find an origin. Gdpsi. Okay. So there are lots and lots of random GitHub forks of things: we don't discriminate, and we don't really filter what we download. We're looking into doing some relevance-based sorting of the results. Here, blah, blah, next. Xilinx, why not?

So this one was downloaded for the last time on the 3rd of August 2016, so it's probably a dead repository. But you can see a bunch of source code; you can read the readme of the Gdpsi. If we go back to a more interesting origin, here's the repository for Git. I've deliberately selected an old visit of the repo, so we can see what was going on back then. If I look at the calendar view, you can see that we've had some issues actually updating this one, but anyway. If I look at the last visit, we can browse the contents. You get syntax highlighting as well. This is a big makefile with lots of comments. Let's see some actual source code. Anyway, so that's the browsing interface.

We can also get back what we've archived and download it, which is something you might want to do if a repository is lost: you can download it and get the source code back again. If you go to the top right of the browsing interface, you have actions and download, and you can download the directory you're currently looking at. It's an asynchronous process, which means that if there's a lot of load, it's going to take some time before you can actually download the contents. So you can put in your email address, and we'll notify you when the download is ready. I'm going to try my luck and just say okay, and it's going to appear at some point in the list of things I've requested. But I've already requested some things to download, so I can actually get one and open it as a tarball. Yeah, I think that's the thing I was looking at, this revision of the Git source code. And I can open it. Yay, Emacs. That's what you want. Yay, source code. So this seems to work.

And of course, if you want to script what you're doing, there's an API that allows you to do the downloads as well. The source code is deduplicated a lot, which means that for one single repository, there are tons of files we have to collect to build an archive of a directory, so it takes a while. But we have an asynchronous API: you can post the identifier of a revision to this URL and then get status updates. At some point, the status will tell you that the object is available, and you can download it. You can even download the full history of a project as a git fast-export archive that you can re-import into a new Git repository. So any kind of VCS that we've imported, you can export as a Git repository and re-import on your machine.

So, how to get involved in the project? There are a lot of features we're interested in; a lot of them are now in early access or have been done, and there's some stuff we would like help with. One thing we're working on is provenance information: you have a file content, and you want to know which repository it comes from. Full-text search is another; the end goal is to be able to trace even snippets of code that have been copied from one project to another. That's something we can look into with the wealth of information we have inside the archive. There are a lot of things people want to do with the archive; our goal is to enable people to do interesting things with a lot of source code.
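Going back to the scripted downloads for a moment, here is a rough sketch of that asynchronous flow against the vault API, assuming the endpoint layout from the current documentation (a POST enqueues the cooking task, a GET on the same URL polls it); the revision identifier is a placeholder, and the paths may differ from what was deployed at the time of the talk.

```python
import time
import requests

API = "https://archive.softwareheritage.org/api/1"

def fetch_gitfast(revision_id):
    """Cook and download a revision's full history as a git fast-export bundle."""
    url = f"{API}/vault/revision/{revision_id}/gitfast/"
    requests.post(url, timeout=30).raise_for_status()  # enqueue the cooking task
    while True:
        status = requests.get(url, timeout=30).json()
        if status["status"] == "done":
            # The bundle can be re-imported with `git fast-import`.
            return requests.get(status["fetch_url"], timeout=300).content
        if status["status"] == "failed":
            raise RuntimeError("cooking failed")
        time.sleep(10)  # deduplicated contents take a while to reassemble
```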
So yeah, if you have an idea of what you'd want to do with such an archive, please come talk to us; we'd be happy to help you help us.

What we want to do is diversify the sources of the things we archive. Currently, we have good support for Git, and okay support for Subversion and Mercurial. If your project of choice is in another version control system, we're going to miss it, so people can contribute in this area. On the listing side, we have coverage of Debian and coverage of GitHub; if your code is somewhere else, we won't see it. So we need people to contribute listers for, for instance, GitLab instances; then we can integrate that into our infrastructure, and people will actually be able to have their GitLab instances archived.

And of course, we need to spread the word and make the project sustainable. We have a few sponsors now: Microsoft, Nokia, Huawei; GitHub has joined as a sponsor; the University of Bologna; and of course Inria is sponsoring. But we need to keep spreading the word and keep the project sustainable. We also need to save endangered source code, and for that, we have a suggestion box on the wiki that you can add things to. For instance, we have in the back of our minds archiving SourceForge, because we know that it isn't very sustainable and is at risk of being taken down at some point.

If you want to join us, we also have some job openings available. For now they're in Paris, so if you'd consider coming to work with us in Paris, you can look into that.

So, yeah, that's Software Heritage. We are building a reference archive of all the free software that has ever been written, in an international, open, non-profit, mutualized infrastructure that we have opened up to everyone: all users, vendors and developers can use it. The idea is to be at the service of the community and of society as a whole. If you want to join us, you can look at our website, you can look at our code, and you can also talk to me. So if you have any questions, I think we have about 10 to 12 minutes.
Do you have a question?

How do you protect the archive against stuff that you don't want to have in the archive? I can think of stuff that is copyright-protected, which GitHub will also delete after a while, or I could misuse the archive as my private backup and store encrypted blobs on GitHub, which you would eventually back up for me.

So there are, I think, two sides to the question. The first side is: do we really archive only stuff that is free software and that we can redistribute, and how do we manage, for instance, copyright takedowns? Most of the infrastructure of the project is under French law, and there's a defined process for copyright takedowns in the French legal system. We would be really annoyed to have to take down content from the archive. What we do, however, is mirror information that is publicly available. Of course, I'm not a lawyer for the project, so I'm not 100% sure of what I'm about to say, but what I know is that under the current French legislation, if the source of the data is still available, so for instance if the data is still on GitHub, then you need to have GitHub take it down before we have to take it down.

As for misuse: we are not currently filtering content, so the only thing we do is put a limit on the size of the files archived in Software Heritage. The limit is pretty high, like 100 megabytes or something. We can't really decide ourselves what is source code and what is not, because, for instance, if your project is a cryptography library, you might want some encrypted blocks of data stored in your source code repository as test fixtures, and then you need them to build the code and make sure it works. So how would that be any different from your encrypted backup on GitHub? How could Software Heritage distinguish between proper use and misuse of the resources? I guess our long-term goal is to not have to care about misuse, because it's going to be a drop in the ocean: we want to have enough space and enough resources that we don't really need to ask ourselves this question. Basically. Thanks.

Other questions? Have you looked at some form of authentication, to provide additional assurance that the archived source code hasn't been modified or tampered with in some form?

So, first of all, all the identifiers of the objects inside the archive are cryptographic hashes of the contents we've archived. For files, for instance, we take the SHA-1, the SHA-256, a BLAKE2 hash and the Git-modified SHA-1 of the file, and we use those in the manifests of the directories. The directory identifiers, in turn, are a hash of the manifest of the list of files inside the directory, et cetera, et cetera. So, recursively, you can make sure that the data we give back to you has not been altered, at least by a bit flip or anything like that. And we regularly run a scrub of the data we have, so we make sure there's no rot inside our archive.

We've not looked into attestation, as in making sure that the code we've downloaded is what the developer wrote: we're not doing anything more than taking a picture of the data and saying we've computed this hash. Maybe the code that GitHub presented to Software Heritage is different from what you uploaded to GitHub; we can't tell. In the case of Git, though, you can always use the identifiers of the objects you've pushed: you have the commit hash, which is in itself a cryptographic identifier of the contents of the commit. And if the commit is signed, the signature is still stored in the Software Heritage metadata, so you can reproduce the original Git object and check the signature. But we've not done anything specific to Software Heritage in this area. Does that answer your question? Cool.
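The per-file checksums mentioned in that answer can all be recomputed with Python's standard library. Here is a small sketch of how one could verify a file against the archive's metadata; the field names mirror the hashes listed above, and the Git-style hash prepends the "blob <length>\0" header that Git itself uses.

```python
import hashlib

def content_identifiers(data: bytes) -> dict:
    """Recompute the four checksums stored for a file content."""
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "blake2s256": hashlib.blake2s(data).hexdigest(),
        # Git hashes a "blob <length>\0" header plus the bytes, so this
        # matches the blob id Git itself would assign to the same file.
        "sha1_git": hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest(),
    }

# If every recomputed hash matches the archive's metadata, the content
# has not been altered in storage or in transit.
print(content_identifiers(b"hello world\n"))
```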
Other questions? That's one in front.

So it's partially a question, partially a comment. Your initial idea was to have a telescope, or something like that, for source code. For now, to me, it looks a little more like a microscope: you can focus on one thing, but not much more. Have you started thinking about how to analyze entire ecosystems? For example, now we have Django 2, which is Python 3 only, so it would be interesting to look at all the projects using Django and see when they start moving to it. We would need to analyze thousands or millions of files, with SQL-like queries or some MapReduce jobs or something like that.

Yes, we've started. The two initiators of the project, Roberto Di Cosmo and Stefano Zacchiroli, are both researchers in computer science, so they have a strong background in mining software repositories and doing large-scale analysis of source code. We've been talking with research groups whose main goal is to do analysis on large-scale source code archives. One of the first mirrors of the archive outside our control will be in Grenoble; there are a few teams there that do large-scale research on source code, and that's what the mirror will be used for. We've also been looking at what the Google open source team does: they have this big repository with all the code that Google uses, and they've started to do large-scale analysis for security vulnerabilities and other issues, with static and dynamic analysis of the code, and they've started pushing their fixes upstream. That's something we want to enable users to do. It's not something we want to do ourselves, but we want to make sure that people can do it using our archive, so we'd be happy to work with people who already do that, so they can use their knowledge and their tools on our archive. Does that answer your question? Cool.

Any more questions? No? Then thank you very much, Nicolas. Thank you.