Welcome to Monday, and welcome to the Software Heritage talk by Nicolas Dandrimont.

Thank you. So, hi, I'm Nicolas Dandrimont, and I will talk to you about the Software Heritage project. I'm really, really glad to be here and to be able to talk to you about Software Heritage. It's been a long year of work, and you'll see that it's just the beginning.

Software is very pervasive. It's at the heart of our society; we use it everywhere. We have software in our pockets, we have software everywhere around us, we have software in our bodies, and it's really at the heart of technology. Appliances in a house contain 10 million source lines of code, phones contain 20 million lines of code, and cars can contain up to a hundred million lines of code. And it's not going to stop anytime soon: we're starting to put software in every single thing that goes to market right now.

Software is the mediator through which we access all of our knowledge. Information, one of the main pillars of modern societies, is handled with software, and software is critical to reproduce research. Software must therefore be collected, referenced, and made accessible, because it embodies our knowledge and our cultural heritage.

Software is also really fragile. Have you tested your backups recently? Did you run git fsck on your repositories? Did you have an account on Gitorious or on Google Code? Both of those have shut down. And software is scattered all around: we have several forges (GitHub, GitLab, Bitbucket, SourceForge, Alioth), your personal home page, CDs in drawers. Software is really everywhere, and there is no single standard way to retrieve it. There is no uniformity.
There is no stability: URLs change, often. Even in Debian we change the URLs to our repositories quite often. Software migrates, it moves from one host to another, because hosters close, hosters change policies.

So with the Software Heritage project, what we are trying to do is to collect, organize, preserve, and share all the software source code that machines execute. Software is executable yet human-readable knowledge, which is an all-time first; even hardware is software now, with FPGAs, with ASICs; and text files can be interpreted forever. Of course, software evolves over time, and the development history is really important to understand software. And software is really complex: it has a large web of dependencies, and we get millions of lines of code. So software is not just another sequence of bits. In a software archive, we need to make sure that we can interpret the software: its history, its dependencies, everything that we need to understand this tiny C file that exists on your hard disk.

So Software Heritage is working on the foundations. We're building base infrastructure to allow for all the applications that you could use a software archive for: cultural heritage, industry, research, education. We're laying the foundations so that people can build applications on top of our archive. We're preserving the world's software heritage by building a structured archive of all the world's software, so we can preserve the knowledge that has been put into science and technology over the years. We try to enable continued access to all digital documents and information, and we're creating a building block for thematic portals and collections of software.
We try to build something that can be used to create better software for industry. The idea is that a lot of companies have been running software for dozens of years and have completely lost the source of their software. This is unacceptable, because when you have a bug and you cannot fix it, lives can be at stake. So we need to ensure the long-term preservation of critical software. We intend to ease vulnerability tracking, to get more secure software, and we will make sure that software is traceable, so that we can better integrate software, and so that we can make sure that, for instance, when someone reuses a piece of code, they're using it correctly license-wise, for instance for GPL compliance.

Of course, we're also trying to support more accessible and reproducible science: a global library referencing all the software used in research fields. So we are completing the infrastructure for open access in science: you get open data, you get open access for papers, and you need to be able to have open source code to be able to reproduce the results. And we enable large-scale, verifiable software studies, with a comprehensive archive of billions of source files.

So that's a lot of talk, but really, what do we do? Meet the team. We are seven people who have been working on the project for different amounts of time. Roberto and Stefano have been working on it for the better part of two years now. Antoine and myself were recruited as engineers during 2015. This summer we've had two interns working on the front-end and back-end, and Guillaume has been advising us on industry issues.

Our stack: the hardware is hosted by Inria, which is a big sponsor and initiated the project. It's currently just one big hypervisor with a dozen virtual machines, and a high-density storage array with 300 terabytes usable. To be safe, we have another copy in another server room.
So we have a duplicate of this hardware, and we're working very hard to enable a mirror network, to make sure that our contents can be kept as long as possible.

Our software runs on Debian; every machine runs Debian. We use PostgreSQL for metadata storage, Python 3 and psycopg2 for the back-end, Flask for the web apps, and RabbitMQ for task scheduling. So we're using only free software, and we're building free software. Our values are those of Debian: we use one hundred percent free and open source software licenses. GPLv3 for the back-end code; AGPLv3 for the front-end, so it stays free forever; and Apache 2 for the Puppet manifests, because that's what the community of Puppet developers uses.

We really encourage bug reports and code contributions from everyone interested in pursuing our software preservation mission. And to do that, we are opening our forge today; as a thank-you to Debian, we timed the opening for this talk. So thanks to the Debian community for all it has brought to us.

So we have this infrastructure; what do we have inside it? It's really exciting to work on such huge amounts of files, of commits, of projects. We've replicated all the known GitHub repositories.
That's 22 million repositories. Those 22 million repositories contain 600 million commits and 2.6 billion unique source files. We're also importing all the Debian packages from snapshot.debian.org, and we have imported the GNU project's FTP archive.

But we didn't stop there. We have also been talking to Google, and we fetched all of Google Code before it closed, so we have the 12 million Google Code repositories ready to be imported. We have also talked to the Archive Team about Gitorious, and we have a copy of the two million repositories that were on Gitorious.

We're storing all our files, from all our git repositories, as loose files: each and every version of each and every source file is stored as one flat file in a file system. On top of this file archive, we have built a Postgres database for all the metadata. The metadata is basically one big directed acyclic graph, inspired by the git model. At the bottom layer we have contents, which are blobs, which are files. Those contents are stored in directories. Source code is organized in revisions (you make iterative changes), so those revisions are stored in our database, and releases of the software are stored in the database too. Then on top of that we have origins, which are the source repositories that we're getting data from, and occurrences, which record, for every point in time when we looked at a repository, the branches that were available, each pointing down to the objects at the bottom.

This is probably the biggest distributed VCS graph in existence: 120 terabytes of files on disk, a 3.1 terabyte Postgres database for the metadata, 2.7 billion files, 2.2 billion directories, 600 million revisions, 12 million people, 5 million releases. The biggest DVCS graph in existence.

So, what will we do?
We have a lot of planned features. Right now, our website allows you to look up contents by hash. The idea is that if you have a source file, you can put it in the box on our website and check that Software Heritage has archived it. What we want to provide is provenance information for all the contents, so that we can say: we have seen this file, at that date, on github.com.

We want to enable people to browse the contents, because putting all the software source code in a box is not what we want to do; we want to enable everyone to look at the source code. Basically, we are trying to build a Wayback Machine for software source code. We want to enable full-text search over our whole archive; that would be quite something: 2.7 billion source code files that you can search instantly. And of course, we want to enable people to download every single bit of software that Software Heritage has archived. Basically, what we could provide is a git clone for every data source that exists: whether it is a git repository or a Debian source package, we will enable you to git clone it from Software Heritage. And there are many more applications one could imagine, with all of the world's software at your fingertips in a single graph.

We also face a lot of technical challenges, because software changes all the time. We have to handle the backlog, and all the data that we have saved as one-shot imports, which is the GNU.org mirror and the snapshot Debian archive; we need to make sure that we can keep up with updates. We also need to make sure that we get the new repositories and commits on GitHub. Basically, we need reliable, standardized event feeds that we can tap into, and that people can provide to us, so that we can stay up to date.

And of course, not all the software is on GitHub, not all the software is in Debian, not all the software is in the GNU project. So we need to expand: we need
to discover and classify all the software sources. This means that if you have a forge at your company and you do open source, we can get your software. If you're a Linux distribution and you have a forge, we need to get your software. If you know of someone's web page where software is released, we need to get it. We need to get everything.

And of course, not everything is in git, or in a tarball, or in a Debian package, so we need to make sure that we have importers for all the version control systems. We have started work on an SVN importer; we need Mercurial importers, Darcs, whatever. It's a wonderful playground if you have time to help.

So, how do you help? Our forge is open at forge.softwareheritage.org. You can subscribe to our mailing list, swh-devel@inria.fr; the link to the subscription page is on the slide. And you can take a look at our wiki, where we store the public information about the project. Your company or your organization can join us as a sponsor; we are welcoming support from everybody.

Inria initiated the project, and Inria enables me to be here today. Inria is the French institute for research in computer science. Inria contributed to the birth of the W3C; 4,500 people work there, including many prestigious scientists, and recently Inria has worked on TLS vulnerabilities and lots of other things that have been made public. Inria is fully supporting the bootstrap phase of Software Heritage, but we do need everybody's help. If you think your company can help us, more info is available on our website.

Software Heritage.
It's a revolutionary reference archive of all the software ever written, and a unique complement to development platforms like GitHub. We're building an international, open, non-profit, mutualized infrastructure, and we're ready to work with you. If you have any questions, feel free to ask right now; or you can contact us by email, and you can look at our website. Thank you very much for your attention.

Q: The software you're using for this, that has just been written: do you think it could also be used for snapshot.debian.org, as the front end? The design you were talking about sounded very similar to what we already have, which is not worked on very much at the moment. Maybe it would be better for us to be using the same code base?

A: Currently we're using snapshot.debian.org as a data source. The main difference between what's been done with snapshot.debian.org and what we're doing in Software Heritage is that we're unpacking all the source files, which snapshot.debian.org doesn't do; snapshot just has a pool of files that existed on mirrors.

Q: Do you also keep the tarballs?

A: So, yes, right now
We're keeping the tarballs. We're not sure that we're going to do that long term, because tar output changes over time, as we know with pristine-tar, for instance. We want to make sure that the software that is available now is still available in 10 years, in 50 years, in a hundred years, and really the only thing that allows us to do that is to store the plain source files.

Also, a big difference with snapshot.debian.org is that snapshot.debian.org stores binaries. We do store some binaries, because people put everything and anything in their git repositories, but we are not really interested in those. We have a file size limit, which is currently set at 100 megabytes, and we are not importing anything bigger than that, because it is very, very probably not source code. So we don't currently have the infrastructure to store big files.

Q: A related suggestion: the Debian derivatives census is downloading source packages from a whole bunch of derivatives. Maybe you'd like to store those files as well?

A: Very much, let's talk about that later.

Q: It sounds pretty much like you did your own storage, writing to disk directly. Have you investigated using cloud technologies like Swift, GlusterFS, Sheepdog, whatnot?

A: Yes, and Ceph and CephFS. So, we started with a very limited budget, so we had to get the densest storage for our price tag, basically, and we went for the very simple solution. We've optimized for data ingestion, so we have been able to import 2.7 billion files in a year, but of course we need to make sure that we can retrieve those files too, which is currently not very efficient with our storage. So yes, we are starting to investigate other storage options. The main issue we have with file storage is that we have 2.7 billion files, and the median file size is three kilobytes.
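To make the loose-file storage concrete, here is a toy sketch of a content-addressed store of the kind described, sharded by hash prefix the way git's loose object store is. The two-level layout and the SHA1 choice are illustrative assumptions, not the actual on-disk format:

```python
import hashlib
from pathlib import Path

def loose_path(root, data):
    """Map a blob to its location in a sharded content-addressed store.

    The first hex digits of the content hash pick two levels of
    subdirectories, so billions of tiny files spread over ~65k
    directories instead of piling up in a single one.
    """
    hexid = hashlib.sha1(data).hexdigest()
    return Path(root) / hexid[:2] / hexid[2:4] / hexid

def store(root, data):
    """Write a blob at its content-addressed path. Idempotent:
    identical contents always land at the identical path."""
    path = loose_path(root, data)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return path
```

With a median file size around three kilobytes, every stored object still costs at least one inode and one filesystem block, which is part of why retrieving and replicating billions of loose files is expensive and why other back-ends are being investigated.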
So the files are very, very tiny; they are source code files. For instance, the git storage model breaks down very badly with so many files. We need people who know storage and who can help us improve on that.

Q: Did you investigate archiving language-specific repositories, like CPAN and PyPI, Maven, these kinds of things?

A: Currently we have focused on GitHub because, basically, it was low-hanging fruit: it's very easy to clone a git repository, and it's very easy to unpack and make sense of the metadata that is in a git repository. Of course we want to archive everything, and getting the language-specific forges is a very important step towards that, so we're welcoming any help in doing that.

Q: Very interesting project. Not all software written by scientists is open source, and not all the data they publish is open data. Do you have training material, training courses, for scientists and for students, to convince them?

A: Roberto Di Cosmo, who is leading the Software Heritage project, has done a lot of outreach in the open access community, and in the research community in general, to underline the importance of writing free software to enable reproducible research. So yes, I think we do have some training materials that must be available somewhere.
I think you should send an email to info@softwareheritage.org, and I'm sure Roberto can point you to some material.

Q: This is a somewhat ridiculous question, but I think this is such an important project for humanity: are you planning for the apocalypse? For instance, by planning to have an occasional snapshot put in secure facilities, possibly with a computer that could be used to access it?

A: We haven't made any plans for the apocalypse yet, but our intern working on back-end storage is enabling us to have a leader/follower model for our storage. We need to put something in place to be able to replicate the database. Replicating the database has some issues, because we have something like four and a half million people in it, so four and a half million email addresses that are easy to pick out: just take the big list and you can spam everybody. So there are some considerations there. Making snapshots is of course a possibility; it's going to need a lot of storage, but it's certainly doable. For now, we haven't thought about it yet.

Q: Another security-related question. There is an interesting article about the forgery of the Gospel of Mary, or something like that. Since you run an archive, you might become the target of people trying to rewrite history, or even to remove documents, and in some cases that might mean deleting something entirely, to obfuscate the fact that it's been done. Do you have any mechanism to deal with that?

A: Every single identifier for the objects in our database is intrinsic, which means that it's a hash of the content of the object.
So, for instance, files are identified by their hashes. For directories, we write a manifest saying "this directory contains the file with such and such ID", and so on, and we hash that, and that is the identifier of the directory; and so on and so forth for all the layers of Software Heritage.

To be resilient to attacks, what we need to do is have mirrors everywhere. The idea is that if you copy information, you cannot remove it anymore, because it's everywhere, it's pervasive. So this is really what we want to focus on now: making sure that we have copies everywhere in the world, in every jurisdiction, so that if a government wants to take us down, or if someone wants to rewrite history, the history is available everywhere, and they just cannot physically do it anymore.

Q: You mentioned a software registry; is there any relation to the one presented in an earlier talk?

A: So, the software registry you were talking about... no, I wasn't at her talk, so I'm not sure I know what that registry is, but we can talk about it.

Is that everybody? Does anyone see any question from IRC, maybe? Okay. Thank you, everyone.