I'm sorry I cannot be there in person, but I'm happy to be able to join virtually. Let me share my slides so we can get started. Okay. So, hello everyone. Today we're going to talk about software preservation, in particular in the context of the Software Heritage project. But let me start with a more general notion, which I believe is dear to many people active in free software, and in general to people who care about the sustainability of various kinds of ecosystems. And that is the notion of the commons. The commons, which in Italian we call "beni comuni", are all those resources which are accessible to all members of society: various kinds of goods that are held in common and not owned privately. This is very well known in the context of the ecology movement, for resources such as air, water, and a habitable earth. But it's also relevant in the context of digital goods. In particular, there is the notion of the software commons, which is defined as essentially all computer software which is available at little or no cost and which can be altered and reused with very few restrictions. And of course at this conference we are all into free and open source software, so the resemblance between this definition and free and open source software should be striking for most of the people here. The first point I want you to think about is that every time you, or anyone who contributes to free and open source software, releases a line of code under a free and open source license, that line of code becomes part of a greater body of digital goods, which is the software commons. So we are all together creating and maintaining this body of digital knowledge. And if you agree with that, there is a legitimate question of whether, aside from contributing to it with our own lines of code, we are taking good care of this body of code. Are we making sure there is a sustainable future for the preservation of all this software that we are, in a way, producing together? In fact, there are reasons for concern, because like any kind of digital information, software, and free software in particular, is fragile: it can disappear. It can be distributed today from a place that tomorrow might no longer be there. For instance, there have been situations in which large forges operated by for-profit companies are no longer available today, for entirely legitimate business reasons. Gitorious, which some of you might have used, no longer exists. Google Code, which Google operated for a while as a forge for collaborative software development, no longer exists either. Bitbucket still exists, but it no longer hosts some kinds of repositories, such as Mercurial repositories. So all the places we use today to produce and collaborate on software are here with us now, but are not necessarily going to be with us in the future, if you consider a long enough period. Imagine trying to find the history of a website: most of you probably know that you can use the Internet Archive to retrieve old versions of a web page and see how it looked years ago. But if a repository on GitHub disappears, or if GitHub as a whole disappears 50 years from now, where do you go to retrieve the source code that is hosted there today?
And sure, if you have used Git you know that Git is a distributed version control system, so you might say that someone should have a copy of that repository somewhere. But that will not necessarily help you when you need to find it, if you don't know who has a copy of the repository. So this is the reason why, a few years back (the idea dates to 2015, and we announced it publicly in 2016), we launched the initiative called Software Heritage. The idea of Software Heritage, in a nutshell, is to collect, preserve, and share with everybody who needs it the entire body of software which is available in source code form. And we do this to cater for different use cases. The first use case is to be a reference catalog: a place where people can find and reference all software source code. If you need, for instance, to reference something that has been released as source code, you can go see if it exists and obtain an identifier for that piece of software. It is also meant to be an archive for archival's sake. If we agree that together we are creating an important body of knowledge stored in software source code, it is important to avoid that knowledge being lost. The idea here is to create a long-term place where, if something disappears from its current hosting location, you can go retrieve it in the future, no matter how far in the future. And finally, given that by my main profession I am a researcher, I am kind of envious of what our colleagues in physics can do in building amazing research infrastructures that are shared and used by researchers around the world when they need them. The key idea on this point is to build a research infrastructure where researchers who want to run experiments on source code itself can actually do that without having to crawl the software themselves. There is a body of research, empirical software engineering, in which researchers often want to run analyses on vast bodies of source code. Today they cannot do that, so they work on small samples. Here the idea is that you have a place where all the software source code that has ever been published is available, so you can run your experiments in a reproducible way on the entire body of software, then go home and analyze your data, while others can run similar analyses by themselves. So this is the general view of why we're running Software Heritage and the kinds of use cases we want to support. We're also doing this in a principled way: we want to be a piece of infrastructure that is helpful for the use cases I mentioned, but we're doing it following some key principles. In terms of technology, the technology we are building to create and maintain the archive is itself entirely free and open source software, and the project is run as a classic open source software project. You can come, see our code, and propose patches; they are reviewed and can be accepted and become part of the infrastructure. We're also building it in a way that is fully replicated, because we know well that the only way to make sure something will live on forever is to have enough copies of it. So we operate multiple copies of the archive ourselves, and we're also building a network of mirrors operated by other actors around the world, so that each one of them has a full copy of the archive as we have it.
In terms of the content we archive, we are principled about the fact that all the pieces of code we archive are identified by intrinsic identifiers; I'm going to talk a little bit more about those in a bit. We also make sure not to store opinions about the artifacts we archive, but only facts about them. Everything we store in the archive has information about where it comes from, and instead of storing opinions like "this software appears to be under the GPL", we store facts like "the license file in the top-level directory of the software says it is under the GPL", or "we have run this tool, and this tool, run with this configuration and this version, says that the software is under the GPL". In terms of organizational infrastructure, we are a non-profit initiative and we are multi-stakeholder: we have funding from public bodies, funding from sponsors, funding from donors, and we believe this is the best way to minimize the influence that any single actor can have on a mission that we believe is for the good of humanity. In terms of what we actually archive, in case you're curious about the details: we replicate the data model that you find today in most modern version control systems. We crawl places like a Git repository, a Subversion repository, or packages from source code distributions, and we store all the versions of all the source code files we find there. So not only the version that happens to be at the most recent commit in a given repository: we really crawl the history of each repository we encounter and store all the versions of all the files in there, since the very first version. We also store all the revision metadata: for instance, who was the author of a commit, what was the commit message, and when was the commit done in terms of timestamps. If there are cryptographic signatures of releases, we store all of those, and also all crawling information, so essentially where and when we archived any single element of source code among the blobs we have stored in our archive. We do that in a data model which is canonical and independent from any specific VCS technology. So it's not like we only run some git clones: we really crawl Git repositories and store them in a common model, and we do the same for Subversion and for Mercurial, so that if 50 years from now everyone migrates to a different technology, our data model will stay the same and we will not have to re-crawl everything just because the world moved to a newfangled version control system. We currently do not archive other parts of classic open source projects, like websites, wikis, issues, or mailing lists, because there are other initiatives taking care of archiving those. Our plan is to focus on source code and its development history, and to make it easy to reference what we have archived, so that other places, like some sort of semantic wikipedia of software, can reconstruct the history of individual open source projects, saying "the source code is archived at Software Heritage, the mailing lists are archived over there", and so on and so forth. In terms of data flow, we are essentially a big crawler. We do not crawl the entire web, but we have curated lists of places that we know distribute source code: software source code forges, to begin with.
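To make that canonical data model a bit more concrete, here is a minimal sketch in Python of the kinds of records involved. The type and field names are illustrative only, not Software Heritage's actual schema:

    from dataclasses import dataclass
    from typing import Dict, List

    # Illustrative sketch of a VCS-independent data model; all names here
    # are hypothetical, not the archive's real schema.

    @dataclass
    class Content:            # a source code file (blob)
        sha1_git: str         # intrinsic identifier of the raw bytes
        data: bytes

    @dataclass
    class Revision:           # a commit, whatever VCS it came from
        id: str
        author: str
        message: str
        date: str             # commit timestamp
        parents: List[str]    # ids of parent revisions
        directory: str        # id of the source tree at this revision

    @dataclass
    class Snapshot:           # full state of a repository at visit time
        id: str
        branches: Dict[str, str]   # branch name -> revision/release id

    @dataclass
    class OriginVisit:        # crawling information: where and when
        origin_url: str       # e.g. the URL of a Git repository
        date: str
        snapshot: str         # id of the snapshot taken at this visit

The same Revision record can come out of a Git loader, a Subversion loader, or a Mercurial loader; that is what makes the model independent of any single VCS.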
So these are all the forges you can mention: GitHub, various instances of GitLab, Bitbucket; you can have your own instance of a self-hosted forge added there. We also crawl distributions that distribute source code, like Debian source packages, and package manager repositories like npm, PyPI, or CPAN. So we have all these places, and we develop, as time goes by, components that are able to visit those places and list all the packages or all the repositories stored there. For each one of them we create a data point which is a software origin: essentially a URL that identifies a place where source code is distributed from. Then periodically we visit all those software origins with a dedicated loader component, like a Git loader, a Mercurial loader, or a loader for Debian source packages. And every time we crawl a software origin, we see what is new there and we store it in the archive. The archive has a peculiar data model itself: it is a giant graph called a Merkle DAG. It's a structure you might be familiar with if you have looked into ledger technologies, or into the internals of modern distributed version control systems like Git itself. It is a structure with some interesting properties, in the sense that it deduplicates everything. In this structure, the first time you see a file you store it in your graph as a new leaf; but if in the future you encounter the same file in a gazillion other different places, you do not store it again: you just add a link to the file you stored the first time. And this goes all the way up. The same directory of source code files is stored only once. The same commit, which we call a revision in our data model, is also stored only once. So if you have one million repositories that all have the same commit, maybe because they are forks of the same project at a given point in its history, that commit is also stored only once. The same goes for software releases, and also for the entire state of a version control repository, which we store in objects called snapshots. A snapshot is like a picture that we take of a Git repository, or of any kind of software repository. If that repository is forked 1000 times on GitHub, or even on a different platform like GitLab or any other forge out there, we store the full state of the repository only once, avoiding wasting resources in storing it over and over again. In addition to this Merkle structure, we have what we call software origins, which are just URLs pointing to the state of the repository that was found at that URL at the time of the last visit. What is interesting about this data structure is, on one side, the fact that we deduplicate everything, making it feasible to do the archival work that we do, but also that we are materializing a unified view of the entire software commons. We can see how software that you developed yourself and published on your personal Git repository has maybe later been used as a basis for something else, in research work or in some enterprise open source software, and we can see who has used your software, if they based their public work on it, and see that they have created a new version of it. And this is not just theory; it is something that exists for real.
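To give an idea of how that deduplication works, here is a minimal sketch in Python, under the simplifying assumption of a flat in-memory store (the real archive is of course far more involved): objects are keyed by a hash of their own contents, so storing the same bytes twice is a no-op, and a directory's identifier is computed over its entries' identifiers, so identical subtrees collapse too.

    import hashlib

    store = {}  # object id -> object; a stand-in for the archive's storage

    def store_content(data: bytes) -> str:
        # Intrinsic id: a hash over the bytes themselves (git-style blob hash).
        oid = hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()
        store.setdefault(oid, data)          # identical bytes stored only once
        return oid

    def store_directory(entries: dict) -> str:
        # A directory's id is a hash over its (name, child id) pairs, so two
        # identical source trees get the same id and are also stored once.
        # (Simplified: real manifests also include permissions, entry types.)
        manifest = b"".join(
            name.encode() + b"\x00" + bytes.fromhex(oid)
            for name, oid in sorted(entries.items())
        )
        oid = hashlib.sha1(b"tree %d\x00" % len(manifest) + manifest).hexdigest()
        store.setdefault(oid, entries)
        return oid

    f1 = store_content(b"int main(void) { return 0; }\n")
    d1 = store_directory({"main.c": f1})
    f2 = store_content(b"int main(void) { return 0; }\n")  # seen "elsewhere"
    d2 = store_directory({"main.c": f2})
    assert f1 == f2 and d1 == d2 and len(store) == 2  # one file, one directory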
So you can go to archive.softwareheritage.org and you will see there what our current data sources are. You will find that we crawl the most famous forges, many distributions, and many package managers, and that we have the archives of software repositories and forges that no longer exist. For instance, we have a full copy of Google Code and of Gitorious, we have recently started archiving SourceForge, and we retrieved all the Mercurial repositories from Bitbucket when they stopped supporting Mercurial. All these are sources you will find indexed at archive.softwareheritage.org. It is a pretty big archive: we have more than 11 billion unique source code files archived and more than 2 billion unique commits, coming from more than 160 million projects. On disk it's an archive of about one petabyte. So it is big; not as big as a video archive, for instance, but still substantial, not something you can easily fit on your laptop. And if you are into graphs, it's also a pretty big graph: 20 billion nodes and 200 billion edges. It's not as big as the graph of the web, but it's still a substantial graph. It is the largest public source code archive in the world, and of course it's growing daily as crawling goes on. So let me just show a few examples of how you can use it; I'm going to share a different window here for a moment. Okay, so this is something you can do yourself: you can visit archive.softwareheritage.org today, and you will have a classic search box that by default searches on the URLs of software origins. This is a bit naive, but it works pretty well in most cases, because usually in the URL you have the organization name and the project name. For instance, you can search for Apache httpd, and you will see here the list of results. The first hit is github.com/apache/httpd, and if you click on it, what you will find is essentially the equivalent of a repository browsing interface; but you're not browsing the live version of the repository, you're browsing a version of the Apache httpd repository archived on November 3rd, so 10 days ago. And if you go here on "visits", you will see all the different archival visits we've made of the repository, starting from 2015, when we started archiving, up until today. You can browse the directory as usual. I'm not familiar with the source code of Apache myself, but for instance let's go through the modules; there's a caching module here, and at some point we will find some code that comes directly from the Apache httpd code base. You can do interesting stuff like referencing individual files using our persistent identifiers: if you click here on "permalinks", you will find a link that you can copy, and this is an identifier that will always refer to this specific version of the file mod_cache.c. You can share it with others, and it will allow anyone to retrieve this precise version of this source code file, forever, within the Software Heritage archive.
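For reference, the permalink you get that way is a qualified Software Heritage identifier (SWHID). Schematically, with a placeholder hash, it looks like this:

    swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
     |  |  |  |
     |  |  |  +-- intrinsic hash of this exact object
     |  |  +-- object type (cnt = content; also dir, rev, rel, snp)
     |  +-- identifier schema version
     +-- "swh" URI scheme

Context qualifiers can be appended after semicolons, for example ;origin=<URL of the repository it was found in> and ;path=<path of the file inside the source tree>, which the permalink box can also produce.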
You can also do other stuff. For instance, if you go back to the top bar, you have here a "save again" button: this can be used to tell our crawlers "please prioritize archiving this repository again", so that the next crawl happens not whenever the crawlers have time for it, but within the upcoming few hours. You can also see all the branches and all the releases of this repository. Here you can navigate through the branches and choose a different one, as you would do in a generic version control system interface. Something else you can do is use this interface, which is "save code now": instead of asking our crawlers to archive again something that has already been archived, you can ask them to archive a repository you care about. There is for now support for Git, Mercurial, and Subversion; we are rolling out support for CVS soon; and there is also support, not yet open to the public but already available to staff, for archiving individual tarballs and zip files that people might have on institutional websites or on their own pages, for on-demand archival of specific pieces of code you care about. So, back to the rest of my presentation. This is essentially the experience you can have as a user. In addition to that, if you are a developer, you might want to integrate with our API. It's a RESTful API which is essentially the equivalent of what I've shown you you can do with the browser. You can use it to search for stuff that we have archived and then browse it down: you can see all the visits of a given repository, find the top-level snapshot for each of them, and go down to the revision, to the directory, to the file contents, and so on and so forth. You can also retrieve metadata for all the archived objects; for instance, we detect the licenses of the files we archive with FOSSology, and you can retrieve those metadata. You can also get all crawling information, like when we last archived the repository you care about and where its branches were pointing at the time. I'm not going to show you the details of the API; they are fully documented on the web. Just click on "Web API" if you are a developer and you will find all the details you care about. I've also mentioned these intrinsic identifiers of all the stuff we archive, which you can obtain with the permalink button on the web interface. These are actually becoming pretty popular identifiers in the ecosystem. For instance, if you care about software bills of materials that include free and open source software, there is a standard called SPDX which is used across industry to share this kind of information, i.e., which open source components are contained in a product I'm buying or putting on the market, and you can use Software Heritage identifiers in that kind of document. On Wikipedia, you can associate to software projects a property that points to their software repository, and you can use Software Heritage identifiers there as well. It's also an IANA-registered URI scheme, so what we are expecting is that in the future people will be able to just put those URIs into their browser, or any other application that understands IANA URI schemes, and get automatically pointed to the version of that piece of code archived at Software Heritage.
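For developers reading along, here is a minimal sketch of what driving that web API looks like from Python, using the requests library. The endpoints below are the ones documented for the public API at the time of writing; do check the Web API documentation before relying on them.

    import requests

    API = "https://archive.softwareheritage.org/api/1"
    ORIGIN = "https://github.com/apache/httpd"

    # Latest archival visit of an origin, and the snapshot it produced.
    visit = requests.get(f"{API}/origin/{ORIGIN}/visit/latest/").json()
    snapshot = requests.get(f"{API}/snapshot/{visit['snapshot']}/").json()

    # Resolve the HEAD branch (an alias) down to a revision, i.e. a commit.
    branches = snapshot["branches"]
    head = branches["HEAD"]
    if head["target_type"] == "alias":
        head = branches[head["target"]]
    revision = requests.get(f"{API}/revision/{head['target']}/").json()
    print(revision["author"]["name"], "-", revision["message"][:72])

    # "Save code now" is also exposed over the API: ask the crawlers to
    # (re)archive a repository you care about.
    requests.post(f"{API}/origin/save/git/url/{ORIGIN}/")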
Back to the slides: here in this slide I have two links. Here there's a link that, if I click on it, will bring me directly to the famous implementation of the fast inverse square root in Quake III, and it's just an swh:1:… intrinsic identifier. Same here: this link will bring you to a piece of the Apollo 11 source code, which of course we have also archived. You can use those identifiers on your own code. You can install the Python module that computes these identifiers for source code you care about, with pip install swh.model, and use it to compute the identifiers of software you have on your local machine, okay? Of course, as long as the software is only on your local machine, you will not necessarily be able to find it in the archive itself, unless someone else has had it archived. But if you discover that software you care about is not archived yet in Software Heritage, you can use the save.softwareheritage.org interface, which I showed you before, to request archival of that piece of code, so that in the future everyone who cares about your software will be able to find it on Software Heritage as well. And the last technical thing I want to show you today is that we also have the Software Heritage file system, a virtual file system built on top of FUSE (file system in user space), which you can find on Linux, and which allows you to mount on your machine a piece of the Software Heritage archive as if it were locally available, okay? The file system itself is of course implemented as open source software; here you have links to its implementation and its documentation, and we also have a paper describing it, if you are interested in the software engineering aspects. But if not, let me just give you an example. You can pip install swh.fuse, which will install the Python module implementing the virtual file system. Then you mount an empty directory that you create for the occasion: you create the directory, you mount it, and you will have in there a bunch of virtual directories from which you can start browsing the archive. For instance, imagine you know that a file you care about has a specific Software Heritage identifier like this one, maybe because you found it on the web, or because some paper or some other website was referencing code using a Software Heritage identifier, maybe Wikipedia for instance. Then you can simply do "cat archive/<identifier>", and this will show you the content of this specific file, which is one of the many possible versions of a classic hello world implemented in C. This is just for a single file; of course you can do more. Imagine you have this directory identifier here: you can cd into that virtual directory, do an ls, and you will find that there are 127 files in there, and this piece of code is in fact the code of the Apollo 11 guidance computer, which you can grep. So for instance, you can grep for "antenna" in a bunch of files, including this file, which happens to contain stuff about antenna positioning, and you will find comments from the original Apollo 11 source code, as if they were locally available on your machine. And this is without having to do any git clone, any retrieval of tarballs at all: it's all a virtual file system that operates over the network using the Software Heritage API.
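Tying the identifiers and the file system together: for an individual file, the intrinsic hash inside an swh:1:cnt identifier coincides with git's blob hash, so you can even compute one by hand. Here is a minimal sketch; the swh.model package mentioned above is the official implementation, and also covers directories, revisions, and so on.

    import hashlib

    def content_swhid(data: bytes) -> str:
        # For file contents, the intrinsic hash is git's blob hash: sha1 over
        # a "blob <length>\0" header followed by the raw bytes.
        digest = hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()
        return "swh:1:cnt:" + digest

    # Any local file you care about (the path is just an example):
    with open("hello.c", "rb") as f:
        print(content_swhid(f.read()))

If the resulting identifier resolves in the archive, that exact version of your file is already preserved; if not, that is what the "save code now" interface is for.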
You can do more: you can for instance cd into the virtual directory for a commit. This is a Software Heritage identifier of the "rev" type, for a revision, and you will see that in there you have a bunch of virtual directories containing the development history, such as the previous commits, and the metadata. The metadata are in JSON format, so you can use, for instance, the command line tool jq to retrieve the author name, date, and message of the commit you have retrieved. Here is the author name, this is the timestamp of when the commit was done, and here you have the commit message. Or you can do stuff like this: this is a specific commit of the jQuery project, I believe, so you can search for all the JavaScript files contained in this source code version and count all the lines of code in there; there are 10,000 lines of code in this specific version of the jQuery library. Last example: you can check all the branches that exist in a given project at the time of archival. You can use "swh web search", which is the equivalent of the web search I showed you. Here we are searching for a project called git-annex, which is a great distributed and open source replacement for Dropbox and similar services, by Joey Hess. You will find the project, and then you can cd into the virtual directory corresponding to it. You will find that, as of when this slide was written (there are more recent archivals now), there was only a single visit, taken in December 2020, and you can see where the master branch was pointing at the time that snapshot was taken. So this was it for my general presentation. Let me just tell you something about how you can help, and then I will be happy to answer your questions. You can help, first of all, as a user, by expanding archive coverage. If you find something which is not archived, some piece of free and open source software you care about that is not archived yet in Software Heritage, just go to save.softwareheritage.org, or click on "save code now" in the archive web UI, and ask our crawlers to archive that piece of code. That is very important, and it's a simple step everyone can do. You can help financially: we are a non-profit initiative, as mentioned, and we accept both donations from individuals and sponsorships from companies or institutions, so if you want to champion a new sponsor, that's very much appreciated. You can help as a coder, if you are into development, by joining our community of open source developers. All the classic information, like where is the code, where is the chat, where is the mailing list, where are the bugs, and so on and so forth, is available on the website, on the developer community page. Or you can take the next step and consider working with us: we have opportunities for students, both practical and more research-oriented internships, and we also have job openings for both technical and management profiles. So anything you can do to help with our mission will be much appreciated by us, and also by future generations. Thanks a lot for listening, and I'm available to answer all your questions.

Hi, I was wondering, as you gave this presentation: how could such a large corpus of information be classified? The equivalent of large libraries, where there's a taxonomy system that says, for example, there are the standards of the Library of Congress in the US, with different subjects like medicine, astronomy, philosophy, and so on.
So I was wondering what the equivalent could be for software?

Right, so this is essentially a question about metadata, and there are several standards for software metadata. We are working with a lot of digital librarians and people developing ontologies for software, and there are several ways to do this. The way we are attacking this problem right now is that we are mining, separately from archival, with an asynchronous process that runs periodically, all the software metadata contained in the software itself, and we are exhibiting it as an ontology of all the software we have archived. I haven't shown you, but if you go to the search page, you have a button that says "search in metadata", instead of searching in the software origin URL, and you can search for software using that ontology. This is the first step we're taking right now: allowing users to find software on the basis of the metadata declared by the software authors. But of course there is also the matter of how you cross-reference what we archive with the work of people who are curating the history of software, and on that front we are working, for example, with the Wikidata folks to create links from Wikidata resources about software to the archive, helping essentially with the collective curation of all the software we archive.

Hi, I'm Gianluca Boyano, and I want to ask about the main concerns regarding binary object storage versus source code, because I've seen that in the introduction the main topic is source code archiving. So what are the main concerns about binary objects? And I have another question, I don't know if I can follow up with it. Yeah? Maybe, yeah. Thank you.

Yeah, so about that one, very quickly: we took a decision early on in the project not to try ourselves to exclude anything from the places we're archiving. We decide that a place is primarily used to distribute source code; if that place also happens to include binary files, we also archive those, at the moment. For instance, on GitHub you can find many repositories where people are, let's say, not really properly also storing binary files in there. For now we are archiving those too. It has not been a major problem so far, but it means that in the archive you will also find incidental binary files, which is not creating a major problem at the moment. So maybe that leaves time for your second question.

Okay, thank you. The other question is about something we have all seen: the attempt by GitHub, for example, with the Arctic Code Vault, to archive the history of code and all the main contributions. They stored the source code on film made by a Norwegian company. So what are the limits of storing and archiving source code on, I don't know, a hard disk or other durable media? What are the main concerns?

We are part of the archival program of GitHub; we are in partnership with them. And in fact, what they stored under the ice, to simplify, was only a very limited subset of GitHub: they only stored projects with a given number of stars, they only stored projects that were active at the time, and they only stored the most recent version of each project at the time of their archival, because that technology is really, really, really expensive.
However, it is important to also have archives that are offline, that are not stored on hard disks, as you say. The way we are attacking this problem is that periodically, once per year, we want to take essentially offline copies of the entire archive, store them on technology from which we cannot delete, and which would resist a hypothetical electromagnetic storm of planetary dimensions. We're doing that for now in collaboration with CINES, which is a large digital archival institution here in France, but any sort of offline archival copy that can be taken periodically of everything we have is really important, and it's part of our mission.