organizing committee, which has kept doing its wonderful job for seven years, with many more to come. So, I'm Alexios Zavras, and I do open source compliance at a large American company. Our talk is about outsourcing our wonderful legal obligations, which is a nice, very catchy title for big companies.

At Intel we deliver a lot of software. We are mainly a hardware company, as you probably know, and that makes my job much easier, because the company mentality is that software is just an enabler for us to sell more hardware. So it's pretty much the default to make software open source. We deliver a lot of software, and our software is always a combination of our own and open source software, right? As is typical today, a software product is around 80% software that you get from other sources and 20% software that you write yourself; if you're writing much more than that, you don't have a competitive advantage, or something like that.

Many of the software components that we use have a legal source code distribution requirement; I'm sure you've heard that before. In our case we also want to deliver sources for other pieces of software that may not carry this obligation, but where we still want to deliver the source.
So this is a reminder of the legal requirement. This is copied from GPLv2: when you deliver an executable, you have to deliver the complete corresponding source code. As it says exactly, complete source code means all the source code for all the modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. This is the basic idea in copyleft licenses: you have to deliver the complete source code of the software you're delivering the executable for.

The wording may change from license to license, and different terms may be used, but the idea is the same. In GPLv2 it's called "complete corresponding machine-readable source code", and the verb you have to fulfill is that you have to "accompany" the executable with it. In GPLv3 the terms were changed to "Corresponding Source", and you have to "convey" it. In Mozilla version 2 they talk about the "Source Code Form" of the software, and it has to be "made available", which is a passive term. And in EPLv2, the Eclipse Public License, it's also "source code", and it also has to be "made available", right?

So, trying to fulfill these obligations, we were thinking about how to do this. In an ideal world, or in a very well-organized company based in Germany, you would have a foolproof process in place that everyone follows strictly; you set it up once and everything works magically for the rest of your life. That is not always the case. It doesn't happen in practice, especially in a very large company spanning the whole globe, as we are. People change roles or leave the company, reorganizations happen much more often than you would think, things get forgotten, and people completely lose the corporate history of different products and so on. So you have to make sure that you always deliver the complete
corresponding source code, because the product, as you probably know, may have a very long life, much longer than the original team that created it.

So, trying to build this in-house, we started thinking about the different use cases. When we have to deliver source code, what exactly do we have to deliver? There are a couple of obvious use cases. The first is that we have to deliver source code that we wrote ourselves: we have it in a software package archive and we have to deliver it. The second use case is that we have to deliver a very well-known package, say a specific version of GCC, along with how we built it and all that. In the third case we have to deliver the same kind of well-known package at a specific revision level: not one of the published releases, but one specific revision we took because it hadn't been released yet. We have a single commit; in any case we have a unique identifier for a specific version of an open source software that is out there. The last bullet is a combination of these: we took this well-known open source software at a specific version, but we applied some patches to it, and that's what we are using. Because we have to deliver the complete corresponding source code of the binary, we have to deliver this source code too. And obviously, combining all of that, ideally you can have a bundle, because you might have more than one package in there. So these are the different use cases that we want to deliver source code for.

Translating this into functional requirements: we need to provide our own software packages, the source code that we wrote and that we have; we have to refer to well-known free and open source components that are on the internet, by release version or by a specific commit ID or some specific snapshot; and we want to combine the two, components with our own patches and with our own software. So the great idea that came to us is
can we outsource this fulfillment process? Instead of us doing it, can we ask somebody else to do it for us? The major question for this room is: is it compliant? Can we really outsource it without getting into trouble? We had lots of discussions with John and the FSF and lots of other people, and with many of our lawyers.

For example, in the GPL FAQ there is this wonderful question: "Can I put the binaries on my Internet server and put the source on a different Internet site?" The GPL FAQs are online for both version 3 and version 2, and in both of them the reply is yes. One says that you have to make sure that the source remains available, but you can have it somewhere else. And even in the older one, for version 2: if you make arrangements to have it always there, we think that it qualifies as "from the same place". So it doesn't have to be on the same server. So let's ask whether somebody else can do it. Thinking of all the trouble we go through in order to fulfill this requirement, wouldn't it be great if we could hire someone to fulfill this requirement for us? And we found that someone.

So that someone here is not actually me; I'm not going to host all the complete corresponding source code tarballs of Intel on my personal website. The idea is to use the Software Heritage project as a place that can work with actors like Intel to host those complete and corresponding source code tarballs. This talk is not about Software Heritage, so I will give you more pointers about the project at the end of the talk, or you can check out our keynote here at FOSDEM last year, which describes the project in detail. I will just go through the basics of the project that are relevant for the use case we're working on with Intel.

Software Heritage is a project whose mission is to archive the entire body of free and open source software source code that we can find on the internet. So we harvest the
best places where we know we can find source code, and we archive that. The mission, naturally, is collecting all that body of source code, preserving it for the very long term so that it doesn't get lost, and sharing it with everyone who needs access to the source code we have archived. We are focused on the single mission of doing the archival part, but we are meant to serve different use cases. There are cultural heritage use cases; there are industrial use cases, like the one we're discussing here today; there are scientific use cases, like offering researchers the ability to analyze all this code in a coherent way; and there are educational use cases. But the point is that we're only working on the foundation, rather than trying to implement the solutions for all the different use cases ourselves. And to maximize the chances of succeeding, given that the mission is fairly big, we're developing it in a completely transparent and open way. All the code we develop ourselves for the mission is free software, generally copyleft, and we're also setting this up as a non-profit endeavor, because we think it has a better chance of lasting for the very long term as a non-profit than as a for-profit.

By default, before this kind of adaptation for interesting industrial use cases, what we are is a crawler. We have a bunch of places that we know are meant to distribute source code, like forges, source packages in distributions, or source packages in language-specific package managers, and we crawl them. I won't go through the details of our architecture, but we're basically a crawler: we go to those places, we retrieve all the source code we can find, and we store it in a single data model where everything is deduplicated. If we find the same file on multiple forges or in multiple packages, we store it only once. Same thing for commits, because we also store the entire development history of all the source
code we can find. So this is our general architecture, and it's a real thing, not just theory: we have already assembled a very large and substantial archive of source code. The sources that we track day by day are GitHub and Debian for now; we have also ingested into the archive the entire history of Google Code and Gitorious at the time they shut down, and we're working on ingesting Bitbucket and adding other sources of source code. This is something like more than four billion unique source code files and about one billion commits, coming from more than 70 million origins, meaning Git repos or packages or that kind of thing. It's a pretty substantial archive, as you can see in the graph, and we believe it is today the largest archive of public source code that you can find on the internet. And of course it's growing daily, because every day we re-crawl those sources and find new stuff to add to the archive.

But this is the crawling part, the pull part. To satisfy the needs of distributors of products that contain source code, you need some sort of push way of adding stuff to the archive. So the idea here is that we now have a service, which we are opening up as a prototype, which is a deposit service. It is essentially for people who are partners of the project, because we don't want to become, you know, another Megaupload or something where people just go to store warez. So if we have an agreement with Intel to work on this, for instance, they will have credentials to access this service, and they can just push source code tarballs to the Software Heritage archive whenever they want. In technical terms it is an implementation of a protocol called SWORD, which is used by digital libraries to exchange articles or data sets that have been deposited by researchers on those services; we have reused it for depositing source code. It has a RESTful API that you can use to implement your own deposit tools, with a
command-line wrapper that you can use as a ready-to-use tool.

As an example, imagine you have the software tarball that Alexios was talking about. You create your tarball yourself: that's not something we're going to do for you, that's not a part you're going to outsource. But you have your tarball, and you have taken care of the fact that it is indeed the complete and corresponding source code tarball for your project. What you can do is add some metadata that is not in the tarball itself. For instance, if you have an internal identifier for that release, you can put it there; you can give the name of the project, the person responsible, so a bunch of additional information about that tarball. And you have a tool that you can use to push it to the Software Heritage archive. Initially we give you an answer that is essentially a receipt, telling you: OK, we have received your tarball, and you have a deposit ID, which here is number 11. You can use that receipt to monitor the status of the deposit process in the Software Heritage archive.

Why a status? Because we are not just going to keep your tarball; it's not an FTP service. What we're going to do is open up the tarball, look at everything that's inside, and integrate it into the rest of the archive. This way, when you go back to find your source code, you can see for instance where else it has been used, and see it in the context of all the rest of the source code we have in the archive. So, using that receipt number, you can check the status, and at some point (ideally, for a small tarball, it will take just a few minutes; it can be more for very large tarballs) we'll tell you: done. And we will give you an ID, which is a persistent identifier of the kind of object that has been created. So this number here is the persistent identifier, which we guarantee will always work and will be around for re-accessing your stuff later. With that identifier you can
reference stuff that has been archived: you have a URL that you can point your users to, and you can navigate the thing in the archive. For instance, even though I should never do demos, we are opening up the browsing interface to the archive for FOSDEM, so you can go to archive.softwareheritage.org, the password is 2018, and navigate the entire archive. Here is what you will see if you deposit a tarball, which you can navigate in a GitHub-like interface; and if it were a more complex object, like a Git dump for instance, you could see the revision history and that kind of thing.

You can also, of course, download it, because that's the point: you want to point your users to our archive to retrieve the source code. There is a service called the Software Heritage Vault, where you can request to download stuff. That too is an asynchronous service, because we deduplicate everything, so for very large objects it can take a while to collect all the files that you want to download. But again you have an API (no fancy wrapper yet, it's still a bit rough around the edges) to request the download of the object. You can request a notification for when the object is ready, and when it is, you go there, retrieve your object, and you have your tarball back.

So, to sum up: long-term hosting of complete corresponding source code might be onerous for large corporations; that is what I learned from Alexios and from many other open source people at big companies in the room. And it is OK under copyleft licenses to outsource the responsibility of hosting that tarball to a third-party site, provided you have some agreement and provided you make sure that the stuff remains there. Software Heritage is meant to keep software available for the very long term. We are doing crawling, so if what you want to offer is already part of the archive, that's fine: you just find the object in the archive and you can point
your users to it. If you have more stuff, which you usually do, because you have patched source code or additional source code, well, you can now push it to the archive, retrieve it later, and point your users to Software Heritage to find the source code. It's still a prototype, still a bit rough around the edges. For instance, we do not yet support the use case of partial deposits, in which you point to stuff that is already in the archive without having to upload a giant GCC tarball again; but that's in the works, it's something we plan to support in the future. All the tooling is free and open source software, because that's what we do. So if you're interested in being part of this experiment, you are more than welcome to come and talk to us after the talk. Thanks.

Rings a bell? Yeah. So what we want to do is preserve source code that is available, which might already be free software today or might not. We want to preserve it anyway, because it will become free software one day, for instance when copyright expires. Of course, we expect that in a use case like this one with Intel it will be free software all the way through. But the only thing we are doing is adding some automatic detection of metadata: we run some license scanners, for instance, and we have an API to fetch that metadata. For now we don't have human curation of it. That is something that can be added on top of Software Heritage, but it's not part of the archival part, nor of this use case.

So in this case the filtering... oh, sorry, the question is: do we worry about people uploading stuff to the archive that we do not want to store? In this case the responsibility of checking that they do not push crazy stuff, warez or whatever, is on their side, and that's why we are only working with selected partners for this kind of thing. Another example is our collaboration with open access archives for scientific papers: in those cases the researchers are
adding software to their papers, and they can push it to us, but that too is essentially their responsibility.

So the question is (thanks): can we go and search and find all the stuff that has been deposited by Intel? Yes. The reason why there is extra metadata, and the reason why it is stored as a commit rather than just as a tarball, is so that we have a place where we can easily add all the metadata, and that metadata will be searchable. Today it is not yet. If you go to archive.softwareheritage.org with the credentials I've given, you can search on all the other stuff, like the name of the repository, but we will open up the metadata too.

So again, we're talking about the software that... sorry, I'll repeat the question: are we not afraid that people will learn what we're using, basically? Think of it: this is the fulfillment service for the thing that we already provide. There is nothing more secret here; whoever gets it could already get it before.

So for us, being able to upload essentially incomplete stuff, with references to stuff in the Software Heritage archive, is just a convenience step, but we will offer the entire thing for download, OK? So if you have found something in the Software Heritage archive, it means it's already there, and when I create the tarball or the Git repo or whatever, I will do the merge of the things to get the GCC version. And he will make sure, he will understand: hey, I already have that in my archive. We will not be searching to see if he already has it in order to upload just the patches, for example. Imagine, for instance, the next releases of the same product: if you want to avoid re-uploading dependencies, you could. That's really up to you. So that's a feature that's available, and whether to use it or not is up to you.

So the question is: do we pull GitHub as a backup? Yes, we do. It's crawling, and it's scheduled, so it might be lagging,
but that's part of our current coverage. Not only that: there is a documentation page on GitHub mentioning archival projects, and they point to us. They don't call us a backup, but we work on archiving it. That's the public part, of course.

The question is: do we see commercial partnerships like this as a way of making the project sustainable? I should thank Intel here: they are a sponsor of the project. And yes, of course this is part of it, but right now it's more about developing the foundation so that we can offer this kind of opportunity to other partners.

So what I said is that we have no guarantee right now that what we archive is free software today. But the approach we use is, first, that we go to places that are meant to host free software. GitHub is meant to; of course people also push to GitHub stuff that is not free today, or maybe binaries. So in addition to that first, let's say, curation step, what we do is use automatic tools to detect licenses, and we expose that information, so it is available. We do not actively delete stuff, but if someone wants to, you know, discriminate between what is free today and what is not, they can rely on that metadata to make a decision.

Oh, no, no, not at all. Yeah, correct. So, as someone said, it's incidental.

Yes: have we considered partnering with other archival organizations? Yes, and we have. For instance, the archive of Gitorious that we retrieved was thanks to a collaboration with the Internet Archive; the Google Code one was thanks to a collaboration with Google. And we are kind of complementary, because the Internet Archive stops at the boundaries of source code, basically: they will not crawl version control systems, they will not open up tarballs and index what's inside them, and we are not doing the archival of the web part. So in an ideal world we have persistent identifiers for the stuff they archive and for the stuff we archive, and
some external service, like Wikipedia, or a new Wikipedia of software in the future, can make the link between all the things that have been archived.

Last question. So the question is: is it correct to say that we're going toward an escrow service? Right now, no, because nothing we archive is embargoed or anything like that. But potentially, yes, you can imagine doing that.

This is like the best way of doing CCS: showing that you've got the deposit ticket and the SHA. I love it, I love
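
As a rough illustration of the deposit flow described in the talk (push a tarball with metadata, get a receipt, check the status, receive a persistent identifier once processing is done, with identical files stored only once), here is a minimal Python sketch. It is not the Software Heritage API: `ToyArchive`, its method names, the `toy:` identifier prefix, and the choice of a Git-style blob hash as the deduplication key are all assumptions made for illustration, and the "tarball" is modeled as a plain mapping from path to bytes.

```python
import hashlib
from dataclasses import dataclass, field


def blob_id(data: bytes) -> str:
    """Git-style blob hash: SHA-1 over a 'blob <size>' header, a NUL byte, then the content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


@dataclass
class ToyArchive:
    """Hypothetical in-memory stand-in for an archive with a deposit service."""
    blobs: dict = field(default_factory=dict)     # blob id -> content, stored once
    deposits: dict = field(default_factory=dict)  # deposit id -> deposit record

    def deposit(self, files: dict, metadata: dict) -> int:
        """Accept a 'tarball' (path -> bytes) plus metadata; return a receipt number."""
        deposit_id = len(self.deposits) + 1
        self.deposits[deposit_id] = {
            "status": "deposited", "files": files,
            "metadata": metadata, "identifier": None,
        }
        return deposit_id

    def process(self, deposit_id: int) -> None:
        """Unpack the deposit, store only unseen blobs, and assign an identifier."""
        record = self.deposits[deposit_id]
        manifest = sorted((path, blob_id(data)) for path, data in record["files"].items())
        for path, key in manifest:
            self.blobs.setdefault(key, record["files"][path])  # deduplication step
        listing = "\n".join(f"{key} {path}" for path, key in manifest)
        # Toy identifier derived from the content listing (an assumption, not the real scheme).
        record["identifier"] = "toy:" + hashlib.sha1(listing.encode()).hexdigest()
        record["status"] = "done"

    def status(self, deposit_id: int) -> tuple:
        """Return (status, identifier); the identifier is None until processing is done."""
        record = self.deposits[deposit_id]
        return record["status"], record["identifier"]


archive = ToyArchive()
first = archive.deposit(
    {"COPYING": b"license text\n", "main.c": b"int main(void) { return 0; }\n"},
    {"name": "example-product", "release": "1.0"},
)
second = archive.deposit(
    {"COPYING": b"license text\n", "main.c": b"int main(void) { return 1; }\n"},
    {"name": "example-product", "release": "1.1"},
)
print(archive.status(first))  # still waiting: no identifier yet
archive.process(first)
archive.process(second)
print(archive.status(first)[0], len(archive.blobs))  # the shared COPYING file is stored once
```

The two deposits share one file, so the toy archive ends up holding three blobs rather than four; that is the same idea as storing a file found on several forges only once. In a real setting the processing step runs asynchronously on the archive side, which is why the talk describes checking the status with the receipt number.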