Hello and welcome. I'll be talking today about the state of open-source license clarity. My name is Philippe Ombredanne, and I have one weird fact about me: I signed off on one of the largest deletions of lines of code in the Linux kernel. But these were actually not really lines of code. These were comments, license comments. So I'm very good at deleting lines of comments in code.
I primarily maintain open-source tools that help discover where code comes from, what its license is, and whether there are security issues. That's what I do, both from a community and a business perspective. So we'll be looking at these topics today.
The first thing that's really important is why you should care about licenses and licensing. I think that the license is the essence of free and open source. Without licenses, there wouldn't be any free and open source software at all. That's what defines it, and it's really the gating item for open source. Now, I can fix bugs. I can patch vulnerabilities. But even if I have the code, I cannot fix the license, right?
Only the authors can do that. So it's really important to just let that sink in for a second: the license is really the essence of free and open source software.
Now, if you dive down a bit lower: if the license information is not clear, then it's harder for everyone to consume free and open source software. That's true if you're an open source project and you care about the GPL: you want to make sure that the code you depend on or integrate in your own tool has a license that's compatible with the GPL. The same is true for any kind of commercial, proprietary endeavor where you reuse software: you want to make sure you're allowed to use it.
In contrast with proprietary software, where every single license contract is something unique, part of the success of free and open source software has been that we've developed licensing norms. If I say Apache, GPL, or BSD, it's clear and well understood in terms of meaning. There's no need to think and read lengthy contracts and legalese, and that's been a really big win. So any time there's a problem with licensing clarity, I think everyone is losing, be it you as a code author or as a consumer of the code.
So what does it really mean to have some lack of clarity? What about a package that says "I'm freely distributable"? That's not super useful. It used to be common in RPM packages in the past to say, well, this is redistributable. Or there's no license at all, which is unfortunately a more common occurrence. Or, if we look at older kernel thermal drivers, you can be funny and witty and say you distribute your code under the "therms" of the General Public License. This has been fixed. It's actually something I discovered when I cleaned up the kernel. But this is the kind of problem that, at scale, creates ambiguities: is this really the GPL or not? And it may trip up some tools that would be looking for patterns such as "under the terms of the GPL", right?
Here "the therms" and "the terms" would be two different patterns. Or it can be obfuscated at times. This has been seen in a module license where "GPL" was written in ASCII codes, meaning that you cannot really read that stuff, right? You have to interpret the ASCII codes. So all of these contribute to the lack of clarity in licensing.
The other thing is that there's really an explosion in the number of third-party packages we depend on. If you think about Node packages, it's commonplace to have several hundred packages pulled in as dependencies of the application you may be running or building with JavaScript. Or, if you use Docker containers, you are effectively integrating hundreds or thousands of system packages, on top of application packages, together with your code in the Docker image. That means you have thousands of copyright holders and thousands of potentially different licenses.
So really we're reaching a point where open source is widely reused and essential, yet at the same time the volume we're dealing with demands that we put some automation in place. And beyond automation, having clarity in licensing is going to be the only practical way to know what license you're dealing with and how to comply at any scale, because otherwise you're going to spend a lot of time just figuring out what the license of a piece of code is. And remember, as I said earlier, this is essential: you need to know what the license of a piece of code is before you're able to do anything with it.
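The two detection pitfalls described above (a misspelled notice and a license name hidden behind character codes) can be illustrated with a small sketch. This is only an illustration under stated assumptions, not how any particular scanner works: the regular expression, the function names, and the exact form of the obfuscated `MODULE_LICENSE` string are all hypothetical.

```python
import re

# Tolerant pattern (hypothetical): accepts "term", "terms" and the misspelled
# "therm"/"therms", followed by "GPL" or "[GNU] General Public License".
GPL_NOTICE = re.compile(
    r"under\s+the\s+th?erms?\s+of\s+the\s+(?:GNU\s+)?"
    r"(?:GPL|General\s+Public\s+License)",
    re.IGNORECASE,
)

# Captures the quoted argument of a kernel-style MODULE_LICENSE("...") macro.
MODULE_LICENSE = re.compile(r'MODULE_LICENSE\(\s*"([^"]*)"\s*\)')


def has_gpl_notice(text):
    """Return True if the text contains a GPL notice, typos included."""
    return GPL_NOTICE.search(text) is not None


def decode_hex_escapes(text):
    """Replace two-digit C-style hex escapes with the characters they encode."""
    return re.sub(
        r"\\x([0-9a-fA-F]{2})",
        lambda match: chr(int(match.group(1), 16)),
        text,
    )


def module_license(source):
    """Return the decoded MODULE_LICENSE value, or None if there is none."""
    match = MODULE_LICENSE.search(source)
    return decode_hex_escapes(match.group(1)) if match else None


if __name__ == "__main__":
    print(has_gpl_notice("released under the therms of the GPL"))
    print(module_license(r'MODULE_LICENSE("\x47\x50\x4c");'))
```

A strict pattern that only accepts "terms" would miss the misspelled notice, and a plain text search for "GPL" would miss the escaped string; both cases need extra normalization before matching.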
So in an ideal world, this talk shouldn't be taking place, right? The provenance and licensing of all the third-party packages that you use in any piece of code should be discoverable through structured metadata, and that's all clear and jolly, and there's no ambiguity whatsoever. Again, because it's important for everyone to know that information, we should really know it all.
I've made a study, and that's part of the topic today, about the license documentation found in roughly five thousand open source application packages: things like Ruby gems, PyPI packages, Maven, and Node. Less than five percent of these five thousand contained what I would call quasi-perfect, complete, unambiguous license documentation. That's troubling, because what this really means is that 95% of these packages would need some review before you can use them without the risk of not complying with the license, because you don't know what the license terms are.
Now, how do you go about collecting license information? There are many different ways. One is to get structured information from what's called a package manifest. If you use Python, you're going to create a setup.py script, which can contain a license tag, which is structured information about what the license is. The same happens for Ruby with gemspec files, for Node with package.json, or for Maven with the Maven POM. Each of these has placeholders where you can put structured information about origin and license. In some cases they are build scripts, so they combine both the build and the metadata aspects of the package. That's the case, for instance, for RPM spec files or Gradle build scripts. Or they can be fully dedicated to licensing, like Debian copyright files.
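As an illustration of reading a declared license from one of the manifests listed above, here is a minimal sketch for a Node package.json. The function name is hypothetical. It handles the plain string form of the `license` field as well as the older object and `licenses` list forms that npm has deprecated, and it returns the values as found, without interpreting them.

```python
import json


def declared_licenses(package_json_text):
    """Return the license values declared in a package.json, as a list."""
    data = json.loads(package_json_text)
    found = []

    # Current form: "license": "MIT" (ideally an SPDX expression).
    license_field = data.get("license")
    if isinstance(license_field, str) and license_field.strip():
        found.append(license_field.strip())
    # Deprecated form: "license": {"type": "ISC", "url": "..."}.
    elif isinstance(license_field, dict) and license_field.get("type"):
        found.append(license_field["type"])

    # Deprecated form: "licenses": [{"type": "MIT", "url": "..."}, ...].
    legacy = data.get("licenses")
    if isinstance(legacy, list):
        for entry in legacy:
            if isinstance(entry, dict) and entry.get("type"):
                found.append(entry["type"])

    return found


if __name__ == "__main__":
    print(declared_licenses('{"name": "demo", "license": "MIT"}'))
    print(declared_licenses('{"name": "demo"}'))
```

An empty list is the case the talk is concerned with: a manifest that exists but declares nothing.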
All of these files are package manifests, and they can provide explicit license information. Beyond that, we have license files (text files like a COPYING file), notices, and tags that may exist, such as SPDX license identifiers. We'll come back to that in a second. All of these are present in different places: they can be in top-level files, inside the code, or in various documentation. And there are a lot of different indirect provenance clues that can be used too. A URL saying "hey, I copied this code from Stack Overflow" would be a good example of a common clue that there's code from a third-party origin that may have some well-known license attached. There are a lot of other techniques, but these are the key ones. And again, I'm focusing here not on finding code reuse directly, but on finding the license of that code.
So let's look specifically at package manifests. That's the first technique I was mentioning, and it's structured metadata. The thing is that, in practice, things are unfortunately not that simple, as much as we want to have that information clear and up front, because that's the first obvious piece of information. You have a field in a data file that says "license: GPL". That's very useful, right? It's very obvious.
It's clear, concise, and unambiguous in many cases. The problem is that only a subset of these packages contain proper declared provenance information. We call it "declared" when it comes at the top. So for the ClearlyDefined project, we evaluated the clarity of license documentation for, as I said, about 5,000 of the most popular free software packages. Popularity is actually something difficult to figure out, but we came up with a few heuristics and came out with a list.
The average license clarity score for these is about 45 out of a hundred, 45 percent. That's pretty bad, right? It's below par and below half of the possible score. And we had only 194 out of roughly five thousand that had a score above 80, a score above 80 being something that would be considered great documentation: not perfect, just great. So that's the first thing: the information is not always there.
The second thing is the licenses that we can find in code and text. That's where you come to what's called scanning, and there are different techniques for that. The best one is the third one, which I think is the most comprehensive, and which consists of doing a pairwise comparison of all the license texts and samples you can find against all the files you have in your codebase. It's better because it's perfect: it's a diff. In legal terms it would be called a legal redline. There are other techniques, which are used by tools other than mine, such as pattern matching and probabilistic text matching. I actually use all of these, and I think the best tools should and would use all three techniques together to ensure that you don't leave any license behind.
So what do we really mean by license clarity?
It's pretty intuitive if you think about it. First, a package would have a clear license in its package manifest, declared at the top level. That's one.
Second, all the licenses would be documented in the code, in the files, when possible, with either a notice or an SPDX license identifier. That's two.
Third, the licenses at the file level would be the same as the one declared at the top level. These are consistent, and that's important. More often than not, that's not the case, and that's a problem. That's three.
Fourth would be using well-known licenses, not some less-traveled thing, an unknown quantity that would require review. I know about the GPL, I know about the BSD, the Apache license, and so on. There are roughly a thousand different open source licenses, and there are definitely some that are much less known than others. The proxy we use here is to ask: is it a license that's referenced at SPDX? SPDX also covers all the licenses from the Free Software Foundation license list, all the licenses from the OSI, and those from Fedora, so it's pretty comprehensive for the most common well-known licenses. That's four.
And last, most licenses have a simple requirement at the minimum, which is: make sure the license text is present somehow. And having the license present as text is important.
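The five elements listed above are factual checks, so they can be combined into a single number. The sketch below shows one way that could look. The weights, the field names, and the way partial file-level coverage is counted are all assumptions made for illustration; they are not the weights or the formula used by the project discussed in the talk.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# Hypothetical weights, chosen only so that they sum to 100.
WEIGHTS = {
    "declared": 30,     # 1. license declared at the top level
    "file_level": 25,   # 2. licenses documented in the files
    "consistent": 15,   # 3. file-level licenses match the declared one
    "spdx_known": 15,   # 4. license is a well-known (SPDX-listed) license
    "full_text": 15,    # 5. full license text is present
}


@dataclass
class PackageFacts:
    declared_license: Optional[str] = None
    files_total: int = 0
    files_with_license: int = 0
    file_licenses: Set[str] = field(default_factory=set)
    spdx_known: bool = False
    has_license_text: bool = False


def clarity_score(facts):
    """Combine the five checks into a 0..100 score (illustrative only)."""
    score = 0.0
    if facts.declared_license:
        score += WEIGHTS["declared"]
    if facts.files_total > 0:
        # Partial credit, proportional to the share of documented files.
        coverage = facts.files_with_license / facts.files_total
        score += WEIGHTS["file_level"] * coverage
    if facts.declared_license and facts.file_licenses == {facts.declared_license}:
        score += WEIGHTS["consistent"]
    if facts.spdx_known:
        score += WEIGHTS["spdx_known"]
    if facts.has_license_text:
        score += WEIGHTS["full_text"]
    return round(score)


if __name__ == "__main__":
    terse = PackageFacts(declared_license="MIT", spdx_known=True)
    print(clarity_score(terse))
```

A package that only declares a terse, well-known license at the top gets partial credit, which matches the intuition that a top-level declaration alone is useful but not sufficient.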
It's quite common to see a very terse license declaration which says "oh, this is BSD" or "this is MIT". Good, except that there are potentially 53 varieties (57 varieties actually, like ketchup) of MIT and BSD licenses. So it's important to have the text, to make sure that we can comply correctly with the terms, and know these terms, when we need to reuse the code.
For each of these five elements, we've assigned a weight, and then we can compute these automatically because they are factual. There's no ambiguity. That's the essence of the scoring for license clarity.
Now, if we run that on about five thousand packages, like we talked about, the median score is really all over the map. What we see come out is that Ruby gems, Node packages, and Python PyPI packages tend to have a better median score than Maven and NuGet packages. It's interesting too, by the way, because Maven and NuGet packages are primarily redistributed as binaries, and here I'm speaking, literally, from anecdotal evidence.
Very often they're missing any license information whatsoever, so the score is just a reflection of that anecdotal experience. The average doesn't do any better, but Node packages actually do better on average. What we see, for instance, is that packages that have been around for longer, or package ecosystems that have been enforcing some norms, such as gems and npm using SPDX for license values, have of course a much larger number of packages with a well-known SPDX license. That's an easy thing, but it also helps everywhere in terms of the clarity and the presence of a declared license, where you see that they both have very high scores, very close to the ones in PyPI too. PyPI tends to have a well-defined set of metadata but doesn't use SPDX, so in Python you sometimes get much weirder licenses. In most cases you see how bad it is when it comes to getting the corresponding license text. It's actually a challenge in all cases.
If we look at the breakdown by scoring element in more detail, you see where the biggest winners are. Oddly enough, there's actually better average license documentation at the file level in Maven: it's bad at the top, better at the file level in the score. Whereas if you look at Node packages, it's great at the top and pretty poor at the file level. We also have some statistics on 10 million packages, but they pretty much just confirm what we see with the top packages, so there's no real surprise there, unfortunately.
So now, this is a mess, right? What this really means, based on all this data, is that you cannot consume code just by taking the declared license at face value, and you cannot consume code without actually doing some extra work. That just doesn't make sense. Not only does it not make sense, but everybody is potentially redoing that work every time they're about to reuse code. Every time I'm about to use an open source package, I need to check its license.
Well, maybe it's an occupational hazard on my side, but nevertheless. So there are three ways we can go about it: we can fix all the code, we can write better tools, and we can educate and fix with campaigns.
Fixing all the code is the approach taken by the ClearlyDefined project, which is incubating at the OSI. What the project is doing is scanning all the code, literally tens of millions of packages, with ScanCode, computing the clarity score, and then there's a team of curators, people well versed in open source licensing, who are reviewing and curating all these licenses. The problem there is that it's likely to take forever to complete, and it's a very centralized approach, which I think will eventually have a lot of difficulty scaling.
Another approach is to write better tools, and that's what I'm doing with ScanCode. The approach here is to collect more and more license examples, notices, and texts, and then apply machine learning to spot inconsistencies in the detection. So not really using machine learning for detection, but rather to spot issues in detection, and using that also to inject more data, more samples, to feed machine learning and AI. It cannot entirely replace human review, though, unfortunately.
The last way to think about how this can be fixed is using campaigns. The example of the Linux kernel, which I mentioned at the very beginning, is a good one. We did a cleanup campaign focused strictly on the licensing of the kernel, which is about 70,000 files and which was pretty messy in terms of licensing.
There were over 700 different license notices just for the GPL. So the work has been to run massive scans on the kernel, review them, have a community of volunteers help review all the license detections that were made, and adopt a shorter, cleaner, simpler SPDX license identifier. Today, we have over 60,000 files that have a clear license. There were over a thousand different license notices, together with the GPL license notices, before; we're now down to about 61 license expressions and probably below 70 different license notices in total. So that's a big win: you go from a total mess to clarity. It took about two, two and a half years to get there, but that's very efficient.
Another approach is to do some education and use leverage. An example is Python PEP 639, which I started. It's a Python Enhancement Proposal at the Python Software Foundation, and what this proposal says is: let's adopt SPDX license expressions in Python package manifest metadata. It's very simple, nothing very complex, but it's a community effort. It needs to be reviewed and approved by everyone. The impact is that eventually, instead of having a centralized approach like ClearlyDefined, we delegate the work to every author.
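To make the idea of SPDX license expressions in package metadata concrete, here is a minimal sketch of a check that packaging tooling could run on a declared expression. It is deliberately simplified: it does not validate the full SPDX expression grammar, the function name is hypothetical, and the two sets below hold only a handful of real SPDX identifiers rather than the complete license and exception lists.

```python
# Tiny illustrative subsets of the SPDX license and exception lists.
KNOWN_LICENSES = {
    "MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause",
    "GPL-2.0-only", "GPL-2.0-or-later", "GPL-3.0-only", "LGPL-2.1-or-later",
}
KNOWN_EXCEPTIONS = {"Linux-syscall-note", "Classpath-exception-2.0"}
OPERATORS = {"AND", "OR", "WITH"}


def unknown_identifiers(expression):
    """Return the identifiers in the expression that are not in the known sets.

    Parentheses are ignored and operator placement is not checked, so this
    reports unknown names only, not malformed expressions.
    """
    tokens = expression.replace("(", " ").replace(")", " ").split()
    unknown = []
    expect_exception = False
    for token in tokens:
        if token in OPERATORS:
            # The identifier right after WITH names an exception, not a license.
            expect_exception = token == "WITH"
            continue
        known = KNOWN_EXCEPTIONS if expect_exception else KNOWN_LICENSES
        if token not in known:
            unknown.append(token)
        expect_exception = False
    return unknown


if __name__ == "__main__":
    for declared in ("MIT OR Apache-2.0", "BSD", "GPL-2.0-only WITH Linux-syscall-note"):
        problems = unknown_identifiers(declared)
        print(declared, "->", "ok" if not problems else f"unknown: {problems}")
```

A terse value such as "BSD" is flagged because it does not say which BSD variant applies, which is exactly the kind of ambiguity a structured expression is meant to remove.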
We can provide authors feedback through the tooling, when they're creating packages and writing package manifests, that they're not using proper, well-known, well-defined, and clear license information, or that they're missing this or that license information. We can then gently educate each of the authors before enforcing clarity. Enforcing clarity would mean rejecting the publication of a package that doesn't meet these strict criteria of licensing clarity. I think in the end it's a better approach because, rather than looking at one package at a time with a small team of volunteers, you're eventually looking at all the packages, involving every author at once. So the leverage there is significant, and it's likely a much better approach.
In the end, probably a mix of fixing tools, some centralized work, and a lot of decentralized leverage, community by community, is likely the best. And that's pretty much it. We'll be trying to start campaigns for other communities to actually fix the licensing, and I look forward to discussing it with you. Thank you very much. And now we'll take some questions.