OK, our next speaker is Lars Kurth. He's talking about mixed license FOSS projects: unintended consequences, worked examples, and best practices. Yes, thanks for coming. I hope this will be an interesting talk. I'm going to cover topics such as how common it is for projects to be mixed license, some of the challenges around managing mixed license projects, and the stuff contributors generally don't think about and its consequences. And I'm going to make this a bit real through some war stories from the Xen Project. And then I'll conclude with best practices at the end. A little bit about me. You see a few pictures there. I love to travel to weird places. And I'm trying to use some of my holiday snaps in a presentation where it makes sense. I was a contributor to numerous projects. I worked in a lot of different industry segments, from parallel computing to tools to mobile and now virtualization. I'm essentially the community guy for the Xen Project. I get paid by Citrix, but I'm essentially accountable to the community. And I put this talk together because I led a number of different licensing-related activities last year on behalf of the project, but also within Citrix. So to set the scene, I really wanted to look at the question of how many open source projects which claim to be single license are actually single license. So I just started out with a random number of projects which I tend to work with. I did this for more of them, but we can do endless lists. So I'll just pick four. So Linux says it's GPLv2. QEMU says it's GPLv2. We say we're GPLv2 too. FreeBSD says BSD. And you can make an endless list of those. So then I asked myself, what is actually the percentage of files which are under, let's call it, the native license of that project? Now you may have seen I used the analogy of a picture of sand. The idea there is that if you look at a heap of sand from afar, it looks very pure.
And if you look close up, it's less pure. And so there's a little bit of an idea of entropy going on there. And that becomes a little bit of a theme throughout the presentation. So let's look back at those four projects. What I did there is I ran ScanCode, because that's easy and quick to run, although it's not very accurate, over the code bases of those four projects. And I came up with an estimate of how much of it is actually the license it claims to be. I did put an upper bound in place because there are lots of files in those projects which actually have no copyright headers at all. And the assumption there is that most of the time that's the native license. But then there might be some directories which are entirely of a different license. So that means that it's basically just an upper bound and an estimate. But what's interesting is that quite a significant proportion within those projects is actually under different licenses. And I didn't just do this exercise for those four projects. I did this for a few more. And it actually led me to believe that many projects, maybe even most of those which claim to be single license projects, aren't in fact single license. And then I wanted to look at some of the reasons. Why do projects allow licenses other than the one they state into the code base? So there may be lots of different reasons. You may want to interface with another project under another license. So if you're GPLv2 and your header file is GPLv2, obviously that's a problem if you want to use it in another project. You may want to allow other projects to interface with you. So that's the same issue. Quite frequently, you may want to import code from other projects. And then, coming back to this whole idea of entropy, you may not have clear rules that govern license exceptions within your project.
And if you don't have clear rules, that just means people assume that it's OK to add code with another compatible license to your project, which comes back to this whole idea of entropy. If you don't have clear rules, then people will just add stuff. And you will get increasingly mixed license over time. So the point is all these reasons are actually good and valid reasons for license exceptions. There's no problem really with this. The problem comes if you do that and you don't have any guidance, you don't have any best practice, and you don't have any tooling or processes in place to really enforce a set of rules around what you want to achieve. And if you don't have that, you may expose yourself to a number of unintended consequences. And some of them could potentially be quite significant. And that's where we get into the war stories from the Xen Project. So we'll look at a number of examples where we tripped over this last year. There will be three major storylines. The first one will be around a major contributor looking at our code base in order to give their employees approval to contribute to our project. So that's quite similar to the talk we had beforehand around CLAs and some of the challenges vendors face. And so there's an element of making this easier for projects. Then we are going to look at a worked example of relicensing a component, how this works in practice, some of the challenges you might trip over, and how maybe you can avoid that in the first place. And then a really interesting one, which is around GPL vX only versus GPL vX or later, where there are quite a few subtleties which are not very well understood generally in the community. And I must admit, I didn't understand this very well either until I tripped over it. But before that, what is the Xen Project? Well, we develop open source virtualization technology and we've been around since 2003.
We have a number of sub-projects: the hypervisor, toolstacks, Mirage OS, drivers, and so on and so forth. The stats I quoted earlier just relate to the hypervisor itself. And we're a Linux Foundation Collaborative Project with a number of big commercial backers behind it, which help with infrastructure, blah, blah, blah. So what are our reasons for license exceptions in the code base? In many ways, they're actually exactly the same ones as I listed earlier. They're just a little bit more specialized. So we want to enable guest operating systems which aren't GPL. We want to be able to run those operating systems on top of our hypervisor. And that means we have to provide some kind of interfaces to do that. And we do that by making most of our public headers BSD or MIT licensed. We want to make it possible for such operating systems to have Xen support and to be used as a control domain. And again, some of the common code for that reason is BSD style, or some of it is dual licensed. We also have a big ecosystem around our code base. So we want to be able to have libraries linked into third-party tools, even proprietary ones. And so for some key libraries, we use LGPL for that reason. And we want to be able to import code from other projects, particularly Linux and so on. There's some stuff we just have to import. Now, at the beginning of last year, we didn't have any codified rules around licensing exceptions. There were some practices in how people in the community operated. But fundamentally, we just assumed we were a single license project. And we were totally oblivious to the fact that actually there are lots of other licenses in the code base.
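The upper-bound estimate described earlier (files without any header are counted as native license) can be sketched roughly as follows. This is only an illustration, not the scan that was actually run: the phrase list, the function names `classify` and `native_share`, and the idea of passing file contents in as a dictionary are all assumptions made for the example, and a real scanner such as ScanCode or FOSSology does far more.

```python
import re
from collections import Counter

# Illustrative phrases only; a real license scanner matches far more variants.
MARKERS = [
    ("LGPL", ("lesser general public license", "library general public license")),
    ("GPL", ("gnu general public license",)),
    ("MIT", ("permission is hereby granted, free of charge",)),
    ("BSD", ("redistribution and use in source and binary forms",)),
]


def normalize(text: str) -> str:
    """Collapse whitespace and comment decoration so wrapped headers still match."""
    return re.sub(r"[\s*#/]+", " ", text).lower()


def classify(header: str) -> str:
    """Return a coarse license label, or 'none' if no known phrase is found."""
    text = normalize(header)
    for name, phrases in MARKERS:
        if any(phrase in text for phrase in phrases):
            return name
    return "none"


def native_share(files: dict[str, str], native: str) -> tuple[float, float]:
    """Return (lower, upper) bounds for the share of files under `native`.

    The upper bound assumes every file without a recognizable header
    carries the project's native license.
    """
    if not files:
        raise ValueError("no files given")
    counts = Counter(classify(text) for text in files.values())
    total = sum(counts.values())
    lower = counts[native] / total
    upper = (counts[native] + counts["none"]) / total
    return lower, upper
```

The gap between the two bounds is a quick indicator of how many files carry no licensing information at all.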
And the first storyline is really around the perils of not having licensing information easily consumable by lawyers who might look at your code base and who might make decisions about whether it's OK for their company to contribute to your project. I don't want to name companies. So what I'm going to do is, for each company, I'm going to use a code name. And I'm using code names from plant life. What you see here is a dragonblood tree. So in 2015, we had a large vendor, we'll call them Dragonblood, who was reviewing the project with a view to allowing their employees to contribute to our project. This was around October or so, 2015. The Dragonblood company put an IP lawyer onto our project. And they started an IP and patent review. And the IP lawyer was very, very thorough. So they evaluated the licenses, looked at all the COPYING files, and came across some inconsistencies. So they started running FOSSology on our code base. And then they started to contact the Linux Foundation, and later me, and started to ask lots of questions. They picked up a number of mismatches between what some of the COPYING files said and what was in some of the code. These were all just relatively minor issues. In one case, the COPYING file said our public headers use BSD-style licenses, and then a few files were MIT licensed. And the lawyer picked this up, and then lots of questions started. They also picked up all these licensing exceptions with the FOSSology review and basically wanted to know why we used another license. And then all these questions came to me. And while this whole process happened, they wouldn't allow their staff to contribute to our project. They wanted to have all the answers first. I don't know how common this is, or whether this is an unusual case. But that's what happened and what might happen to other projects, too.
So I ended up doing a lot of code archaeology to answer all of their questions. It's a big company, and I wanted to make sure it would contribute to the project, because having them on board would be an asset. So I had to look at a long list of questions. And they typically fell into two classes. Why was a specific component licensed in a specific way, and what was the rationale? And why was a piece of code imported from elsewhere, and where did it come from? And so on and so forth. And after six months of email exchanges between me and the lawyer, eventually they agreed that their staff could contribute to the project. So the first question is, why did this take so long, right? Well, from my perspective, all the information which was needed was actually there. It just wasn't easily consumable, right? We had all the information about when we imported code. All this kind of stuff was there. But sometimes it was in commit logs. Sometimes it was in header files. Sometimes it was in another file somewhere in the tree. Sometimes in a COPYING file. Sometimes in an email which some of these files referred to, right? So that took a very long time to actually pull together. And as a lawyer, obviously, if you look at this, you're just going to say: oh, the information isn't there, they're not following good practice, that's not my problem, have the community manager deal with this, right? Because there were some inconsistencies, that relationship and conversation with the lawyer started off on the wrong foot as well. They came across things where the documentation said one thing, but the reality was another. So that didn't immediately build trust either. And so I had to work quite hard to eventually build that trust by being responsive. And of course, lawyers tend to work on multiple projects. So every single time there was an email exchange, there would be something like two weeks of silence.
And so the whole process took a very long elapsed time. While this was going on, our project decided that we really must address this for the future. We're going through this once. We're actually doing all this archaeology right now. We may as well present the information in a way which makes it easier in the future for a lawyer just to look at this and make some decisions. So this is what we did. It wasn't actually that much work in the code base to fix that, because I had to dig out all the information in the first place to get that one contributor on board. So we documented our rules and conventions around license exceptions. And we added this into the code base, in tree. So for example, each directory which isn't GPLv2 now has a COPYING file in it. It explains the rationale and some of the reasons why we're using a specific license. Higher up, we have COPYING files which explain the information architecture we're using and some of the general guidance. But some of that is deliberately described in a looser way. We're not saying our public headers are BSD only if maybe there's some other stuff in there, such that we don't get this kind of issue where we say one thing and then, when you dig into it, it's not quite accurate, which can build mistrust. Then for every piece of code we import from somewhere else, we have a record in a README.source file. For smaller directories, we have one file. And there we just keep a record for every code import we have from another project. And that's even for code which is also GPLv2. So if you import a file or a piece of code from somewhere in Linux, we record that in that file. And that file contains information such as why we imported it, where it is from, and other meta information, such that you can follow that up at some point in the future.
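To make the idea of a per-directory import record concrete, here is a minimal sketch of what such a record could look like if it were kept in a simple key/value text layout. The field names, the layout, and the helper names `ImportRecord` and `parse_records` are hypothetical and chosen for illustration; the talk only says that the record is a text file stating why code was imported, where it came from, and other meta information.

```python
from dataclasses import dataclass


@dataclass
class ImportRecord:
    """One code import; all field names are hypothetical, not a project format."""
    path: str       # where the imported code lives in our tree
    origin: str     # upstream project or repository it came from
    revision: str   # upstream version or commit that was imported
    license: str    # license of the imported code
    rationale: str  # why it was imported

    def render(self) -> str:
        """Render the record as 'key: value' lines, one field per line."""
        return "\n".join(f"{key}: {value}" for key, value in vars(self).items())


def parse_records(text: str) -> list[ImportRecord]:
    """Parse records separated by blank lines back into ImportRecord objects."""
    records = []
    for block in text.strip().split("\n\n"):
        if not block.strip():
            continue
        fields = {}
        for line in block.splitlines():
            # Split on the first colon only, so URLs in values stay intact.
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        records.append(ImportRecord(**fields))
    return records
```

Keeping the layout this regular means the same file stays readable for a lawyer and can still be checked by a script, for example to confirm that every imported path has a record.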
And then we fixed some of the inconsistencies in our documentation, such that when a lawyer looks at it, they can't point out: well, you say one thing, but actually that's not quite the truth. So we fixed that. So that was the first story. The second one is about relicensing a key component within the project. There I'm going to look at the general workflow and some of the challenges we found around doing this. The component in question, and you can look that up, is a patch series called "Make ACPI builder available to components other than hvmloader". That sounds pretty opaque, if you like. But actually behind it, this is a major change which enables a major new piece of functionality. And all we wanted to do is move a component to somewhere else in the stack, but because of that we then had to make a license change. To explain this, I want to introduce a bit of terminology and taxonomy such that the whole thing becomes a little bit clearer. This is an orchid here, by the way, a rather weird one, which I found in Costa Rica when I was traveling there. So, a very brief introduction to our architecture, and then I'm going to relate that to licensing. At the bottom you have the hardware, and you have our hypervisor, which deals with things like configuration, scheduling, memory, all those kinds of things. Then we have virtual machines. The first virtual machine which is started up in a system is called Dom0 or VM 0. That contains a kernel, Linux or BSD or some other operating system, that has drivers in it. And then obviously we have guest VMs with a guest OS and applications. And they also have drivers in them, which talk to the drivers in Dom0, which then talk to the hardware. And then the interesting and missing piece, where the change is, is this thing called the toolstack, which controls the entire system and has some third-party console or graphical tool on top which talks to it.
And then the whole thing is replicated across the board. And in licensing terms, right? Most of the hypervisor and lower-level stuff is GPL v2. Most of our drivers are GPL v2 as well, but some of them are under other licenses. For example, Windows drivers obviously can't be GPL. And then, as I was mentioning earlier, some of our utility libraries like libxl and libxc are LGPL. And that's here, you see. If you want to check that afterwards, you can look at those slides in a bit more detail. So what does this mean for the ACPI builder change I was mentioning earlier? I'm just going to take everything out of this diagram which isn't really relevant. There we have the HVM loader, which is used by the hypervisor, and the ACPI builder is used by the HVM loader. And to make this change we had to introduce a new relationship: we wanted to be able to use the ACPI builder in the LGPL code, in libxl. Now, our options at this stage. We could re-implement the ACPI builder and integrate it into the top library. But to do that, we would technically have had to do a clean room implementation, and that would have been really hard and probably not doable, because the same people who wrote this stuff originally would have had to write the new thing again. Or we could just say, well, we're going to link that GPL v2 library into the LGPL one. But if you link the two together, then the whole thing becomes GPL, and that would have been too disruptive for everybody who uses this library. Or we could just relicense it, and that seemed relatively straightforward. It wasn't that much code, probably about 30 files or so and probably about 30 contributors, which didn't really sound so bad. So we decided to relicense this thing from GPL v2 to LGPL.
So first, there's an observation that refactoring and ongoing development might require unanticipated license changes if your project has multiple licenses. If everything's the same license, obviously that's never going to happen. And maybe this could have been avoided with a bit more foresight and planning. So I guess when you architect your system and introduce new components, maybe it makes sense to think a little bit ahead about where this might theoretically be used at some point in the future, and then choose the license accordingly. In our case, that component was introduced more than 10 years ago. At that time, we probably wouldn't have anticipated that we might need to use that thing further up the stack. So, let's look at the workflow of relicensing a component. The first step is really identifying all the copyright holders. And how do you do that? Well, you're going to look at the copyright headers of your files. You're going to look at the authors in the Git logs. If you have a DCO or something similar, you're going to look at all the sign-off tags. And you may also have to look at the code import history and then piece the logs from different projects together, to make sure you really get all the copyright holders. In our case, we did all of this, but then it turned out that the list of names was the same as the one that came from the DCO, which is what we would have expected in the first place, which is a good thing. So then you have the list of copyright holders. And what I then did is divide that list up into individuals who hold the copyright themselves, and companies, because to get the approval you would handle this differently for individuals than for companies. And I'll get to this a little bit later. So identifying copyright holders is really easy, right? Well, maybe not. You could trip over tooling issues.
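The first pass over Git authors and sign-off tags could be sketched like this. It is a rough illustration under stated assumptions: the input is plain log text, for example from something like `git log --follow --format='Author: %an <%ae>%n%b' -- <path>`, and the function name `contributors_by_domain` and the grouping by email domain are choices made for this example, not the process the project used. Copyright headers and import history still have to be checked by hand.

```python
import re
from collections import defaultdict

# Matches "Author: Name <email>" and "Signed-off-by: Name <email>" lines.
PERSON = re.compile(
    r"^\s*(?:Author|Signed-off-by):\s*(.+?)\s*<([^>\s]+)>",
    re.MULTILINE,
)


def contributors_by_domain(log_text: str) -> dict[str, set[str]]:
    """Group contributor names by the domain of the email address they used.

    The domain is only a hint for splitting the list into companies and
    individuals; a private address does not prove a personal contribution.
    """
    grouped: dict[str, set[str]] = defaultdict(set)
    for name, email in PERSON.findall(log_text):
        domain = email.rpartition("@")[2].lower()
        grouped[domain].add(name)
    return dict(grouped)
```

Grouping by domain gives a starting point for the split into company and individual approvals, and it makes contributors who used more than one address easy to spot.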
So for our repository, we first used Mercurial. And then we converted the entire tree to Git at some point, like six or seven years ago. And Mercurial doesn't do code motions very well. So when you move files around and then track the history after the conversion, your history kind of stops, right? So we had to manually stitch this together. The same can also happen in Git if you do your code motions wrong. So there are some technical issues which you could trip over, where you have to be careful. Another thing is, well, was the code you want to relicense, or some of it, imported from elsewhere, right? Now, if you keep good records, you may have all that data and may be able to stitch that all together. But probably it's safest to run FOSSology or another tool over it and then establish an entire chain of where the code came from. Now, if it was imported from elsewhere, then obviously the list of contributors and the list of people you might need to get approval from grows. In our case, the code was originally imported from Linux, and we nearly missed that. There could also potentially be issues with CLAs. If some of the code you imported has a CLA, then obviously you may have to ask some other people, like a foundation, to get approval as well. Another thing which was interesting was that we had some issues around the use of private email addresses by company employees. Even though we knew that some people had worked for a company during that time, they had used a private email address. So we had to ask them privately and also ask the company, because we didn't know whether the change was made on company time or not. So you probably want to have a policy around that within your project as well, just in case you have to relicense a component at some point. So, back to the workflow. What did we do then? The individuals, we contacted them all by email.
For the company contributions, I tried to find reps in that company to make a decision on behalf of the entire company. That was quite easy for us because most of the companies who contribute to our project are actively engaging with it through our advisory board. So I could do this relatively easily. Other projects might find that quite difficult. And then it was a lot of chasing, because individual email addresses might change, people change jobs, and so on. So I sometimes had to go via LinkedIn and track people down and so on and so forth. Now, companies can be a bit of a problem if you lose the relationship with them. How do you get approval for a change if the team which originally introduced it doesn't exist anymore, and maybe hasn't existed for seven or eight years? Anyway, if you get the approval, then you can do the commit, keep records and everything, and if not, you need to find a workaround. Well, in our case, this is a picture of an orchid, a Dendrobium, so we have another code name coming up. We had a contributor, let's call him XYZ, who worked for that vendor called Dendrobium, and we couldn't track him down and we couldn't get his approval. He had contributed under a corporate address, so first we thought, well, let's find out whether that guy still works for that company. That failed because the person was based in China and LinkedIn doesn't work in China, and we just couldn't track that person down. But fortunately we have a lot of Chinese companies we work with, so they asked around, and eventually we managed to track down that person. But that didn't really help us, because it then turned out the team didn't exist anymore. None of the people who originally worked on that stuff worked for that company anymore, and so that didn't actually lead us anywhere.
Then I started thinking, well, maybe I can search for some other open source contributions from the company to other projects and get to somebody who cares about open source in that company. Well, as it turned out, there were no recent contributions from that company to any project. And then I just started searching on LinkedIn generally for open source staff and CTO office staff and so on and so forth, and eventually I got a hit and somebody got back to me. But at that point we thought this probably isn't going to lead anywhere, we're never going to get approval from that company, so let's look at a backup plan. Luckily we did get that approval. It just took some time until they decided what to do with that contribution, and that was around 15 lines of code really, right? So we made a contingency plan, and we used the fact that binaries and not source code are licensed, right? So we took the ACPI builder code base and built two variants from it: one which was used at the lower level, where we needed everything, which was GPL, and another variant which would just use part of it and was LGPL. Initially we were looking at whether we could just take that code out. But that would have been too complicated to do because it happened so far in the past. So we built these two variants, and we made sure that all the GPLv2 code was clearly separated such that we could maintain that going forward. But actually this would have been really ugly, not easily maintainable, and so on and so forth. So we were really lucky that we could actually track down that company, got the approval, and didn't have to go for the workaround. So what were our pain points around this whole process? Well, tooling, right? You have to be really careful, particularly around code motions. If you don't do this right, then you might lose some part of your history.
In the implementation, we also tripped over the fact that we didn't have a record for some code imports from Linux in this README.source pattern, and we nearly missed that dependency, right? Which would have meant we would possibly have missed some copyright holders whose approvals we needed. The next one wasn't so much an issue with this particular relicensing, but I've had other ones where this has come up a lot: sign-off on company time, or people using private email addresses. So you probably want to have a policy around this, in particular if your project hands out aliases. Many projects do: the Linux kernel gives contributors kernel.org aliases. We have the same, we have a xenproject.org alias. Well, what does it mean if somebody signs off with that alias, right? Is that a personal contribution? Is it a company contribution? And so on and so forth. And then the whole process around getting the approval was rather painful as well. So there we implemented the backup, but ultimately it wasn't needed. That leads me to the third story, which is all around unintended consequences of mixing, well, I'm not going to read this out. I'm just going to talk about GPL and use it as a synonym for GPL or LGPL. There's version X only, which means that you take the license header and you remove the "or later" in brackets and specify a specific version. And there's the variant of the GPL header where you leave this in. So you remember the dragonblood tree and that company? Well, part of that whole approval process meant that, besides the license review, they did a patent review as well. And the company was rather sensitive towards patents and GPL v3. And basically they said, well, if you have any GPL v3 code in your code base, we're not going to contribute to you, full stop. But then they discovered that we had a number of files which were marked GPL v2 or later in the project.
And because you can take those files and copy them into another, GPL v3 project, they basically said, well, that may mean that we can't actually contribute to your project, right? And I didn't really understand this very well. This was something which was relatively new to me. And I was really surprised that we had files like this in the code base as well. So I was asking, why did we actually have GPL v2 plus, or GPL v2 or later, code in our code base? Was it a conscious decision? How much is there? And it turned out that this was purely accidental, because people would just go to the FSF website, right, and copy the license template, and that license template says GPL v2 (or later), right? And then they copy that into the code base, and we had no mechanism in place to really prevent that or look at it. It was just something which happened. And then I asked myself, is this specific to us, right? Do other projects also trip over this? Those four projects I listed earlier all have GPL v2 code. So I was asking the question, how much GPL v2 or later code do they have? What is the proportion of the GPL v2 code which is actually marked GPL v2 or later? And it's Linux 14%, QEMU 9, Xen Project 10, FreeBSD 32. So that really implies that there's not a lot of knowledge around that and not a lot of awareness of what this might actually mean, right? So what did we do? The first question is, can we actually go and fix this, right? I looked at this, and there didn't really seem to be a precedent, and there wasn't really a lot of clear guidance on how you would deal with changing these licenses. Would you have to go through a full process and ask everyone for approval?
So I started to talk to lawyers about this, but at the same time we also started to talk to key community members, and it became clear that this was potentially a very divisive issue, so I didn't want to go there. So then we thought about this, and we thought, well, we have this problem right now, let's just not make it worse, right? So the first thing we did is we added license header templates which don't have the "or later" in them for all our GPL and LGPL code. We raised awareness amongst committers and maintainers to make sure that they pick this up. And ultimately, in this particular case, the issue went away because I did some of this research and could point out to the lawyers that, by the way, some of those projects you already contribute to, and for which you've given your engineers approval to contribute, have exactly the same issue as we do. Why do you treat us differently from those other projects? So the issue went away, but obviously it may mean that that company has instructed their staff not to contribute to files which are labeled GPL v2 plus, we don't know. Or there might be other companies who've been through this process and haven't talked to us, who just looked at our code base and decided not to contribute for that reason, right? So fundamentally there's a slightly bigger issue hidden behind this GPL vX only or later question. So this is our case, right? If you're a version two only project and you allow version two or later files, well, you could scare away some contributors. If you're an LGPL v2 or later project and you allow files which are tied to a specific version in your code base, well, you're then diminishing your capability to upgrade your code base to a new version at some later point in the future.
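A check for headers that drift away from the project's chosen variant could be sketched as below. This is a minimal illustration, not tooling the project is known to use: the function names `gpl_version_mode` and `policy_violations` are hypothetical, and the detection simply looks for the "any later version" wording that appears in the FSF header template. Headers that use other wording or SPDX identifiers are not handled by this sketch.

```python
import re
from typing import Optional


def gpl_version_mode(header: str) -> Optional[str]:
    """Classify a header as 'or-later', 'only', or None if it is not (L)GPL.

    Whitespace and comment decoration are collapsed first, so the phrase
    still matches when the header is wrapped across comment lines.
    """
    text = re.sub(r"[\s*#/]+", " ", header).lower()
    if "general public license" not in text:
        return None
    return "or-later" if "any later version" in text else "only"


def policy_violations(files: dict[str, str], policy: str) -> list[str]:
    """Return the paths whose (L)GPL header does not match `policy`."""
    if policy not in ("only", "or-later"):
        raise ValueError("policy must be 'only' or 'or-later'")
    bad = []
    for path, header in files.items():
        mode = gpl_version_mode(header)
        if mode is not None and mode != policy:
            bad.append(path)
    return sorted(bad)
```

Run at review time, a check like this catches a template copied with the "or later" wording before it lands, rather than years afterwards.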
So really what this means is, if you do the one or the other, you should have mechanisms in place to enforce that your code base stays pure, because otherwise you end up with the worst of both worlds. You may not be able to upgrade and you may scare away, yeah, I know. I only have two slides left, so I'm fine. So basically you may end up with the worst of both worlds, and if you look at all those big projects from before, that's the de facto situation we are in right now. So I just wanted to summarize some of the key best practices and lessons we learned. If you want to stay a single license project, you need some tooling or some processes to actually enforce that. Because if you don't, what happens is that the law of entropy comes in and by default you become mixed license. The same is true if you're an LGPL or GPL vX only or vX or later project: you probably want to have some tooling or some mechanisms in place to avoid a mixture of those two. If you do use multiple licenses, you probably want to document license exceptions, in particular the rules around them, your conventions, and the rationale for license exceptions in particular instances. You probably also want to provide copyright templates for the common licenses you use. You probably want to have some mechanism to record imports from other code bases in a consistent manner, and that for all imports into your code base. And you probably want to have some conventions around company and personal sign-off, particularly if you use a DCO. I mean, the same issue was raised in the previous talk around contributor agreements. And DCOs have the same issue in many ways, but that usually isn't very well documented anywhere by any of those projects which use DCOs. And, you know, plan for the future.
If you introduce new components, think about them with licensing spectacles on, and about what somebody might want to do in the future, and then maybe adapt the license accordingly, such that you don't have to relicense and go through this pain at some later stage. And that's it. So I'm open to take some questions. Is somebody maybe going to, yeah, but we'll start there. There's somebody who has a microphone. Yeah. So my question was, I saw that you were recording it, but you were recording it in a README.source, that's like a Markdown kind of file? No, it's just a text file, right? I mean, we could use something more sophisticated, right? Yeah, so my question is, why didn't you use SPDX? Because I actually have to look at your project now, and it was neat that it was written down. Yeah. But I can't machine-read it. Well, SPDX doesn't easily allow you to record where the licenses come from if you import code. Yes, you can add license comments to it. So it's easily there. I mean, the reason is I actually wanted to use SPDX, right, but some of the key people in the project have issues with SPDX for whatever reason, so I couldn't actually get that through, unfortunately, right? We should talk and figure out what those issues are so we can fix them. Yeah, actually, I might put you in touch with the guy who's particularly objecting to SPDX. There was one further down. Thanks. What if the company that held that 15 lines of code had closed down? Is that a problem, or is that better? Well, we did get the approval in the end, right? But if we hadn't got that approval, it would technically have been a problem, because 15 lines are still IP, right? And the company actually existed, so we managed to track them down. I have no idea, that's an interesting question, right?
I mean, typically you maybe have a company who bought them and you can find somebody there, but otherwise maybe you have to find the individuals who worked on that. I'm not a lawyer, so that would be a question for one of the lawyers in the room. What do you do in that case, right? I don't know. A question up there. Maybe one more. One more, and that's it, yeah. I think I've got a yellow t-shirt. All right. Okay, so find me afterwards. Thank you for joining the talk and giving me the time to talk to you.