Professor in the CS department at the University of Victoria in Canada, and he's going to be talking about the challenges of licensing in Debian.

Hi, everybody. As I was introduced, I'm a professor. I started doing a little bit of free software when I was a grad student; I used to help in the GNOME project. Then, when I became a professor, some of my research moved to looking at the way that open source and free software is developed, what we can learn about it, and how we can help the developers who are doing it.

So let me give you a little bit of my credentials. I submit patches here and there. Two of the projects I have been working on lately are Sawfish, the window manager that I use, and the application that you see here. But I also maintain a library, and most people never know which libraries they use. The main application that uses it is Hugin. I see Hugin as a big community of applications intended to do panoramas. One of my hobbies is photography, so I also do some work in that.

But in terms of my research: in the last two years we have been heavily involved with issues relating to licensing as they apply to free and open source software. It has been a very interesting two years, because before us nobody had really looked at the implications of licensing from a software engineering point of view. It was the realm of the lawyers. The lawyers argued about it and the software developers dealt with it, but from the point of view of us, the researchers, nobody really cared about those kinds of issues. In fact, it's still a little bit interesting that some of my colleagues still ask: why should I care? Why is it important? So it's something that they're just waking up to.
By the way, almost all the papers are available on my website. If one is not, email me, because in some cases the copyright doesn't allow me to make them available, but I'll email them to you directly.

At least from my standpoint, I strongly believe that free and open source software has fulfilled the goals of component-off-the-shelf systems, and the best part is that in most cases you don't have to pay a dollar to be able to use it. But it comes with a price, and the price is usually the license. Many people know how to deal with licenses. Many people think that they know how to deal with them but they don't, and those are probably the most dangerous ones. And then there are people who are absolutely clueless about how this works. So I believe that in the next years, teaching licensing should be one of the priorities of software engineering programs across the world, because it is now one of those aspects of knowledge where everybody has to have at least some foundation to understand the implications of building software.

So the typical use cases, I think, divide in three.
First: is my system honoring the licenses of all its components? Almost nobody today builds a software system from scratch. Perhaps the kernels are some of the few that do, but even they require C libraries, for example. Second: given my intentions, for example embedded systems, can I use this component while honoring the license requirements? And another one that is becoming very interesting is copying code, which we in the research community call cloning. If I clone parts of a system or an entire system, maybe I will not tell my customer. But when the customer receives that code, it is very important for them to know whether this code has been cloned, and whether the restrictions of the original software have been followed or whether the copyright headers have been replaced by new ones or simply removed. We are seeing this as a more and more important problem.

Software is complex. This little blue dot here is OpenOffice, and this is the dependency graph built from Debian dependencies and from the spec files. Of course there are optional dependencies that start to branch, but this gives you an idea, and these graphs are getting worse and worse as time passes. From the engineering point of view this is great, because it means that every single one of these dots is doing a very specific job, is well encapsulated, can be tested, and can be maintained. But from the overall point of view it becomes very complex to determine whether all the IP restrictions are being properly followed, by OpenOffice in this case.

One of the areas we have been looking at is the issue of auditing intellectual property, and we have focused primarily on copyright. We haven't touched anything else. Patents are a big issue.
There are also trademarks; as you well know, Debian has been affected by some of them.

So essentially, if I have a system, say Debian or an embedded system, am I honoring the license of all the components that I'm using? The first problem that people face is: what components am I using? Once I know the components, I need to know the license of the components. But how do I know the license of a component? Well, I have to look at the license of the files. But once I go all the way down to the license of the files, I have to start going up again, and we find that the problem is that files interact in many different ways. The way that they interact puts constraints on the component, and the way that the components interact puts constraints on whatever is built on top of them. So there are a lot of small issues at play that have to be understood.

So what components am I using? Well, the trivial source of this data, and this is research that we started doing around four years ago, is to just look at the spec files and see how the dependencies start to propagate. As I mentioned before, it gets a little complex because there are optional dependencies, and of course those dependencies are fulfilled in many different ways, although usually with the most commonly used software. For example, the base Debian system means that many of those optional dependencies will be satisfied by whatever is already in the system. But the main problem is that this data is really intended for a different use, which is to be able to run this application: what else do I need? So it may contain more information than is needed. That's not bad from the point of view of package maintenance, but from the point of view of IP you really want to know exactly what's in
there, nothing more and nothing less. The other problem is that they don't really tell us anything about the type of interconnection. For example, if I use grep and execute it via the command line, that is a very different thing than if I use readline and link to it, dynamically or statically. The constraints are different. That's why the data in the spec files is valuable, but it's not really perfect for the kind of IP clearance that we would like to do.

The other thing is that when you have, for example, dynamic linking, what about plug-ins? Plug-ins impose or allow different restrictions in terms of intellectual property. For example, Eclipse says that if you use the plug-in architecture you can have whatever license you want, but if you start to use any other parts of Eclipse and call them, then you become a derivative work. So that also becomes very valuable to know.

In binaries and libraries you can use command-line tools to extract which libraries they depend upon, so at least that's visible. You can inspect the dynamic linking tables to see which functions are being used. But that's not always possible with every kind of system. Then ultimately you have the build data: you can look at CMake and Automake files and try to see which specific components are being used. I'm using the word component rather than package because I will talk a little more about the distinction between binary, source, and component. And then of course you might have to parse source code, and you might need to know what the source code is doing with each one of these components. Ultimately, the maintainers and the packagers are the ones who really know
what's going on. The maintainers in particular are the ones who might understand very well how the system is built. But all they know is one level of dependencies. They don't know ten levels down. What they know is what they are using directly. And what happens sometimes is that the component that is being used doesn't even know that it's being used, or how it's used by the applications that exploit it. For example, with our library we don't really know exactly how it's being used. We just know some of the functions that are being exploited.

So knowing the license of the component is not trivial, and I'll come back to this. What is important at this level is that before I can argue about the license of the component, I need to argue about the license of each one of the files.

Let me define some nomenclature so you can follow me and I don't confuse you, because it might be a little different from what you're used to. I take the source file as the smallest licensable unit. It can be code or documentation, whatever is in the source code: any file that is used to create the derivative work, I'm sorry, the installable object. If it's broken down, for example somebody extracts a function and puts it somewhere else, that's a derivative work and therefore some restrictions apply, but I don't need to go into that.

Licenses: I define a simple license as, say, "this is GPL version 2", or "this is GPL version 2 or any later version", or "this is GPL version 2 with the Bison exception". A very well defined, very well known license. A license can also be a disjunction of licenses, like Perl, which says that you can choose the Artistic License version 1 or the GPL version 1 or any later version. Or it can be a conjunction of
licenses, which means all the licenses apply. And since I define it recursively, I can have a conjunction followed by a disjunction followed by another license, and we have seen things like that.

Then the source files create installable objects, which are the ones that actually do something; the source code doesn't. Sometimes they are identical, as with scripting languages, where the Perl source code will in many cases be the same as the installed one. Otherwise they have to be compiled. So source packages are composed of source files, and binary packages are composed of installable objects and components.

The license statement is the comment, typically at the beginning of a file, that contains the licensing of the file. License statements fall into two categories. One is by inclusion, like MIT, where you have to put the text of the MIT license into the file. The other is by reference, where you say this file is licensed under the following license and you can find a copy of it here or there. The GPL, for example, the Apache License version 2, and the Eclipse Public License basically say: this is the license of the file and you can find it somewhere else. Or in many cases we see "see file COPYING in the root directory", which makes it a little more complex.

So what is the license of a file? Well, it's not always trivial, because there are too many licenses. You go to OSI and OSI says there are around 65 licenses in use, plus the ones that have been removed or replaced by newer versions. But the reality is that there are many others being used in systems, and for many of those we don't even know which ones they are, because if we haven't seen them, we don't know they exist.
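The recursive definition of a license given in the nomenclature (a simple license, a disjunction, or a conjunction, nested to any depth) can be sketched as a small data type. This is only an illustration under assumptions: the class names, the short license labels, and the nested example are hypothetical and are not taken from the talk's slides or from any tool.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Simple:
    """One well-known license, e.g. "GPL-2.0+" (labels are illustrative)."""
    name: str


@dataclass(frozen=True)
class AnyOf:
    """Disjunction: the recipient may choose one of the items."""
    items: Tuple["License", ...]


@dataclass(frozen=True)
class AllOf:
    """Conjunction: every item applies at the same time."""
    items: Tuple["License", ...]


License = Union[Simple, AnyOf, AllOf]


def render(lic: License) -> str:
    """Print a license expression with explicit parentheses."""
    if isinstance(lic, Simple):
        return lic.name
    joiner = " or " if isinstance(lic, AnyOf) else " and "
    return "(" + joiner.join(render(item) for item in lic.items) + ")"


def simple_licenses(lic: License) -> set:
    """Collect every simple license mentioned anywhere in the expression."""
    if isinstance(lic, Simple):
        return {lic.name}
    return set().union(*(simple_licenses(item) for item in lic.items))


# Perl's licensing as described in the talk: a disjunction of two licenses.
perl = AnyOf((Simple("Artistic-1.0"), Simple("GPL-1.0+")))

# A hypothetical nested case: a conjunction that contains a disjunction.
nested = AllOf((perl, Simple("MIT")))

print(render(perl))
print(render(nested))
```

Keeping the expression as a tree, rather than a flat list of names, preserves the difference between "choose one" and "all apply", which is exactly the information needed later when asking how files may be combined.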
The other problem is that many developers modify licenses. The GPL, for example, has the ability to add exceptions to it. From my point of view, if a file has the GPL with an exception, that's a different license. It's not the GPL anymore. It's one of the children of that version of the GPL, but from a practical point of view it acts as a different license, because it cannot be combined in the same way as software without that exception. Sometimes they fix grammar, or they modify the spelling, British versus American spelling; that happens. And sometimes they make spelling mistakes unintentionally.

This is a major problem: how many licenses are there in a file? When do I stop? Is it one, two, three, four, five? I think the worst one we have seen is nine licenses in one single file. And once we know the licenses that are within a file, how do they interact? Do they say "this or this", or "this and this but not this one"? We have sometimes seen a file that says "this file is not licensed under the GPL".

Let me give you some examples. Let me maximize this a little to make it more readable. "Redistribution": sometimes they optionally put the "s" at the end. Sometimes they add a comma, or sometimes they add "without modification". Quotes: what is a quote? Well, it can be this, or it can be this, or it can be this. "Merchantability": some people write it in different ways. Some people put the hyphen in "non-infringement". So that gives you an idea of what it means to determine the license of a file just by doing textual analysis.
It's actually difficult. Before FOSSology was released, what many of the tools would do is just simple regular expressions on terms like GPL: if I see "GPL", it's a GPL-licensed file. Well, we have found that a file can say "this is not under the GPL", which actually says the contrary.

This is a much more complex example. It comes from a Java file from Sun. Basically, this file says that it is under the CDDL and the GPL, and you can choose one or the other. It specifies the conditions under which you can add one or the other, and under which you can remove one of the licenses. So it makes me wonder, and I'm not a lawyer, I should have said that from the beginning, whether this clause down here makes the GPL plus CDDL a different kind of license than saying one or the other. Basically it is GPL plus CDDL with very specific conditions. On top of that, there's an Apache license down here, Apache version 2, and some of the purists from the Free Software Foundation will say that cannot happen because it's incompatible with the GPL version 2. But that's what is there. We don't make a case for whether it's right or wrong; all we say is: that's what I'm seeing. Like one of my colleagues says, we're more like the police. We're just looking for inconsistencies and then we let the judge decide what's going on.

So these are some of the issues we have found, summarized. There is the challenge of finding the license statement: it is mixed with other text, so which text is relevant to the licensing statement and which is not? Files may reference another file where the license is located, and it's not very standard how this is done.
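The keyword-matching pitfall and the spelling variations described above can be illustrated with a short sketch. It is not the method of FOSSology or of any other tool named in the talk; the function names, the normalization rules, and the two-entry rule table are assumptions chosen only to show the idea of normalizing text, matching whole sentences, and answering "unknown" rather than guessing.

```python
import re


def naive_is_gpl(text: str) -> bool:
    """The naive approach: any mention of 'GPL' counts as GPL-licensed."""
    return re.search(r"\bGPL\b", text) is not None


def normalize(text: str) -> str:
    """Fold away variations that do not change what a statement means."""
    t = text.lower()
    t = re.sub("[\"'`\u2018\u2019\u201c\u201d]", "", t)   # any style of quote
    t = t.replace("non-infringement", "noninfringement")  # optional hyphen
    t = re.sub(r"\bredistributions\b", "redistribution", t)  # optional "s"
    t = re.sub(r"[^a-z0-9 ]+", " ", t)   # punctuation and comment markers
    return re.sub(r"\s+", " ", t).strip()


# Hypothetical rule table: normalized sentence -> license label.
KNOWN_SENTENCES = {
    normalize("This file is licensed under the GPL version 2"): "GPL-2.0",
    normalize("This file is not licensed under the GPL"): "not-GPL",
}


def identify(statement: str) -> str:
    """Match a whole sentence; report UNKNOWN instead of guessing."""
    return KNOWN_SENTENCES.get(normalize(statement), "UNKNOWN")


header = "/* This file is NOT licensed under the GPL. */"
print(naive_is_gpl(header))   # the keyword search is fooled by the negation
print(identify(header))
print(identify("Licensed under terms nobody has catalogued yet"))
```

Matching on whole normalized sentences trades coverage for precision: a statement that is not in the table is reported as unknown, which leaves it for a person to inspect instead of being silently mislabeled.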
For example, the one that I hate the most is "the license is in the file such-and-such, which is in the root directory", and as the code moves around you find that there are several of those files in different root directories. There are language-related issues and spelling errors, different ways to refer to a license, and, as I mentioned before, spelling and grammar. Customization: MIT and BSD have to be customized before they can be used. And licensors modify, add, or remove conditions of well-known licenses, or they modify them for different intents.

So we developed a tool. We started using FOSSology and it was not good enough for some of the analysis that we wanted to do, so we developed another license identification tool. It should be out as GPL version 3 plus in a few days. The tool is lightweight; it is a bunch of scripts with a lot of regular expressions. There's a paper on my website that describes the method it uses. And one thing that we did was try to evaluate how good it was.

Can you explain very briefly what FOSSology is? I'm not familiar with it.

So FOSSology, and tomorrow there will be a talk on it, is an integrated environment to do a lot of this licensing analysis. One of the most important parts of FOSSology, which is being developed by HP, is license identification: you give it a file and it tells you, with some probability, which license or licenses it contains. The problem is that FOSSology was making mistakes and it was not capable of saying "I don't know which license this is". Those were things that we really wanted to do. We wanted to say: when I tell you the license, I want to be accurate, I want to be precise, and I don't want to make mistakes, but at the cost that for many files
I will say "I don't know which one it is", with the idea that you can then concentrate your efforts on those that are actually different.

So in the small experiment that we ran, we found that we had seven incorrect licenses, because most of the time we did not identify every single licensing detail in the file. We would say there's a GPL version 2, but we missed an exception, for example. On the other hand, there are tools that are widely used, like Ohcount and OSLC, that have very low precision because they don't tell you the precise version that they detect. When you do license analysis and IP clearance you really need to know the version, because having GPL version 2 is not the same as having GPL version 3, and it's not the same as having GPL version 2 plus. So that's what is very important.

With that, we then started running some analysis. I should clarify that, statistically speaking, there's a difference of around 3 to 4 percent, so all of these are really statistical ties. And this is Debian 5.0.2.
So this is what we found for the files in which we could identify a license. We are capable of saying "no, this file doesn't have a license", and that's something that was very blurry with the other tools and that we wanted to be able to do. As you can see, no license is the most common case, in around 31 percent of the files, and then the GPL version 2.

One thing that I found very interesting is this license, which doesn't exist: the Lesser GPL version 2. It should be Library GPL version 2. But GNOME has some files that started with that, and guess what: people cut and paste license headers without thinking about it, so this has been propagated. Whether that's a problem, most people don't think so, but it tells you a little about the challenges.

Notice, for example, that the license I was showing you, CDDL or GPL version 2, is in 37,000 files, but in only two applications. So there's a big imbalance in that respect. This is because essentially Sun is the only one that uses it, and Sun uses it for GlassFish. GlassFish is really a lot of cloning of other applications, particularly from Apache. They have replaced some of those headers, and some of those headers are incorrectly replaced, but that's a different story.

So, by application: GPL version 2 is still in the lead, but now you see the "see file". These are licenses that appear at least once in each one of those applications. The one I really like is "same as Perl". "Same as Perl" basically says the license of this file is the same as Perl's: an indirect relationship. There are several reasons why I like it. One of them is that it's very practical: the day that Perl moves to the next version, all the modules will move with it.
No need to modify them. The other thing is that with Perl, when you create the template for a new module, it creates the license statement for you, and most people don't change it. So that is part of the reason it is there and is so common.

Doesn't that create a lot of ambiguity, though? I mean, if the version of the license that Perl uses changes from Perl 5 to 6 or something, and that same module is used in both cases, it's unclear what the licensing is.

But remember, the beauty of this is that when you get confronted by the problem, all you have to do is grab a version of Perl that has the license you're looking for and run on it. If your module runs with that, then it satisfies the condition. So it's very pragmatic, and it basically defers the analysis to the very last stage, which is deployment. But I agree with you, it adds new complexities. The beauty of it is that it's very adaptive.

We couldn't really argue anything about those systems that have many licenses, because we don't know how the files interact. So we said: let's look at those systems in which every single file is resolved, either it does not have a license or it has one very specific license and all the files have that. We found that there are few of them, and many of them come with GPL version 2. You can see "same as Perl" too. So this basically gives us a baseline to say: these ones we can argue about, how they are created, et cetera. But they're relatively few compared to the number of systems that exist.

So that's with respect to the source code. Now, what about the installable objects and the packages? As I said before, if all the files have the same license then it's trivial. Although we have seen that maybe all the files are GPL
version 2 plus, but the package maintainer decides that it is GPL version 3. And that's valid. So it's not as simple as it could be, but it's not difficult to say yes or no, whether that's valid.

The problem is when, for example, the source package is split into different binary objects and each one of them has a different license. We see that a lot with libraries, where the executables are under the GPL and the libraries are under the LGPL. Sometimes the licenses are in the documentation, so you really have to read the documentation to understand the constraints that exist. And the worst part is when there are errors in the license, in many cases made by the developers and in many cases by the packagers.

So we had to look at Fedora, because Debian hasn't really taken on the job of trying to determine the license of a package. Fedora does; it's part of their business model. For every binary package they state its license, and we call that the declared license. They say: this is what you can install and this is the license that comes with it. And they do this job. They try to look at the source code. They try to understand the intention of the creator. They talk to the creators to try to understand what the license is.

So this, for example, is packages that have all the files with the same license, and that again is the case where we can simply do an automatic analysis and comparison.
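The automatic comparison described here, restricted to packages whose files all carry the same license, can be sketched as follows. This is a minimal illustration under assumptions, not the procedure actually used in the study: the function names, the license labels, and the "NONE" marker for files without a license are hypothetical. A mismatch is reported for inspection rather than as an error, because, as noted above, a package declared as GPL version 3 whose files are all GPL version 2 plus is valid.

```python
from typing import Dict, Optional

NO_LICENSE = "NONE"  # hypothetical marker for files with no license statement


def uniform_license(file_licenses: Dict[str, str]) -> Optional[str]:
    """Return the one license shared by all licensed files, else None.

    Files without any license statement are ignored, so a package counts
    as uniform when every file either has no license or has the same one.
    """
    found = {lic for lic in file_licenses.values() if lic != NO_LICENSE}
    return found.pop() if len(found) == 1 else None


def check_package(declared: str, file_licenses: Dict[str, str]) -> str:
    """Compare the declared package license against the detected one."""
    detected = uniform_license(file_licenses)
    if detected is None:
        # Several licenses (or none at all): how the files interact matters,
        # so this simple check cannot say anything.
        return "cannot decide automatically"
    if detected == declared:
        return "consistent"
    return f"inspect: files say {detected}, package declares {declared}"


files = {"main.c": "GPLv2+", "util.c": "GPLv2+", "README": NO_LICENSE}
print(check_package("GPLv2+", files))
print(check_package("GPLv3", files))
print(check_package("GPLv2+", {"main.c": "GPLv2+", "lib.c": "LGPLv2"}))
```

In practice the comparison also depends on both sides using the same names for the same license, and on the detector not missing details such as exceptions.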
So notice some things that are very peculiar. One of them is BSD. BSD for Fedora is any BSD, which creates problems when you're trying to do analysis, because BSD 4 is not compatible with the GPL, so you cannot mix them. It would be very valuable to have that distinction, but they don't have it. There are different types of MIT, and that's one of the challenges we have seen, but for them they're just MIT. And then you also have the problem of naming: they say ASL 2.0 and we call it Apache version 2. This is a problem that is being addressed, and I'll mention that briefly later: we basically need a uniform way to indicate the name of a license so we can do some types of analysis.

In another one of the works we did, we tried to automatically audit the licensing and then compare the results against the ones in Fedora and see what we learned. One thing that was clear is that if our tools did not identify the license correctly, we would get false positives. So that's one of the biggest challenges. For example, in this one we didn't identify that there's an exception. We saw that yes, it was GPL version 2, but we did not identify the exception, and therefore we couldn't do the comparison properly.

This, for example, is packages with one license that is inconsistent with the declared license. This is an interesting example: the source code is all under EPL version 1, very nicely formatted, but the declared license is EPL and CPL. Notice that there is a lack of version.
There is a version 0.5 that is not supposed to be used, but we still find it once in a while. And the CPL: what happened is that the documentation said CPL somewhere, and because they probably just use grep, they found it and said this also has the CPL.

This is another major problem that we came across by accident. It's very pervasive and it's one of those big challenges today. As software changes licenses, it puts pressure above, below, and sideways, and then people have to migrate. What we found is that Fedora probably did the analysis some time before, when it was Apache Software License version 1.1, and then the software moved to Apache version 2, but they didn't update it. That was very common. Another one: we found this "VSL" and we couldn't find the license anywhere. We just don't know what that license was.

Then we said, well, let's look at BSD 4 and see what happens: is BSD 4 used within GPL software? Well, it happens that yes, it is. But when the copyright owner is the University of California or NetBSD, they have issued another letter, which is somewhere else, that says it's okay, it acts as a BSD 3: you can essentially drop the statement. But they don't tell you that you can remove it from the license. They basically say it acts like the other one. So you have this file still moving around with a BSD 4 that, if you attach this extra letter, becomes a BSD 3.

Sample code: this was in the examples. The file didn't really have any importance in the creation of Bash; it was just one script that was serving as an example, and it was under the BSD 4. And then we found some suspicious files that had it, but we don't really know exactly what's going on with them.
We reported a lot of this to Fedora. Fedora was great. They were very happy that we started doing this, they helped us in many ways, and we submitted the problems that we detected. Today many of them have been fixed. Others they have been looking at. Many of them were sent upstream. In some cases the problem had been fixed upstream but it just hadn't propagated to Fedora yet.

License evolution: you can see it here, the declared license is an older version than the current version. It happened over and over again.

So what we have learned is that there are applications that have errors in their licensing. In many cases files don't have a license when they should have one, or the licensing information is somewhere else. For example, Linus Torvalds has a clarification that says if a file doesn't have a license, it is GPL version 2 plus, or version 2, I don't remember. But he has the clarification saying: if there is no license, this is the license that applies.

This we have seen many times. For example, in Sawfish, which is the window manager that I run, some files have GPL version 3, or 3 plus, and Sawfish is GPL version 2 plus. So I mailed the maintainer, since I contribute, and I asked him: are we planning to move to version 3 and you haven't told us? He said no, it was a mistake.
I cut and pasted the wrong header. And then he promptly fixed it.

Inconsistent license clauses: this is more dangerous, because we're not lawyers and we don't understand that. The example I showed you, with Apache version 2 and the GPL as a conjunction, might be one of those things that makes them inconsistent.

The incorrect name of the license: as I mentioned, the Lesser GPL version 2, which doesn't exist; the intention was probably Library GPL version 2.

Then the other problem is that once you detect that there is a bug in the license, it can only be edited by the copyright owners, unless they have given you specific permission via the license to do it. Many of you might be familiar with cases in which a system wants to upgrade from one license to another and they start contacting developers, almost sending bottles into the ocean, saying: do you know this guy who used to work in this place ten years ago and went by this email and this name? Because we need to contact him so he can give us the right to change the license.

I really want to hear license stories. If you have any, I'll be very happy to listen to them. That's one of the things that I want to collect: which de facto licenses exist, what they are, and how they affect everything else.
We have quotes by important people in our environment that basically say that licensing errors are very unique and very difficult because they cannot be addressed by everybody. There are specific people who understand what is going on, and they are the only ones who can address them. In many cases, even if they were the copyright owner, they might not be able to solve it, which is very different from the rest of the source code.

So what do we need in terms of license maintenance? We need tools to edit license statements. As simple as that sounds, the best that we have in the industry is scripts that search and replace. Apache and Mozilla try to mark the headers, so you know where the header starts and where it ends, and they can do some regular expression matching and replace it if they want. But those are really cumbersome. What we really need is a way to tell Emacs: look, the files in this directory are under the GPL, so make sure the statement is updated at all times, and when the address of the Free Software Foundation changes, go and modify it, if I'm the copyright owner. But for these other files I am not the owner, so leave them as they are.

We need to verify the validity of the license statements. That's part of Ninka, which is moving in that direction, and part of FOSSology: we can run tools that tell us the license is this or that.

We need unification of licenses. By unification I don't mean removing or avoiding the license proliferation problem, but rather saying: all these licenses are really the same license, they're just different wordings of it. That's work that cannot be done by us. It has to be done by the lawyers.

Summarizing licenses in source code files: that's work that Kate Stewart is doing with some people from industry. There's an effort toward a standard, an XML
specification to document the licensing of files within the source code. It's called SPDX, and Kate will talk about it next week at LinuxCon. What is the goal? That by the end of this year there will be the spec, right? Okay. So that's moving toward having a way to precisely document it.
And a way to track copyright owners. I know the issues of privacy are big, but we need to know who the copyright owner of a file is, and copyright ownership is so badly described in source code. For example, a lot of code from Eclipse says the copyright owner is IBM plus the contributors listed below, and then you go into a huge list of contributors. It's not clear what they contributed or what kind of claim they have on the files. In many other cases people just remove them over time, because they say: oh, the copyright is here, but this is just extra information, let's remove it. That information is very easy to lose.
So how can Debian maintainers help? I think that Debian has the advantage: it has the credibility and the way to be listened to. It's a lot of hard work, but somebody has to do it, and Debian maintainers, well, you're not the ones that will run away from a good fight. These are some suggestions.
One of them, and perhaps one of the most important ones, is to incorporate licensing into the process. I know that you have a copyright file, but if you look at the way that copyright files are created now, they come in all flavors. Some people just cut and paste every single license that they found in any file and put it in the copyright file. Some people clearly say this system is under such-and-such license, et cetera. So it needs to be more documented as part of the process.
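
The marker-based header scripts mentioned earlier in the talk can be sketched roughly as follows. This is only a minimal illustration, not the script any project actually runs: the marker strings `BEGIN LICENSE BLOCK` and `END LICENSE BLOCK`, the function name `update_header`, and the `owner`/`me` parameters are assumptions made up for the example. It replaces the text between the markers only when the person running the tool is the copyright owner, and leaves every other file as it is.

```python
import re

# Assumed marker text; real projects choose their own begin/end lines.
BEGIN_MARK = "BEGIN LICENSE BLOCK"
END_MARK = "END LICENSE BLOCK"

# Match the marker line, everything up to the end marker, and the end marker line.
_BLOCK = re.compile(
    r"(?P<begin>^[^\n]*" + re.escape(BEGIN_MARK) + r"[^\n]*\n)"
    r"(?P<body>[\s\S]*?)"
    r"(?P<end>^[^\n]*" + re.escape(END_MARK) + r"[^\n]*$)",
    re.MULTILINE,
)


def update_header(text, new_lines, owner, me, prefix="# "):
    """Return text with the marked license block body replaced.

    The file is returned unchanged when `me` is not the copyright `owner`,
    because only the owner may edit the license statement.
    """
    if owner != me:
        return text
    body = "".join(f"{prefix}{line}\n" for line in new_lines)
    # A function is used as the replacement so backslashes in the body stay literal.
    return _BLOCK.sub(
        lambda m: m.group("begin") + body + m.group("end"), text, count=1
    )


if __name__ == "__main__":
    src = "# BEGIN LICENSE BLOCK\n# old address\n# END LICENSE BLOCK\nprint('hi')\n"
    print(update_header(src, ["new address"], owner="alice", me="alice"))
```

Files without the markers come back untouched, which is the cumbersome part: every header has to be marked up by hand before a script like this can help.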
For example, you have the patch tagging guidelines. Why not add the license to a patch, so it's clear what license is being submitted with the patch?
Verify the licenses on the source code. You do that, and you have been great, you have done a lot of work in that direction, but it needs to be done a little bit more. For example, force them to use canonical forms. If a license statement says "this file is under the GPL", force them to use the wording that the Free Software Foundation uses, because that helps everybody. And it's not a matter of saying "you have to do that", but of explaining the benefits that everybody gains by doing that.
Disambiguate. If a license is ambiguous, then get clarification from upstream and make it part of the license statement of the files. There will be many cases where it cannot be resolved. Then document that, because if you went through that, it's likely that other people will have to go through it again, and there's no point in duplication of effort. That's part of the SPDX effort, to try to avoid that.
Document how the binaries are created, because otherwise it's difficult to know which source code creates a binary, and it's very important to be able to know that, and how the source code interacts. And flag the files that are not used in the creation of binaries, like saying: these files are part of the distributed source code, but they don't really impact the deliverable, which is the installable binary.
Break up binary packages.
So that each component has the same effective license, because we often see packages that have many different licenses. I know that there are some packages, like ImageMagick, which is one of the worst, where each file is listed with its own license and there are many, many different flavors. But there are cases in which you have GPL version 2 with MIT, and you can say: well, this effectively works as GPL version 2, and everything that you find in here will effectively work as GPL version 2. But if we have some binaries that are GPL version 2 and some that are GPL version 2 plus, that creates a problem for the people that are using this dependency, because if it's GPL version 2 plus, under version 3 they will be able to use it; if it's GPL version 2, they will not be able to use it. Or maybe they will be able to, let me just finish this, they will be able to use it for a while, but they will not be able to upgrade. I'm almost done.
Okay, and document the types of dependencies. Is this a library? Say it in the spec files: this is a library, a plug-in, an executable, configuration, installation. So we can do a little bit more analysis from that data.
Okay, just to end: if you have interesting license stories, then let me know, and if you have interesting license problems that you think we can help with, then let me know. I'm very interested in it. Okay.
Very interesting.
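
The "effective license" point from the talk can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not a legal analysis: the license labels, the `ALLOWED_GPL_VERSIONS` table, and both function names are invented for the example. Each component license is mapped to the GPL versions a combined work could use, and the effective set for a package is the intersection over its components.

```python
# Toy table: GPL versions under which a work containing a component can be used.
# The labels and the table itself are illustrative assumptions.
ALLOWED_GPL_VERSIONS = {
    "MIT": {2, 3},               # permissive: assumed usable under either version
    "GPL-2.0-only": {2},
    "GPL-2.0-or-later": {2, 3},
    "GPL-3.0-or-later": {3},
}


def effective_gpl_versions(component_licenses):
    """Intersect the GPL versions allowed by every component in one package."""
    versions = {2, 3}
    for lic in component_licenses:
        versions &= ALLOWED_GPL_VERSIONS[lic]
    return versions


def split_by_effective_license(components):
    """Group component names so that each group shares one effective license."""
    groups = {}
    for name, lic in components.items():
        key = frozenset(ALLOWED_GPL_VERSIONS[lic])
        groups.setdefault(key, []).append(name)
    return groups


if __name__ == "__main__":
    package = {"libfoo": "GPL-2.0-only", "libbar": "GPL-2.0-or-later"}
    # As one package, only version 2 is left, so a user cannot move to version 3.
    print(effective_gpl_versions(package.values()))
    # Split into two packages, libbar alone still allows version 3.
    print(split_by_effective_license(package))
```

Seen this way, splitting the package does not change anyone's rights; it only makes the answer visible to an automatic analysis that works at package granularity.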
Thank you. I've got a few questions about this page in particular. First of all, where you've got MIT and GPL v2, and GPL v2 and GPL v2 plus, I didn't see any difference between those, because both of those you can use under the GPL v2 and nothing else. In the case of GPL v2 and GPL v2 plus, you can treat everything as GPL v2, which is the same as the previous case, surely.
That's right, but let's say that you have two components within that binary package. One component is GPL version 2 and another component is GPL version 2 plus. You want to use the GPL version 2 plus one only. Okay. Now I'm an application and I'm using that as a library, and my version is GPL version 2. Fine, everything works perfectly. But tomorrow I want to migrate to version 3. I cannot do it, at least not in the way that it's being packaged.
I don't think copyright follows package dependencies. I think if you have a single package that contains two libraries, one of which happens to be 2 plus and one of which happens to be 2, and you're linking against the 2 plus one, there's no problem with that.
I understand it makes life difficult for people like you who are trying to analyze these interactions and dependencies, but that's not the primary reason.
No, I understand that, and you need to see it from the point of view of the people using the packages. Okay, so we're intermediaries, we're just messengers. The real problem is for the people who are doing their analysis and saying: well, this looks like we cannot migrate to the next version of the GPL. And then they have to look more carefully into why. If the packages were just split, they would say: oh, that's not a problem, we can migrate automatically. Okay, so that's mainly the rationale, that then you can do automatic analysis of this.
The last couple of points. First of all, have you noticed that there's currently a proposal for a machine-readable copyright format for Debian copyright files? So whilst at the moment, yes, everything's free-form, there is a proposal for making them standardized, with fields and stuff. So I don't know how that interacts with your work. And the last thing was with copyright contributors.
I know certainly a lot of upstream projects as well, and not just when we're summarizing things: when you've got patches from hundreds of different people to different parts of different files, it's not particularly feasible to track that these lines here were contributed by this person. Actually, these lines here might be contributed by several people, and so on. What they'll tend to do, if they do it at all, is keep a top-level file and say: look, these people have contributed to this, but it's not feasible to track all of the individuals.
I think that version control is the solution there. I don't think that any of the version control systems include the license as part of them, but I think that it's an attribute of the delta: who made it, and whether that's a copyright owner. And I think that the way the Linux kernel is doing that goes in that direction. Okay, but I agree with you, it's very difficult. Now, what we have to remember is that the metadata in Debian is very useful for Debian, but I think it has to be pushed to the applications upstream, because then everybody can benefit from that.
So on Monday Bradley was talking about a solution, in this case for documentation, to be able to be used in a broader context: that it would be disjunctively licensed, CC BY-SA and GPL. I'm wondering if in your analysis you've come across disjunctive licensing, or if you might have best practices for how to do that, so that future analysis can grok that properly.
I think that in terms of the disjunctive licenses, what I think is the best practice to this day is when an organization says this is our license, which is a disjunctive license,
which basically fits into this model that I call recursive, because they clarify exactly how each one of them works. Mozilla, for example, has done it very well: they say how each one of the licenses interacts with the others and under which conditions you can use one or the other. But the way I think it has to be seen is by saying we're no longer dealing with three ORs or two ORs, but rather with a license that is subdivided into the following.
Steven
Another thing to think about here as well is the fact that you have to be cognizant that whether or not you're forming a derivative work is necessarily jurisdiction dependent. So it depends upon which copyright regime you're operating under, and that's something that complicates this. It cuts in different ways, and the different ways it cuts change depending upon court decisions. So not only does it depend on the regime, it depends on the current legal outlook and your current risk assessment. And there's also, underlying this, the question of copyrightability, I mean whether the work actually forms a work that is capable of being copyrighted. So all these things make any type of automated license analysis, for the purpose of demonstrating whether it's okay or not okay to use a work, very fraught. I mean, at the end of the day, if you want to totally manage your risk, you have to do it manually, or use tools like this as a starting point. At the end of the day you have to go...
At the end you said exactly the answer.
So what we really want, and when I say we I don't mean I as a researcher, because I really don't care, right, ultimately, because I do write research and tell whether there are bad things or good things. We as a community, what we need is tools that allow us to concentrate our efforts where the efforts need to be. So we can say: those packages have just one license, and we don't really have to worry, because they are all nice and everything is nicely packaged. But these are the packages that have such a mix of licenses that they become a land mine field, and we have to be very careful how that is used. And that includes issues of location around the world and, as we were talking about yesterday, how the legal framework changes with time. So we can be aware that those are the areas where we have to concentrate.
I can see why this stuff doesn't get challenged very frequently, and maybe I just don't know the answer, but should we also track whether the entity that owns the copyright actually is a legal entity? Let's say a person is okay, right? He exists. There may be even a project, because it doesn't own the copyright, or any team, right? What is the legal entity we should go for? Because many teams, instead of listing a long list of contributors, just go for some meta name, right, and they go with it. Is that okay? Should that be somehow clarified?
That's a good question. So one of the problems with, for example, the way that IBM does its copyrights is that they say the copyright is IBM's and they put their employees as contributors, who are actually not the owners of the copyright, because the owner is IBM itself.
So we have extra information. So, to be precise and give you an answer: I don't know, and I'm not the person who's capable of answering that. I just think that we need to worry more about where the source code comes from, so we'll be able to go and ask them later in the future, when something goes wrong.
Just a question along that line, and I guess to Don's point also. I mean, if I submit a patch or something, I don't want to be connected to it at all, and I'm just wondering if there's any, you know, I don't want anyone to follow up. I want some way to just give it away completely.
Right, so do you see dangers there? Say it again? Right, if the copyright owner is allowed to say this is put into the public domain, right?
Yeah. Okay, so in a number of jurisdictions it is very hard to put things in the public domain. Specifically in this jurisdiction, the only way into the public domain is if you die and then wait 75 years, or if it's funded by the federal government. You cannot just say this is in the public domain. Well, I mean, in some other jurisdictions there is no concept of the public domain. So it's never a safe thing to do.
So let me put it from a pragmatic point of view. What we need is to document it. To say "this one, we don't know where it comes from" is better than to say: of all the patches, I need to go through each one of them and see which ones are the ones I know and which ones I don't know.
I guess the issue there is that now you're locking it up. Now no one knows what to do with that component. I guess I'm looking for a way that somehow we can just, I don't know. I know what you're saying, legally we can't put things in the public domain, but just get them out.
Can't you just assign copyright to whoever you're giving the patch to?
That's the other solution many have used, right, assigning copyrights to somebody else.
I mean, to whoever. Yeah, but I mean, you're gonna have to declare something, right?
But notice that there is a difference between the philosophical issues and the pragmatic issues, right, and I think that we have to find a place in the middle. Basically what I'm saying is that this information has not been properly documented up to today, and I think that's the next step that we need to take to move forward. Whether this is the solution or not, that's just basically a proposal. I think that ultimately what we want is the information to be available, so the people who care can find it and we don't have this duplication of effort everywhere.
So I'm happy to see your proposal for ways to improve, and it definitely would be good to make public some sort of concrete proposals for package maintainers to be able to follow. And specifically I'm curious: do you recommend things like just trying to convince upstream and package maintainers to use boilerplate licenses that exist, referenced as files on the system, or should all files include the full license in the header?
The full license statement should be in every single one of the source code files, because source code files are copied and moved around.
So the full GPL v3 copy?
No, but the license statement, which is the reference to the license, that will say this file...
Right, but then you're referencing something else that is in an unclear...
But everybody knows how that is done. So what I'm saying is that there are canonical ways in which licenses are referenced. Eclipse publishes the manner, Apache publishes the manner, Mozilla does it, the Free Software Foundation does it. So all that has to be done is to say: you say that this file is under the GPL, but it's not in the canonical form, so please do it in the canonical form.
This will go a long way in helping this analysis.
Yeah, I agree, because at least for the GPL the license reference is very clear. It says "as published by the Free Software Foundation", blah blah blah, so there's no ambiguity at all, I think.
That's right. Yes, and then we find files that say "this file is under the GPL".
Any more questions? Oh yeah, that's okay. We're not missing too much.
Just one more additional question. Has there been, or are you aware of, research where people are looking at, ignoring the license statements in files, the actual physical lines of code and their origin, to connect them between projects, to look for the type of cop... I mean, this is an unfortunate... just a question of, I mean, derivation.
Yeah, so I'm working with people in Korea now doing origin analysis of Java. Java is horrendous because there's no real package management at the installation phase. So what we are finding is that there's a lot of people that cut and paste, sorry, that clone entire subsystems and include them as part of the source code. Just look at GlassFish, as I mentioned. And in many cases they have these tools that go and blindly replace copyrights on every file. So we find a lot of source code files that have the wrong copyright owners when you look at the entire history. And so I'm convincing a lot of my colleagues in a field called clone detection that this is really where the problems are, and I think that's going to happen in the next two or three years, other people looking into this.
I call it massive clone detection. But what I really want to see, and we are working on some algorithms for doing that, is servers where you can basically say: tell me anything you know about this file. And it will come back and say: well, that file is present in such and such place, or those functions are present in such and such place. And so you are able to track that, because I think that plagiarism is potentially a big issue with the Java people.
All right, let's thank the speaker again.
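
The "tell me anything you know about this file" server described in the last answer could be approximated, very crudely, by an index of content fingerprints. The sketch below is an assumption for illustration only, not the algorithm the speaker's group is working on: `fingerprint`, `OriginIndex`, and the rule of dropping comment lines before hashing are all invented here. Dropping comment lines is what lets two copies match even when a tool has replaced the copyright header.

```python
import hashlib
from collections import defaultdict

# Line prefixes treated as comments (a crude rule suited to Java-style sources).
_COMMENT_PREFIXES = ("//", "/*", "*")


def fingerprint(source):
    """Hash the code lines only, ignoring blank lines, indentation and comments."""
    kept = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith(_COMMENT_PREFIXES):
            kept.append(stripped)
    return hashlib.sha1("\n".join(kept).encode("utf-8")).hexdigest()


class OriginIndex:
    """Remember where each file body has been seen before."""

    def __init__(self):
        self._seen = defaultdict(list)

    def add(self, project, path, source):
        self._seen[fingerprint(source)].append((project, path))

    def lookup(self, source):
        """Return every (project, path) whose code matches this file."""
        return list(self._seen.get(fingerprint(source), []))


if __name__ == "__main__":
    original = "/* Copyright A */\nclass X {\n  int f() { return 1; }\n}\n"
    copied = "/* Copyright B */\nclass X {\n    int f() { return 1; }\n}\n"
    index = OriginIndex()
    index.add("projA", "X.java", original)
    print(index.lookup(copied))
```

Whole-file hashing only catches verbatim copies. Matching individual functions, as mentioned in the talk, would need finer-grained fingerprints than this sketch provides.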