So, in the last talk, we heard all about the problems of working with licenses and code, and what this talk is going to do is start to paint the picture of how we can finally get to the stage where the compliance information is automated, and make it a lot easier and a lot less manual going forward. So our challenge is how to automate this information. We've got two different points of view, right? FOSS projects are actually in a lot of commercial products today, and it's the commercial entities that generally care about making sure that there are accurate summaries of the licensing. The people who are developing the software know what their intentions are, and they'll put that in. But what they put in is right at the time it was put in; it doesn't always follow the code over time, and it doesn't necessarily stay accurate over time. We've been finding that someone will contribute some files to a project under one license, and then those files will move to another project. So someone may say, OK, see the license.txt file at the top to understand the licenses. Well, they've moved the file into another project that doesn't have a license.txt file, or that has a COPYING file instead. So you have all these really interesting games to play to try to really understand what the licenses are. And so what we need to know is, for the source we're working with, how good is the licensing information in it, what is the quality of it? And we need tools, specifically open source tools, to support the tracking of this code. Companies can pay money to Black Duck or Palamida and so forth, people who have businesses for this stuff. But from the developers' perspective, working on this with open source seems to be the right way. So, right now, our commercial tools are focused on product development and auditing code bases.
And they've got limited support for helping developers track license changes as they develop the code. Our open source tools are also mostly focused on manually auditing existing code bases and summarizing the information. Of the tools that are out there, FOSSology has been around the longest and is what we're basing some of this work on. So we need to find a way to get open source projects to have machine-detectable, accurate licensing information, and to be able to keep it up to date as the software changes. That's sort of the goal. And we're making some progress now. With any hard problem, you have to start breaking it down, OK? And the steps are these. We need to be able to accurately identify the license associated with a file. Some developers like to say the license is just at the project level. That's been showing more and more to be a luxury we no longer have. Files move between projects, and tracking things down over time becomes a very, very manually intensive operation if you're trying to discover the root cause of things. We need to have a command line tool to summarize the licensing in a source file, OK? You want to be able to do greps and things like that to understand what's coming up. You need to be able to accurately summarize the licenses associated with every project at the time the project is built. If there are multiple options and conditions, there may be implications from that. You've got to be able to share the summary results of the license information for the project with others. And you have to have some command line tool that can summarize the file-level licensing information for a project: is it a single license? Are there multiple licenses? What is the story here? And then, for this to really be effective over time, it's got to be in CI loops. You've got to have it in build environments so that you can keep up as things go along.
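As a rough illustration of the kind of command line summary described here, the following Python sketch walks a source tree and tallies per-file license tags. It assumes that files carry a one-line `SPDX-License-Identifier:` comment near the top. The function names (`extract_tag`, `summarize_tree`), the 20-line search window, and the `NOASSERTION` bucket for untagged files are hypothetical choices for this example, not part of any tool mentioned in the talk.

```python
import os
import re
from collections import Counter

# Matches a one-line tag such as: // SPDX-License-Identifier: GPL-2.0
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*(.+)")


def extract_tag(text, max_lines=20):
    """Return the license expression tagged in the first lines of text, or None."""
    for line in text.splitlines()[:max_lines]:
        match = SPDX_TAG.search(line)
        if match:
            # Drop a trailing C-style comment closer ("*/") if present.
            value = match.group(1).strip().rstrip("*/").strip()
            return value or None
    return None


def summarize_tree(root):
    """Count files per license tag under root (illustrative only)."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="replace") as handle:
                    head = handle.read(8192)  # tags are expected near the top
            except OSError:
                continue  # unreadable file: skip it in this sketch
            counts[extract_tag(head) or "NOASSERTION"] += 1
    return counts
```

Calling `summarize_tree(".")` in a checkout would give a count per identifier, which is the sort of number a CI job could compare from build to build. A sketch like this only reads explicit tags; it does not recognize license text the way a scanner does.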
So SPDX, at the very start, was meant to address parts of this. At the time, it was focused mostly on getting a standard together so that we could accurately identify the licenses associated with a file. Well, the SPDX license list has got 300 licenses with standard identifiers, and you can put a one-line comment into each file and identify the license without too much pain from a developer perspective. The command line tool is an open question. We need to be able to accurately summarize the licenses associated with a project every time the project is built. Well, you can generate an SPDX 2.1 document for each build of a project, which has a signature of the files and a signature of the project itself. Then you have all the information there, so you know what has changed from revision to revision. So being able to generate the SPDX document can address that. Then, being able to share the results with different consumers. Most of the developers I know would prefer to see things in the tag/value format, on the Linux side, anyhow. Most of the Java guys prefer the RDF. And most of the lawyers prefer a spreadsheet. With SPDX, we've been very careful to preserve the information so you can translate back and forth between all the formats and look at it. So you sort of go, well, an SPDX file: is it RDF, is it tag/value, or is it a spreadsheet? And if it's not the one you want, there are tools out there to convert it. And then having this command line tool is also an open question. So obviously, we want open source tools, and all open source builds upon prior open source. So we're starting here with FOSSology, which I will stress is in testing mode right now. We're trying to get ready for a release. And if anyone wants to go and help test it, that'll be very welcome. The source is up there. And is everyone familiar with FOSSology in this room? Who's not? OK.
Then, FOSSology is a scanning tool. You point it at a file or a project, and it will scan through with your choice of one or multiple scanners, looking for specific expressions that commonly indicate licenses. It's got Nomos, Monk, and Ninka, which are three scanners that you can choose from based on what you're looking for and how strict you want to be. Each of them has different heuristics. And then it gives you an interface, if you're working with it interactively, that lets you recognize things and assign things and effectively go through a clearing process to say, OK, I've gone through all this code, I've recognized everything. It makes it easier for you to manually look at a project and then generate the information, so you've got confidence about what the licenses are when you ship it. As of the 3.1 version, it's now able to generate tag/value, so you can get a per-file license list. However, it does need better heuristics on the command line side. cp2foss needs to be able to automate the decisions that are going on. The interactive side works well, but this is an area we're seeing as one to focus on. There are also the SPDX tools, which let you convert between formats. They're out there today. Mostly, they're there to validate that you actually have a valid SPDX file. And one of the things that's looking like it's needed is something that takes an SPDX file and just summarizes the licensing information. There are other research projects out there that are asking, is this licensing consistent? Does this work well? So there are always things you can add.
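To make the "take an SPDX file and just summarize the licensing" idea concrete, here is a minimal Python sketch that reads a tag/value document and counts the `LicenseInfoInFile` entries. `FileName` and `LicenseInfoInFile` are tags from the SPDX 2.1 tag/value format; the function name `summarize_tag_value` is hypothetical, and the parser is deliberately simplified: it assumes one `Tag: value` pair per line and ignores multi-line `<text>` blocks, so it is not a replacement for the SPDX tools.

```python
from collections import Counter


def summarize_tag_value(text):
    """Return (number of files, Counter of LicenseInfoInFile values)."""
    files = 0
    licenses = Counter()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        tag, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Tag: value" line in this simplified reading
        tag, value = tag.strip(), value.strip()
        if tag == "FileName":
            files += 1
        elif tag == "LicenseInfoInFile":
            licenses[value] += 1
    return files, licenses


# A tiny hand-written sample, not output from any real scan.
SAMPLE = """\
SPDXVersion: SPDX-2.1
FileName: ./a.c
LicenseInfoInFile: GPL-2.0
FileName: ./b.c
LicenseInfoInFile: GPL-2.0
LicenseInfoInFile: MIT
FileName: ./README
LicenseInfoInFile: NOASSERTION
"""

file_count, license_counts = summarize_tag_value(SAMPLE)
for license_id, count in license_counts.most_common():
    print(f"{license_id}: {count}")
```

A file with two `LicenseInfoInFile` lines contributes two references, so the reference total can exceed the file count.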
And then the other open source tool we're looking at for this is the ELBE system, which builds a Linux root file system based on Debian packages. It needs to be able to produce an accurate summary and statistics of the licenses each time it builds a root file system. So this is part of trying to pull together all these pieces that are out there and see if we can get something useful. And with that, I'm going to turn it over to my colleague.
So let me give you some very quick information about ELBE itself. It's not a distribution. It's not a tool to build a distribution. There are a lot of tools out there to build distributions: Yocto, PTXdist, you name it. The problem for a lot of companies, especially smaller companies, is that once you build your stuff with Yocto, you become a distributor. That means you are responsible for tracking bug fixes, security bugs, and all this kind of stuff. So we took another approach and said, we want to have a tool where we can leverage an existing, well-maintained distribution. And that is what ELBE does. It builds Debian-based root file systems for embedded systems. It generates a Debian-based development system, matching the target, for application developers. It's highly customizable. And it's fully reproducible. There is the link to the project homepage. It's open source. It's written in Python. Here's the quick overview. You have an XML description of the project. That covers all the packages and the rules. You can downsize after packaging, when it generates the final disk image. You get the packages from either the Debian repositories or your private repositories. You can integrate your applications in a separate repository. It generates the application development kit and the images, the rebuild CD, the source code CD, and the licenses file. So here is a slightly better overview. You have your customized Debian package pool.
That's basically what the application developers create. Then there are the official Debian pools or a mirror. And then everything is converted into the target system, the developer system, and the various surrounding information. So, on the license reports: ELBE generates a license report today by parsing the Debian package information. That's not really accurate. And it doesn't give us a lot of statistical information about the accuracy of what we are doing. So there's no SPDX support. There is an SPDX branch which generates the existing information in the SPDX format. But that doesn't help for the other things, like statistics. So we looked into integrating SPDX generation into the build process. But generating the SPDX information by scanning the source files every time you build a root file system is too time consuming. It's really overkill. And because the Debian source code doesn't change every day, we thought downloading pregenerated SPDX files would be the solution. So we wrote an SPDX generator. It looks for the nightly Debian update, feeds the updates into FOSSology, generates the SPDX file, generates statistics, and uploads the SPDX and the statistics to a public server. FOSSology is not really optimized for automated workloads. We had to write quite a few wrappers. It's pretty ugly. We need better heuristics for automatic conclusions, which are not there yet. The generator is a work in progress, and we really have to clean it up before we show it to the world. It's horrible. The generator service, though, we are going to provide continuously. So if you create a root file system, you get the binary CD-ROM ISO, the source CD-ROM ISO, your image, and now you get, as a separate piece of information, the SPDX. It's 7-Zip because there are people working on Windows, and they hate tarballs. 7-Zip just works on both Linux and Windows, so we're happy to do that.
So OK, the ELBE integration for the SPDX tarball generation is available in the ELBE repository. There's an ELBE SPDX branch now. And the repository of SPDX files is available at our home page. It looks like a Debian pool. It has the same directory structure. So it's analogous: where you find the source files, under the same directory and path name you find the SPDX and the statistical information. So, there was something else I wanted to show. We generate statistics, license statistics. That's Linux kernel 3.16.39. I also ran it on a more recent kernel; the numbers are very similar. So yeah, that's stupid. OK, it should be 40. OK, that's silly. But that was me reformatting the spreadsheet. So we have 1,400 references to GPLv2. We have 10,000 references to GPLv2+. So that's 40.8% and 90.6%. The files on the web page actually have the real numbers. So where is the other file? There is the SPDX file, which is generated. It contains all the file names. That's basically what comes out of FOSSology. And then it generates other statistical information. We just use the MIME types which are available in FOSSology, which are not really nice to read. But it's pretty clear. It's text/x-c, so the C files, header files, makefiles, assembler files, and whatever. We actually have C++ files in the kernel. So you see the total number of files. That's the total files in that kernel repository. Those are the files which actually have license references in them. So we have, OK, that's 63%. We have total license references, 34,000. That means we have more references than files with references. That's because we have a lot of files which are dual licensed, MIT and GPL, or BSD and GPL, and whatever. But now here we come to the C files. We have 34,000 C files, and 26,000 have a license reference in them which is machine identifiable. So that's roughly 75%. I think the current kernel code has close to 80.
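The coverage figure quoted for the C files is simple arithmetic, sketched below in Python. The two input numbers are the rounded values spoken in the talk (34,000 C files, 26,000 of them with a machine-identifiable license reference), not the exact figures from the published statistics; the function name `coverage_percent` is made up for this example.

```python
def coverage_percent(files_with_reference, total_files):
    """Share of files that carry at least one license reference, in percent."""
    if total_files == 0:
        return 0.0
    return 100.0 * files_with_reference / total_files


# Rounded numbers as spoken in the talk, used here only for illustration.
c_files_total = 34_000
c_files_with_reference = 26_000

share = coverage_percent(c_files_with_reference, c_files_total)
print(f"C files with a license reference: {share:.1f}%")
```

With these rounded inputs the result is a little above the "roughly 75%" mentioned in the talk. The same ratio, tracked per package and per build, is the kind of metric that could show whether license coverage is improving over time.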
So we are adding more files which have license references, but nobody cleans up the old ones which do not. So we are trying to get that sorted. So, OK, that's it. Questions? No questions?
Not a question, but hopefully a useful observation. You asked about tools that will scan a source tree and tell you what licenses are in it. I needed to do this for Mozilla. I confess I tried using FOSSology and couldn't understand it. So I didn't use FOSSology.
I can understand that.
Yes. I hope maybe there are some people here who can understand that I didn't understand FOSSology. So what I did was I wrote something called SLIC, which stands for Speedy License Checker, which looks through a tree of files and tells you the SPDX identifiers for every file in there, using various regex magic. And it's on GitHub at github.com/gerv/slic.
Yeah, but the problem is, if you have SPDX license identifiers, it's easy.
No, no, no. It doesn't just extract them from the files. It reads the files and works out the identifiers from the license text. So it has a big database of different sorts of license text.
That's basically what FOSSology does as well.
For those who can't use FOSSology, I'm just saying there is another alternative out there.
Right. Yeah, there's quite a bunch of tools out there. And we probably need one tool set which works.
Can I go first? No? So people are consistently bad at putting licenses in their GitHub repositories and things like that. Do you see a chance that we move to a world where you just have your license specified in the file, and then some automated tool will in the end generate a project license, which is then usually not just one license but a combination of several, automatically?
Yeah, I mean, that's the goal.
So if you look at the statistics, and that's one of the reasons why we wanted to generate the statistics, you can take the package statistics and figure out how bad a project actually is. And then you can create shame lists, public shame lists, which work very well. So people get their act together and clean it up. That's one of the things. But today, people do not have metrics. They do not even know how bad their code base is in terms of licensing, because nobody tells them. And they have no tools to look at it. And they don't care. But when distros start to care about that and say, OK, here is the information, we are no longer accepting projects with unspecified licensing and whatever, so please get your act together and clean it up, that might help. But we need the tools, so that people packaging packages can actually get numbers out of it and get reasonable information about how bad it is. People want to see it.
One of the things you should do: on the link we're giving, we've literally got the entire Debian distribution with a first pass of what the SPDX tools are generating. So if you've got specific packages you're interested in or care about, go look at them.
Yep. So there was a question over there.
Yeah. I was wondering whether these tools could work with pre-compiled code or binaries. Can they scan those, or do they have functionality like this? Can FOSSology see, for example, that you were using FFmpeg when it is already compiled?
No, not at the moment. That's a different class of analysis. We are really looking at source code. That's what we're looking for. There are scanners for binary matches out there, but that's a totally different playground. There's a question behind here.
Thanks. A question as a lawyer. What I'm interested in: as far as I understand you, a tool is just as good as the people who are maintaining the directories in terms of the license.
For example, if I just say I've got a GPLv3 third-party component here, but I failed to add the copyright information, the SPDX wouldn't do any binary search and tell me who the copyright owner is, right?
In the SPDX file, the copyright owner information can be recorded based on what the tool finds in the file at the source level. And that's about it.
Thanks.
I mean, there are a lot of gaps in the tools, but we have to really use them and then fix them. So, more questions? OK. No, there's one. Last minute.
So you said that you were looking for a command line tool that did something, but there are a lot of command line tools that do different things. So what exactly is the tool that is missing? What should this tool do that all the existing tools don't do?
So first, a proper command-line-driven interface for continuous integration checking; none of the tools I'm aware of actually does that. There have been approaches to do that. What is it called? DoSOCSv2 tried to do that. Yeah, they tried to do that, but it's not really useful, and it's unmaintained as far as I know. FOSSology is really GUI and interactively driven. It's developed for manual license clearing, and we can reuse parts of it. And then, of course, there are other things missing, like specific heuristics to conclude license information. So there are a lot of bits and pieces out there. We have to connect them together and feed them into a framework which is actually usable for all kinds of workloads, so that we don't have 55 different scanners out there which all give you 100 different results. That's the state of the union at the moment. If you try 10 scanners, you really get 12 results.
Well, with that, we're out of time. Thank you very much, Kate and Thomas.