My name's Mark Hatle. I work for Wind River, and they've given me the opportunity to work on some licensing things. Specifically, integrating the SPDX format, which Wind River is behind and is a member of the SPDX group, into the Yocto Project for automated, machine-generated license data, as well as related things to eventually help track software licensing. Some of you may have been at ELC this spring when a little bit of this was presented, but that was before we had actual code. Some of you may have seen, two years ago, where I presented a prototype; it was just a proof of concept and nothing from it ever went in, because it wasn't useful as such. We are finally beyond the proof-of-concept stage. So if you have seen any of those presentations, please bear with me: I'm going to spend about the first 15 minutes re-explaining the problems people are having, so that you have the context to understand why we did what we did and what the direction is moving forward.

To start with, these are my opinions. I am not a lawyer, and I'm not giving you legal advice; if you have a question, contact your own lawyers. However, I'm happy to tell you what I think about the way things should work, and the way we have implemented various features and why. And here's my disclaimer so that my company does not get mad at me: this is work they allowed me to do, not work they asked me to do, which is a nice opportunity.

Just to introduce what we're talking about, there are really two projects going on here. There's the Yocto Project, which most people are familiar with by this point: it's the cross-development build system for creating a custom distribution for an embedded operating system. And then there's SPDX. All SPDX is, is a standard interchange format that allows somebody to specify the licensing for individual files, as well as for the software as a whole, for a package or a set of source packages. The big thing is that it's finally a specification, like PDF, so that everybody can look at it and go, okay, at least I understand what these fields mean. Now we have to get the right data into those fields, and we have to have enough people using it to build enough momentum to actually make it useful over time. The momentum has been the hardest part with licensing: everybody wants licensing data, but nobody wants to do the work because nobody else has done the work yet. Hopefully this will help bootstrap some of that process.

So first: what problem are we actually trying to solve? What is the software license of your product? What makes it up? Is it open source? Is it proprietary? Did you write it yourselves? Did you contract it out to somebody? Did they use code they shouldn't have? Did they link to code they didn't realize they were linking to, which may have license ramifications? The end question is: what are your actual obligations when you ship a product?

An example: BusyBox. It's obvious, GPL-2.0, right? Well, no, because if you actually look at the BusyBox source code, you will see that they copied sources from other projects to make their own package. So is it GPL-2.0? Only your lawyer can actually answer that, but this can help. BusyBox, like I said, is interesting because it consists of files from many projects. For example, the signal file says "I'm GPL-2.0." The run-shell file says "I'm MIT licensed."
The math file says "I'm GPL-2.0, but I'm also BSD-3-Clause, and I'm also MIT." Why does this matter? Well, when you compile BusyBox, you're going to compile those three items, and of course many more, into your application, and so your application inherits the source code licenses. In essence you now have this wonderful map, and you have to put it together and explain to your lawyers what software you're using and what the actual license is. Which gets you down to the actual license for that particular configuration: you have to meet the requirements of all of those individual source licenses together.

So what you would get in an SPDX document is that the individual files for a particular configuration would look something like this. You would also get something indicating that the maintainer declared this as GPL version 2; what the maintainer of the package thinks it is definitely matters, because they're one of the people who would cause a problem if you did not meet the obligations. And finally, there's a license concluded by whoever reviewed the package, which would probably be your legal organization. In this particular case, with this particular configuration, it's roughly agreed that GPL version 2 is what you have to meet in order to release this configuration. That's a fairly simple example on a fairly complex package.

Now look at your actual product: here's your file system image, made up of all these components, which is part of a product that also has a bootloader and a kernel, and suddenly you have a much wider mess. How do you coordinate all of these things together to figure out what the actual software information is for your product as a whole? That's really why SPDX is there. We have to have a common interchange format so that we can create tools to help automate this, because we have to bridge the gap between engineering language, legal language, our marketing departments, and anybody else who's involved.

A quick overview of what the SPDX files themselves actually contain. There are really five sections: the specification information, which is just which version of SPDX we're using and a little bit of information around that; who created the individual document we're looking at; and then the stuff we actually care about, which is the package information, the individual file information, and the licenses that were found in the system. I'm only going to give you an example of what I think is the important information, such as the package section. This is one I actually pulled out of some zlib automation. As you can see, the system recorded where it downloaded the package from. The concluded license says NOASSERTION, and the reason why is that it was automatically generated by a system: the system cannot assume what the license is, it can't conclude it, only a person can conclude it, so NOASSERTION is the default. The declared license is what the package itself said it is, and here it's license reference zero and license reference one, which we don't know the meaning of yet; eventually I will show you that they are defined at the bottom of the document.

The next section of the document is the individual file information. It tells you the file type, source or binary (there are theoretically some other types you can put in there, but source or binary is really what you pick); then the license for that file as determined, and again, concluding is something an automated system would not do, that's something a reviewer would do; a checksum, to verify that if the file changes, we know we have to re-review the licenses; the copyright text that was found in it; and finally, the file name within that package.
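To make those fields concrete, here is a rough sketch of what the corresponding parts of an SPDX tag-value document look like. The tags follow the SPDX tag-value convention, but the package, path, checksum, copyright, and extracted text below are illustrative values of my own, not the actual output of the run being described:

    SPDXVersion: SPDX-1.2
    DataLicense: CC0-1.0
    Creator: Tool: fossology+spdx

    PackageName: zlib
    PackageDownloadLocation: http://zlib.net/zlib-1.2.8.tar.gz
    PackageLicenseConcluded: NOASSERTION
    PackageLicenseDeclared: (LicenseRef-0 AND LicenseRef-1)

    FileName: ./zlib-1.2.8/inflate.c
    FileType: SOURCE
    FileChecksum: SHA1: 0123456789abcdef0123456789abcdef01234567
    LicenseConcluded: NOASSERTION
    LicenseInfoInFile: LicenseRef-1
    FileCopyrightText: <text>Copyright (C) 1995-2013 Mark Adler</text>

    LicenseID: LicenseRef-1
    ExtractedText: <text>...permission is granted to use this software freely...</text>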
And now here's the magic: the license information. The license information section says the system found a file that says "please see the online publication." Well, that's not terribly useful by itself. So a reviewer would go in, read that text and say, okay, I need to go find what they're talking about, read it in context, make a determination, and they would probably end up manually changing that section. Another extracted text says "GPL, something, something, something"; okay, this is probably GPL. So the system pulls that out and says zlib has something in it that's GPL and something in it that is likely a zlib license. And again, a reviewer, someone, has to make the determination at the end. Most likely, and this is what I'm used to seeing, the GPL reference isn't the license of the overall software; it's the license for configure or config.guess or something along those lines. That's when you, as an engineer, need to work with your legal organization and explain to them: no, really, the thing that says GPL here is never used on the target, we never actually deploy that component, it's only used for building. That helps them determine whether or not you need to meet the GPL obligations on this particular package.

Okay, so I gave you an introduction to what SPDX is, but how do you actually generate it? There are really three ways. There's a good way, which is good enough for most people to get started, and that's the automated, machine-generated way. License information is generated cheaply; we can process thousands and thousands of files; the computer just goes off and does it in a corner, it just takes some electricity. However, it's only "good"; we need a better answer. The better answer is that a human comes in, looks at the output of the automated machines, verifies it, and resolves any of those issues I just mentioned. For instance, it says there's a license somewhere else that I should go view; they go view it, read it, interpret it, and handle that, but they don't actually look at every single file in the system. The best approach, though, from a legal perspective, or so I have been told, is that humans should look at every single file, interpret every single license statement and copyright statement in the context of the source code, and make a determination for every single file. However, "best" is unlikely to happen, simply because it is incredibly time consuming. So the reality of the situation is that you're probably going to generate the data automatically, you're going to have somebody with knowledge review the questions, and then, if there's a component that is questionable or does not have a good licensing statement, that's when a human comes in and reviews that code line by line.

Machine-generated data is really what I'm talking about today, and the best way to do that today is Fossology. Fossology, if you're not familiar with it, was contributed by HP a while back (I think they're on version three or four now), and it does pattern matching and keyword scanning through the source code and identifies the licensing of the components.
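As a rough feel for the kind of keyword scanning being described, here is a toy sketch of my own; it is not how Fossology is actually implemented, whose scanners are far more sophisticated, but it shows the basic idea of matching license keywords in files:

    # Toy license keyword scan (illustrative only).
    import re

    PATTERNS = {
        "GPL-2.0": re.compile(r"GNU General Public License.*version 2", re.I | re.S),
        "MIT": re.compile(r"Permission is hereby granted, free of charge", re.I),
        "BSD-3-Clause": re.compile(r"Redistributions? in binary form must reproduce", re.I),
    }

    def scan_file(path):
        """Return the set of license identifiers whose keywords appear in a file."""
        with open(path, errors="replace") as f:
            text = f.read()
        return {lic for lic, pat in PATTERNS.items() if pat.search(text)}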
Fossology does not output SPDX natively; it just goes and does its scans, generating a report and the metadata that you would use to build up the results. But the University of Nebraska Omaha has a research grant that allowed them to create a module for Fossology that generates SPDX based on Fossology's output, and that's what the Fossology+SPDX project is: real-time scanning using Fossology, with output that is in SPDX format or can easily be transformed into SPDX format. And it matters whether it's SPDX or merely transformable, because SPDX is very text-based and is difficult to process in some ways. So they use the JSON format or the plain-text format (the examples I showed were plain text), but the JSON format is simply much easier for automated tools to process, so that you can then convert it into SPDX or do further processing on it. At the bottom of the slide, if you're interested, is the actual Fossology+SPDX project, and this is their API.

One of the things I want to mention about Fossology+SPDX (and I didn't realize this going in, because I've been working with these folks for a while) is that they're not the CS engineers you're traditionally used to seeing work on things like this. The Fossology+SPDX folks are being trained at the University of Nebraska in information sciences: how do I process the data, how do I find the data, how do I turn it into something useful? And the components, as implemented right now, are actually very good if you put them in that context. However, if you look at it as a CS person, there isn't a lot of multitasking, there isn't a lot of parallelization, and a lot of the components are very, very linear. This is where we come in as engineers and say, you know what, it would be better and faster if you made the following changes. And they may come back as the maintainers of the project and go, yeah, but I don't really know how to do that. And then, of course, open source kicks in and we contribute and everything else. So this is an open source project.

Let's go back to the Yocto Project; this is really why I'm here. The build system, if you're not familiar with the Yocto Project, is roughly this: a standard embedded build system that starts with fetching the source code, then patches the application, configures, compiles, installs into a temporary directory; we then process the output, create some packages, and eventually create images. Very basic stuff. One of the things that's important to figure out is where in this process we need to capture that source code. We could capture it right after the source fetching, but as soon as we patch or modify it, we would have to go back and do it again. So the answer we came up with is that instead of scanning the raw sources, we patch the sources and then add a new task, the SPDX generator. That way we're always scanning the patched version, within the OE-Core environment; the patched version may differ from one machine to another based on various configuration options, so we know we're always scanning the thing that will actually be compiled, that will end up being transformed into binaries at some point. The do_spdx task that was added comes in via a bbclass: if you're familiar with the system, you simply add the spdx class to your user classes and it's automatically hooked into the build in the correct order; a simplified sketch of that wiring follows below.
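The mechanism behind "hooked in in the correct order" is BitBake's task ordering. This is a simplified sketch of my own rather than a quote of the shipped spdx.bbclass, and the task body is only a placeholder, but the ordering line is the important part:

    # Simplified sketch of wiring an SPDX scan task into the build order.
    python do_spdx () {
        # gather the patched sources, consult the cache, talk to the
        # Fossology+SPDX server, then write out the .spdx file
        bb.note("SPDX generation would run here for %s" % d.getVar("PN", True))
    }
    addtask spdx after do_patch before do_configure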
The task plugs in, and roughly what it does (and again, this is a very linear process, because that's what you do when you do data processing) is: clean up old log files, create some temporary information, and get the SPDX from a local cache if you have it. If you don't have it, that's when you prepare the sources for sending up to a remote server: you tar up the sources that were not in the cache, send them to the Fossology+SPDX server, and wait. The data comes back after it's been scanned; you then process that JSON data, store a copy of the JSON, and also store it off as an SPDX file so that it can be modified by other tools. And then you just clean up after yourself. This is the official workflow ("their" being the University of Nebraska Omaha, not North Carolina) that they're working off of. Very simple: at some point we break into the build, we do the SPDX processing, we send it off, it gets scanned, it comes back, or we pull from the cache, or we do both in order to get the data through. The eventual output is the SPDX file plus the manifest, and the manifest is simply the cache information.

Here's an example of something that was actually generated through this system. In this particular case you can see the version, SPDX 1; the data license of the SPDX file itself, which is set to a Creative Commons license; some very basic documentation that says what this is; and the creator, fossology+spdx, along with when it was created and who created it. Then you get into the package information, the file information, and eventually the license information that was pulled out by the Fossology+SPDX system.

Okay, so let's talk about what this actually is. In the Yocto Project 1.5, we did add this during development. It is not release quality; this is the first step towards getting something that's going to be useful, so I consider it a prototype, maybe even beta quality in some minds. What's in the release right now does not work if you just pull it out and try to activate it; you have to apply a few patches on top of it, and those patches will go into the Yocto Project 1.5.1 release, which will be out in a few weeks.

Basically, to enable it, you need to set up the Fossology server, and then you have to add the Fossology+SPDX module. It's very important with the SPDX module to make sure you configure Apache and PHP so they can access enough memory, because if you're processing the GCC sources, for instance, you can end up needing 700 or 800 megabytes of memory for a single PHP process, which for a normal web service would look like a denial of service; with Fossology, you may need that ability. Timeouts also matter: it can take 300, 400, 500 minutes to process GCC, whereas it takes 30 seconds to process bash, so you've got to allow enough timeout. There are other configuration settings as well; they're all in the Fossology+SPDX documentation. Based on how it's implemented right now, this would work fine internal to a company, but I would not want to expose such a server publicly. The sketch below shows the usual Apache and PHP knobs involved.
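To give a feel for the kind of tuning being described: these are the standard PHP and Apache directives for memory and request time, but the particular values are my own guesses for a GCC-sized scan, not numbers from the Fossology+SPDX documentation:

    ; php.ini -- let a single scan use a lot of memory and run a long time
    memory_limit = 1024M
    max_execution_time = 36000

    # httpd.conf -- keep long-running requests from being cut off
    Timeout 36000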
The other reason I consider this prototype or beta quality is that the Fossology+SPDX module is very linear in how it processes, and the Fossology system itself is also very linear. Fossology has a scheduler, but the scheduler only works on a package basis, so bash gets the same scheduling priority as GCC, even though GCC has enormously more files. You send the data up to get processed, it finds a node, finds something available to process it, starts processing, and it processes file 1, then 2, then 3, then 4. So it's a very long process no matter how fast your machines are. If you're going to be doing this in the near term, you definitely need a machine with very fast I/O, you need a lot of memory, and you should use a RAM disk if possible on the Fossology+SPDX side.

The other things to enable, like I mentioned before, are to add the spdx bbclass to your user classes configuration and, finally, to configure the settings for the SPDX class. If you look in the meta/conf licenses file, at the bottom of the file is all of the information that explains how to configure it; at some point we will add this to the documentation properly.

For real-world results, what did I test and how did I get these numbers? I set up a Fossology machine which was a little bit older: an Intel Xeon, 3 GHz, 8 cores, 48 gigs of RAM, and I did use a RAM disk after initially trying the RAID, which gave me about a 10 to 20 percent performance boost. The build machine, if you notice, is actually a little bit more powerful by core count, but the Fossology work wasn't really multitasking, so I wanted the 3 GHz over the 2.8, more RAM, and so on, on the scanner. The other thing you'll notice about my build machine is that it is intentionally very old (it's Fedora 13), and the reason why is that when I do builds, I want to make sure the build system itself is not affecting what we're measuring. I built core-image-minimal, which is one of the smallest images, just a default template out of the Yocto Project and OE-Core.

Like I said before, this is prototype quality and it didn't work right out of the box, so I had to make a few changes just to get it to work. The first two changes came from the University of Nebraska. One is additional license information processing: it knows about some additional tags that come down from the server and transforms them into the right settings. The other is the JSON support: originally, when this was implemented in the Yocto Project, the data came back in a tag format rather than JSON, and it was very difficult to do the data transforms we needed. They quickly made the change, but it was too late to get that patch into the 1.5 release, so I had to add it to my system; like I said, it should be in 1.5.1. There were a few general fixes I had to apply as well.

Then I said, okay, wait a second, we're doing something wrong in the caching. The cache file was keyed on the package name, but if I have a multilib configuration and build for my 32-bit and my 64-bit targets at the same time, I end up processing everything twice. If I'm building something for my host system and my target system and it happens to be the same source code, I'm processing those files twice too. A really simple optimization was to use a different name as the key: the BPN, the base package name, if you're familiar with the Yocto Project. And finally, when we did have to process the same thing multiple times, we weren't locking, so if we hit it at the same time we would still end up processing three or four copies of something, and when they came back they would all overwrite each other; a simple lock file was all that was needed. A small sketch of the cache key and lock idea follows below.
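A minimal sketch of that cache-key-plus-lock idea, written the way a bbclass task might do it. The cache directory variable is made up for illustration, but bb.utils.lockfile() and bb.utils.unlockfile() are the standard BitBake helpers:

    # Sketch only: key the SPDX cache on BPN so native/multilib variants of the
    # same sources share one entry, and hold a lock while (re)generating it.
    python do_spdx () {
        import os
        cache_dir = d.getVar('SPDX_CACHE_DIR', True) or \
                    os.path.join(d.getVar('TMPDIR', True), 'spdx-cache')
        bb.utils.mkdirhier(cache_dir)
        key = d.getVar('BPN', True)          # base package name, not PN
        cached = os.path.join(cache_dir, key + '.spdx')

        lock = bb.utils.lockfile(cached + '.lock')
        try:
            if not os.path.exists(cached):
                # ... tar the patched sources, send them to Fossology+SPDX,
                # process the JSON reply, and write 'cached' ...
                pass
        finally:
            bb.utils.unlockfile(lock)
    }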
And so, here are the first results I got. From what I've been told, I'm one of the few people who has actually run this outside of UNO. So, the first set of graphs. Hey, I can use a laser pointer. The dark blue is the real time it took to execute the build; in this case it was about 25 minutes. The light green, I guess it is, up here is the user time, and the light blue is the system time. That is our benchmark. As soon as I turned on the system so it would call the remote Fossology server and start processing (and I did not do any copyright scanning, I only pulled license information), the user and system time stayed about the same on the build machine, but the real time jumped way up to a little over 200 minutes. Then I activated the copyright scanning, and suddenly we're up around 500 minutes. So it's a very long process; this is not something you would want to do on every single build you ran, unless you were also using the cache. And if you notice, both of the cached results used the output of the caches, and they're almost identical to the no-SPDX run. Like I said, these guys are really good at data transformation, really good at figuring out how to read and process the data, so most of the additional work in the cached cases is data transforms, and it didn't slow down the build process, which I'm actually quite impressed by. I was expecting it to add probably another 10 to 20 percent to the build, and it just didn't.

The first performance enhancement came because I was watching the screen, looking at this thing, and it just did not look like it was doing anything. The first question was: where is the first blockage in the system? So I made an experimental change: I removed all the native packages. If you're unfamiliar, there are really three main types of packages in the system. There are native packages, which are things we build on your host system in order to support cross-compiling. There are target packages, which are the things you're potentially going to ship on the target to a customer. And there are SDK items, which are things you would ship to an application developer. Natives are never shipped in the normal case, so as part of the experiment I removed the natives, because as a developer I don't care about their licensing; I'm not going to give them to somebody else, so I don't have to worry about that part. If we do that, you'll see the graphs drop. With no natives, the license-only run drops a little bit, about 10 percent. The copyright run actually drops pretty significantly; it takes about 40 minutes off the build, and off of 500 minutes, 40 is pretty good. Again, the cached versions don't really change, so I didn't add them to the graph because they're already quick.

Okay, but if you look at the workflow, we are blocking the build as we go through. As soon as it gets to the SPDX generation, it tars up the sources and sends them to the server, and, going back to something that needs to change in their design, it posts the data up to the server, keeps the connection open, and waits for the response to come back. On a 500-minute pass through GCC, it has that connection open for a very long time, and we all know that's not a good thing in the end. They're going to rework that so it pushes the data up, closes the connection, polls occasionally until a response is ready, and then pulls the data down. But the point is that we are blocking the build here, and you can get into deadlock situations. So my change was: instead of doing it all in one shot and blocking, do it as two individual pieces. We still run through the steps in the same order; we just now prepare the source code first, so we know it has not been modified from the patched version, then unblock the build and let it start compiling, and in parallel we send the sources up and wait for the response so that we can process the SPDX information later. A rough sketch of that split follows below.
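Here is what that two-stage split could look like in bbclass form. The task names, ordering, and bodies are my own invention for illustration, not the patch that was actually written:

    # Stage 1: snapshot/tar the patched sources and kick off the upload,
    # then let compilation proceed.
    python do_spdx_prepare () {
        # tar up ${S} (the patched sources) and record where we put it
        pass
    }
    addtask spdx_prepare after do_patch before do_configure

    # Stage 2: collect the scan results without holding up do_compile.
    python do_spdx_collect () {
        # poll the Fossology+SPDX server for results, process the JSON,
        # and write out the .spdx file
        pass
    }
    addtask spdx_collect after do_spdx_prepare before do_build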
What we got from that was a significant drop, much more than I thought it would be: in the copyright case we're suddenly under 400 minutes. That's an amazing difference, and it's almost entirely down to the processing time for GCC and the kernel.

So, my conclusions from this. The biggest problem right now, on the Fossology side, is performance. We build engineers are always trying to make the build faster so that people have to wait less, and the cache is great, but that initial build is painful. Again, Fossology is very, very single-threaded, and somebody's got to step in, give these folks some help, and figure that part out. There are long, long connection times because of the processing overhead, and currently, if the connection closes, the system automatically retries. That means if you have a timeout of four hours and it takes five hours to process something, at four hours the connection closes, the client says "oh, I'm sorry," sends the data back up to the server, and starts over, and it will keep doing that forever until you run out of memory on the server. Not a great thing. Like I said, they understand this; I've explained it to them and they're going to try to fix it, but they're not web server designers, they're data processing people.

On the Yocto Project side, we discovered a problem of our own. We have recipes that do not have archived source code: they have individual files that are copied into the destination location, or they have source code that is generated in place and compiled; very tiny apps. Because they have no upstream software that was unpacked and patched, there is no source archive to be scanned, and you get an error. So we have to do something within the Yocto Project to identify this situation: either key off of it, or change those recipes so that this is no longer a valid way to do things, so that whatever source is going to be compiled gets processed and sent upstream.

And finally, the Yocto Project has a thing called the sstate (shared state) cache, which is basically there to optimize the build: if you haven't made any changes since the last time you built, all it should have to do is unpack a copied version of that component and start as far into the build as it can. The SPDX work does not yet cooperate with the sstate cache, so if you have cached components, it's going to completely skip the SPDX processing and you won't get that information out of the system. We will be working on fixing that sometime in the 1.6 development cycle of the Yocto Project.

Now, future work, and this is probably where most people are really going to start looking at this stuff and making it useful. On the Fossology+SPDX side, they understand that the machine-generated output alone is not good enough.
It's good; it's the starting point, and it's probably good enough for developers to get started with this and make good decisions about whether they can use a component or not. But in my opinion it's not enough to release your software on. So they're going to be generalizing many of the services beyond the Yocto Project so that they could be used by Buildroot, or by SUSE's build system, or Fedora's build system, or whomever's build system; the Yocto Project is just their testing ground. They're also going to be integrating the service into those build systems as they find people to help them do it, or as they find the desire to do it, and this is all part of the overall SPDX initiative.

For the human part of this, they're integrating a system they're currently calling the dashboard. What it does is interface with the Fossology side, get the data back, cache it, and provide a web interface to view the cached data, make manual changes to it, and track those changes. So you can first run your automated system over everything; then you can get your lawyers involved and say, I've picked these 50 packages, these 50 sources, to use, and the lawyers, or whoever, will be able to go to the dashboard, view the material in context, and actually make tweaks, comments, and changes in order to say, this is really what I want. Part of the design is that it doesn't have to be Fossology behind it: the data could come from Black Duck or any other commercial source as well. So you get into a situation where you have a centralized repository of licensing data that helps you check this stuff; the lawyers can flag things and say, don't ever use this piece of software, I can't verify it, I don't accept it, and so on, and it can help developers avoid legal bugs in their software very early in the process.

Obviously, performance improvements are a big thing the Fossology folks need to work on. There is also a global SPDX cache used during processing, and if something goes wrong it just leaves the files there. We can't do that, because if we're running off a RAM disk we're going to run out of space; we need to make sure that as processing occurs we're cleaning up after ourselves. The long connection time should stop being a long connection time: it should be whatever it takes to post the data, whatever it takes to pull the results down, and then we clean up after ourselves. Multi-processing, which I keep mentioning, is something Fossology is going to have to move to, and the Fossology+SPDX integration is going to have to schedule itself better, because it simply takes too long. It's unreasonable unless you just want to start this on a Friday, come back on Monday, and cross your fingers that it's done. Based on those charts, my guess is that if you built a large system like the sato image, it would probably take about three to four days to process all the source code, and starting a build and coming back three days later to see whether what you built worked is not a good way to start a project.

The Yocto Project, on the other hand, has slightly different requirements. We're not pushing SPDX for its own sake; we're pushing traceability of the software. This is my personal vision of what I think the Yocto Project should be doing, and so far I've gotten pretty good buy-in from the Yocto Project folks. The idea, in the end, is that we need end-to-end traceability within the system.
Right now, we can trace the original sources to a license, and that's pretty much it. The Yocto Project has a way to trace packages to binaries, and potentially the binaries back to the overall source that produced them, but it can't do it at the individual level: this binary used these 15 source files, so I can read my SPDX files and know these are the 15 SPDX entries that match my binary. We need that traceability. We also want the traceability that the image contains these packages, the packages contain these binaries, these binaries contain these sources, and these sources carry these licenses, so that you can print out a definitive list: my image contains the following sources; my lawyers have looked at this and said my responsibilities are that I must release these 15 pieces of source code, I must buy five guys beers because that's what the license says, and I've got to notify these other four people that I've shipped their software and provide the offer for source that the GPL requires, and everything else. That's the real end goal: to automate the process of going through it and telling people what they have to do to be compliant with the licensing, based on these SPDX files. We're not going to make a legal determination; that's not the point. It really comes down to giving you the information so that your organization can do what it believes is correct under the law.

Future work in the Yocto Project, or more concrete future work, I should say. We've got to fix the sstate cache support; we've got to make sure we integrate those JSON temp files into the sstate information so that we don't have to regenerate all the way from source, and can instead start at the binary component if there's a change there, or another level up if there's a change there. I've already talked to Richard Purdie, the maintainer of OE-Core, and he's told me how to do it, and to do it quickly, within the 1.6 development cycle. Within the Yocto Project, we may also want to come up with some tools that allow humans to more easily modify and review these SPDX files, because that's very beneficial to us. If you're not very familiar with the Yocto Project, one of our goals is to create tools that anybody in the embedded space can use, not just folks who are using OpenEmbedded and the Yocto Project for distribution creation. So if we can get to a point where our members see that tooling to modify, review, and visualize SPDX correlations and license correlations is coming in, that would be a really good place for us to start working.

To get to the end-to-end picture, we have to have a concrete way to do binary license determinations. Do we come up with some automated tools, or do we allow plugins so that others can come up with automated tools, so that we can say: this binary used these 15 sources, these 15 sources have these licenses, and this is the aggregated license it probably is for that binary? Then we can give you binary-level information, and that goes back to the BusyBox problem, what is the license of BusyBox, where the only way you can make a determination is to figure out what your configuration is and what source files you used (a tiny sketch of that aggregation idea follows below). And then, finally, roll that into the image: these are the binaries I've included, these are the licenses of those binaries, this is now in my image, and these are probably my obligations; then you have the information you need to figure that out. And again, more tools: we need tools to work with SPDX files, and right now the tools are a text editor or a tag-value set; some people have some Excel scripts, which is great for the lawyers, but I don't want to use them.
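A minimal sketch of that aggregation step, assuming you already have a mapping from a binary to the source files it was built from and per-file license conclusions out of the SPDX data; both of those inputs are the hard part, and all of the names and values here are made up for illustration:

    # Given: binary -> list of source files, and source file -> set of licenses.
    # Produce the aggregate set of licenses a binary inherits from its sources.
    from typing import Dict, List, Set

    def binary_license_set(sources_of_binary: List[str],
                           file_licenses: Dict[str, Set[str]]) -> Set[str]:
        """Union of every license concluded for every source file of one binary."""
        licenses: Set[str] = set()
        for src in sources_of_binary:
            licenses |= file_licenses.get(src, {"NOASSERTION"})
        return licenses

    # Example in the spirit of the BusyBox discussion: three files, three answers.
    file_licenses = {
        "signal.c": {"GPL-2.0"},
        "run_shell.c": {"MIT"},
        "math.c": {"GPL-2.0", "BSD-3-Clause", "MIT"},
    }
    print(binary_license_set(["signal.c", "run_shell.c", "math.c"], file_licenses))
    # the binary inherits GPL-2.0, BSD-3-Clause and MIT obligations together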
These are all things the Yocto Project can be involved with, but I'd like to see others work on them as well. This is just as valid in the workstation environment, in the server environment, and for other embedded systems, even commercial systems, because it's the common language for specifying these things. And that's it; does anybody have any questions about the licensing issues?

This presentation will be available; I've already sent it to the ELC folks, so they should be posting it shortly if they haven't already, and I wanted to make sure it goes out. If you have any questions, you can certainly email me. I'm happy to answer questions about the technical side and give you opinions (I can only give you opinions, because I'm not a lawyer), but I've been doing this type of work for a couple of different companies, helping lawyers understand what open source people mean when they write certain things, so that the lawyers can determine what they need to figure out, what they need to learn, and how they may need to handle this stuff in the future.

Yes? The question was: does Fossology do any caching, so that all you have to do is give it hashes and you don't have to send all the files up and wait for the processing? The main Fossology has some mechanisms for sending data up and storing the results; I do not know if they can be retrieved by hash or whether they have to be retrieved by the initial package. The Fossology+SPDX module does not use that mechanism at all; it only uses the underlying scheduler and processing mechanisms, so it is a one-time thing. My understanding is it was designed as a one-time processing mechanism specifically because the people using it may be processing proprietary sources that they don't want cached. Correct, you don't have to cache the sources, but they're erring on the side of caution on the Fossology side. The dashboard, though, should be able to plug into the middle and act as a cache: you'd potentially push either the hashes or the tarball to the dashboard, the dashboard would look up whether those hashes have already been processed, short-circuit the scan, and return potentially human-modified data, which is the goal. But the dashboard only exists right now as a static web page, just to get an idea of what people would like. I have no idea what their schedule is for implementing it, unfortunately, but I'm guessing it will be a year or so.

Yes? Oh, sorry, the question is whether this, the Fossology piece, is similar to Black Duck. I have never used Black Duck myself, so take my comments accordingly. Have you? Okay, there's a mic right beside you; you can help me answer this question. [Audience member:] Black Duck matches source code to other source code and then tells you if you have that. [Mark:] Fossology, if you've ever used any of the Bayesian filters for spam filtering and things like that, does a lot of that type of identification. [Audience member:] What's it spitting out, exactly? [Mark:] I don't even know what format it finally spits out, because I don't see it until it gets to the SPDX format. Yeah, absolutely, I'll let you answer. So, I'll repeat that to make sure everybody heard: basically, Fossology's purpose is to scan, to scan for keywords and for other things that look like a license or look like a copyright.
Yeah, I think that's the best I've heard it described, because I've always had a good idea of what Black Duck does and how it does it, but I've never heard anybody explain it quite that well. Basically, he said Black Duck is an anti-plagiarism tool, and I think that's an excellent summary, very short and sweet compared to some I've heard. The thing is, there are cases, especially when processing proprietary software through this mechanism, where Black Duck would be excellent, because it would be able to look for those telltale signs of plagiarism and add that to the SPDX information. Sometimes the plagiarism is acceptable (it's public domain, it's BSD, whatever), but as a lawyer you want that information pulled in so you can make the determination that your engineers did the work correctly, or at least legally. Then you can augment that with something like Fossology, and with the manual reviews, to get the actual license determination and the information the lawyer needs to make the right call legally for product shipment or whatever.

Every time I talk about this stuff, I like to say that a legal problem is no different from any technical bug: it's a bug if the system isn't put together right, and the only way to make it a cheap bug is to find these problems early in your development and make smart decisions as a developer, so you avoid the problems later on. The lawyers never see the product until the product is supposed to ship; unfortunately, that just seems to be the way it works, just like the documentation people who don't get the notes until the day before it's supposed to ship. So my opinion is: let's empower the engineers to make smart decisions and help them explain to the lawyers, in lawyer language, why they made the technical decisions they did, because when an engineer talks to a lawyer, nobody's speaking the same language. If you can do that, and you can avoid introducing these bugs into your system early, it's going to save money, time, effort, and a lot of annoyance when you would otherwise have to reimplement 15 APIs with a week left before you're supposed to ship. So I see this as one small piece of the tooling, but it's the first step we have to take to get to that next level of processing.

Yes, you in the back. The question was about supporting human-generated SPDX. Earlier today, in the Yocto Project birds-of-a-feather session, somebody stood up and said, hey, I'm working on U-Boot and we're trying to put this stuff in the headers of the files, and that's exactly what we want. The ultimate goal of the SPDX community is for people to tag their source code in a way that lets us take the tags right out of the source and generate the SPDX, with no pattern matching and none of this fuzzy logic: we know that is what the author intended, and we know that everybody who has patched that source code intended it too. It turns the problem of having to review the code into the problem of simply reviewing the licenses, and that is a much easier problem to solve in the end. So if somebody like U-Boot starts this process and comes up with something very useful, then hopefully others will look at it and go, yeah, I can do that too, I can add five lines to every file because I'm already adding a license statement anyway, and it just becomes a lot easier in the end.
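The kind of header tagging U-Boot adopted is the SPDX-License-Identifier convention. Here is a sketch of what such a tag looks like and a trivial way to harvest it; the harvesting snippet is my own illustration, not a tool from either project:

    # A tagged source file carries a line near the top such as:
    #   /* SPDX-License-Identifier: GPL-2.0+ */
    # Harvesting those tags is trivial compared to fuzzy license scanning.
    import re

    TAG = re.compile(r"SPDX-License-Identifier:\s*([\w.\-+() ]+)")

    def spdx_tag(path):
        """Return the declared SPDX identifier from a file header, if present."""
        with open(path, errors="replace") as f:
            head = f.read(2048)          # tags live at the top of the file
        m = TAG.search(head)
        return m.group(1).strip() if m else None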
Next question: how do you handle linking with a library that could potentially change the license? That's not handled in what's been done so far, because what's been done so far is almost purely about the source code. One of the next steps is to make that determination of what source a binary is made up of, and there are two linkage cases. One is the dynamic link case, and that will simply require tools that say, hey, this thing is dynamically linking to this other thing, and note it in some fashion. The hard case is when this thing is statically linking to that other thing, and I need to know that so I can determine whether my binary is okay. There are some tricks we can use to determine what a binary is statically linked against, and as an engineer who is used to debugging software, you've probably used them before: DWARF symbols. Your DWARF debug information already has a reference, for every line of code you compiled, back to the sources. If we read those DWARF symbols out and say this binary used these 15 sources, we can correlate those sources back to the original source files, correlate those to the SHAs in the SPDX files, and determine what the license of the binary is, and that includes conflicting licenses. A small sketch of pulling those source references out of a binary follows after this answer.

I don't know if you will ever see a tool, and at least I won't be creating one, for multiple reasons, that would tell you "hey, your licenses conflict," but this would give the lawyers the information to make that determination. If you're not familiar with it, in the US, if you're not a lawyer, you basically can't give legal advice or you can be sued, and this is especially true for companies. So if somebody creates a tool in the US that says these two licenses conflict, or, more to the point, that these two licenses don't conflict, and somebody relies on that tool, the author is potentially liable. So you won't see a tool like that from us, from the company I work for, or from me personally; it's just not worth the risk. But I will give you all of the data you need to either write the tool yourself, or give it to a lawyer and have them review it and make the determination on your behalf. You likely won't see a tool that says these two licenses are incompatible, or, more to the point, that they are compatible, because that is making a legal determination, and in the US "might be fine" is not good enough when it comes to lawyers. Exactly. I would not be surprised, though, if you did see programmable tools that go in and say: okay, my detected or concluded field says GPL and BSD and proprietary; wait a second, there are two licenses here that I have programmed, for my organization, as a red flag. I would expect tools like that to show up, and people to commercialize them and everything else, but the determination of which combinations are or are not allowed would probably, at least in the US, end up being made either by the organization's lawyer or by some kind of legal community within the organization, people who are willing to make the legal determination and take the liability for their answers. So it's a problem, and it's a problem where we as engineers go, this is simple, we can just do pattern matching and AND the ORs and all the rest of it; it's easy for us, but it's not so easy when it comes to liability. Any other questions?
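Here is the kind of thing I mean by reading the source references out of the debug info. This sketch uses the pyelftools library and only lists the compile units named in the DWARF data, which is just one way you might start; it is not tooling that the Yocto Project actually ships:

    # List the source files (compile units) recorded in a binary's DWARF info.
    # pip install pyelftools; run as: python dwarf_sources.py ./busybox
    import sys
    from elftools.elf.elffile import ELFFile

    def dwarf_source_files(path):
        """Yield the source path named by each DWARF compile unit in an ELF file."""
        with open(path, 'rb') as f:
            elf = ELFFile(f)
            if not elf.has_dwarf_info():
                return
            dwarf = elf.get_dwarf_info()
            for cu in dwarf.iter_CUs():
                # the top DIE of a compile unit names the source file it came from
                yield cu.get_top_DIE().get_full_path()

    if __name__ == '__main__':
        for src in dwarf_source_files(sys.argv[1]):
            print(src)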
Yes? The question was about patched sources and the licensing of patches: the license of the patch itself may not be in the patched sources anymore, only in the patch. I have seen patches that have licenses applied in two ways. One way, which I consider incorrect, is in the header of the patch. The correct way, if you are adding a license or a copyright, is to actually have the copyright or license information as part of the code change, and that's why Fossology, and the way we've set up the system, processes the sources after the patch is applied. If the patch adds a copyright statement, or changes the license from GPL-2.0 to GPL-3.0, we catch the patched version of it, because the patched version is what we are going to compile into the binary. Yes, we only catch one of the two cases, and from an organizational review standpoint it's important, when you review the patches being checked in, to catch it when a patch makes a thing GPL version 3.

In the Yocto Project, we would never allow somebody to make a patch like that. The most we have done within OE-Core and the Yocto Project is this: we had a component that was listed as GPL version 3 in the software; it was a component we wanted to use, and we did not want the GPL version 3 terms on it. We contacted the original author and asked, are we allowed to change this one component, can we get permission from you? There was only one author, so we didn't have to go to a whole community, and the person said, I understand why you want to do this, yes, you have permission, you may use it as GPL version 2. We sent him a copy of the patch we created; the header of the patch is the email conversation granting us permission, and the patch itself actually changes the line in the license file to GPL v2. He reviewed it, signed off on it, and that's what got checked in. So not only did we add the comment explaining why we were allowed to do it, we also had the actual change in the code. That's the only case I know of where we had a license change within the sources via a patch. The common case, though, is somebody backporting something from a newer version; in that case, they really are responsible for backporting the license statement as well, if the license or the copyrights changed or were revised, and that becomes more of a procedural point for the developer. There is nothing perfect about it, and this is one of the cases where something like Black Duck would be very good, because it could look at the patches and say, hey, this piece over here potentially came from somewhere else, and find that type of plagiarism. It's not really plagiarism, because it's probably documented, but the idea is that you would be able to determine that this came from a GPL v3 source and was inserted into GPL v2 code, and that it needs to be investigated by somebody.

So we are almost at the end of the session. If anybody has any further questions or wants to talk one-on-one, they've scheduled me down at the booth to answer individual questions; they're calling it a chalk talk, but basically I'm going to stand there, and if you have any questions about this, or the Yocto Project, or anything else, I'm happy to answer them. So I'll see you guys down there in a few minutes if you have any questions.