Good morning, everybody. Thank you for coming to the early morning talk. My name is Mark Charlebois. I'm a director of engineering in corporate R&D at Qualcomm. This is Reshmi Chitrakar from Qualcomm; she's a senior staff engineer working in the Qualcomm open source technology group. We wanted to talk about some work that's been going on between our two groups on SPDX generation via Yocto and a new open source license scanner called LID. Just so everybody's familiar with the terms, since I don't know if everyone knows all the things I'm going to talk about, I thought I would go through some of the different projects related to what motivated this. Then I'll talk about some of the current limitations in Yocto scanning, a new source map layer I created to address some of the limitations I found, and some issues around reducing scan times overall. Then Reshmi will talk about LID and license scanning in general, FOSSology, and some of the work that was done to compare FOSSology with LID. And then we can talk about the directions we want to go, take any questions, and hopefully figure out a good way forward.

Some of the terms: there's SPDX, which, if people aren't familiar with it, is the standard format for documenting license information for files and packages. FOSSology is a tool that's been used to do this license scanning; it was originally from HP and is now a project at the Linux Foundation. Then there's OpenEmbedded, which is not a binary distribution like Ubuntu, Debian, or SUSE; it's a source-based distribution where you have recipes and you build the entire distribution. It uses a tool called BitBake, which was derived from the Gentoo Portage system for building from source. Then there's the Yocto Project, which I'm sure most of you are familiar with; it's layered on top of OpenEmbedded and is a Linux Foundation collaborative project that, in addition to using OpenEmbedded, provides SPDX support via a plugin, the spdx.bbclass, which currently uses FOSSology. And I am currently the technical steering committee chair of Dronecode, which is another Linux Foundation collaborative project. What motivated me to start all this was that Dronecode wants to release pre-built flash images that you can put on a drone, right from the project's website. In order to do that, we want to do license scanning, so we can make sure the images are compliant and we're not violating any licenses. This is a small organization without a big budget, so what we're looking for is tooling that allows an organization like that to do a license scan and then have confidence in releasing the binaries.

The existing do_spdx task in the spdx.bbclass that's in Yocto today is fairly simple. It inserts this do_spdx stage during the build: when you BitBake a package, it goes through fetch, unpack, patch, configure, compile, and so forth. do_spdx inserts itself after the patch stage, looks at the source directory, and then hooks up with FOSSology to do a scan of all the files it finds there. But it doesn't really capture any package dependencies. It will generate an SPDX report for each package as it goes through, but it doesn't chain the package relationships together. It also doesn't capture any build artifacts. And it was very, very difficult to set up.
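Conceptually, though, the hook itself is small. Here's a rough sketch of the shape of a post-patch scan step, with the scanner backend stubbed out; the paths and function names are illustrative, not the actual spdx.bbclass code:

```python
# Conceptual sketch of what a post-patch scan hook does; this is NOT the
# actual spdx.bbclass code. The scanner backend is left as a stub.
import hashlib
import os

def scan_source_tree(src_dir, scan_file):
    """Walk the patched source tree and hand every file to a scanner."""
    results = {}
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            # scan_file is whichever backend gets wired in (FOSSology, LID, ...)
            results[path] = {"sha256": digest, "licenses": scan_file(path)}
    return results

# Usage with a trivial stub backend (the work directory path is made up):
if __name__ == "__main__":
    report = scan_source_tree("tmp/work/foo-1.0/foo-1.0",
                              scan_file=lambda p: [])
```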
I tried to set up FOSSology to work with this, and I don't think the version of FOSSology that it uses is actually supported anymore. I don't think the plugin itself is actively supported either. And it used some homegrown method to generate the SPDX instead of using the supported SPDX tools from the SPDX project.

This may seem a little tangential, but I'm going to get there: in doing this, I came across something called the archiver.bbclass in Yocto. Its purpose seems to be that when you are releasing an image that contains something like GPL code, and you have requirements to release the source files related to your image, this is a way to package up all the source related to those files so that you can distribute it for compliance. So it's not really about license scanning the way it was intended, but it may actually be useful toward implementing something for license compliance. After the unpack stage, it archives all the original code that's there before any patching. Then at the patch stage, it adds a bunch of other steps to unpack and patch the source code. It has different ways it can bundle the source: it can save everything as the original source plus a set of patches, or it can save just the patched source, or it can save the configured source; there are different options you can set. And then in deploy_archives, it puts these saved packages into the build's deploy directory; it can also generate a source package if you want and save that there as well. The nice thing is it also saves the recipes themselves. Often you are saving the source, and you may be saving the additional files being added, like patches or config files that you would be scanning, but you're usually not actually scanning the recipe itself. The nice thing about this class is that it archives not only the recipe but any include files and anything else that gets pulled into the creation of the package.

So again, some of the limitations of Yocto-based code scanning today. The spdx.bbclass only scans the source directory after patching. One problem with that is you don't get a lot of leverage. If, for instance, you have an upstream source package and you had scans of just the pristine package, and then you had a bunch of different Yocto builds that patch that upstream package, or a custom build that not only took the patches for that package but added your own bbappends and patched it even more, you don't have a baseline you can start from and then scan only the differences. What you have in this system is a scan of one particular custom patched version of the upstream package. The archiver.bbclass provides a great way to store original source and then provide that source to customers for license compliance, but it's not really intended or integrated in a way that's useful for code scanning right now. The fact that it does archive the build information is very useful, because I think that should be part of the scan as well. And the FOSSology integration is certainly very, very difficult to use; I would say it's not maintained, and it would have to be actively maintained for this to be a viable way forward. I know some people have hooked up DoSOCSv2 instead of FOSSology, but that's something I have not personally tried.

So the new layer I added is this source map layer, and the intention is that the scan is structured differently: not just one scan of the patched Yocto package.
Let's take something simple like gettext. It wouldn't be a scan of the patched version of gettext that Yocto builds. It would be a scan of the tarball for gettext that you get from upstream, with the package-specific information for that, plus a scan of the Yocto gettext .bb file and all the associated files it brings with it, as its own package with the other as a dependency. That allows anything that uses gettext in the future to leverage all of the scan information from the pristine upstream package, and then you would only be scanning the delta for the changes you've made.

Initially, the do_spdx approach was a source-based scan, and my initial version of this was also just scanning the source. But I had conversations with Kate from the SPDX project, and I learned that under the SPDX 2.1 spec there are some additional things that are needed: the artifacts that are built by the package need to be there, and the relationships between particular artifacts, like from an executable to a dynamic library, also need to be captured. So I'm going to have to go through a build phase, which I was hoping not to have to do, but I can't be compliant with the 2.1 spec otherwise.

The other problem with do_spdx was that it did everything in the context of the Yocto build, which meant it was very hard to parallelize and very hard to do any kind of debugging or analysis as it was going on. The intention of this layer is to make it all out of band: you go through the build, it generates all the metadata, it indexes where all the source files are, it keeps pristine versions of the source files and then the patched versions, and then you can run the scanner on all of that code afterwards. You can run it as a parallel job, you can throw EC2 instances at it; you can basically take what could be up to six days and reduce it down to whatever amount of time you want, parallelizing the hashing and other things as well. The approach is general, so it could be used with any code scanner; it's not necessarily tied to LID, but it was designed to work with LID.

So the initial approach was to insert, again, a step after the unpack stage to take the original source, save the dependency information, and unpack the source somewhere so it can be scanned, and then go through the patch stage. Instead of scanning the patch files themselves for license information, it makes much more sense, and it's more accurate, to scan the patched file, the entire file that's already been patched. So at the next stage it generates a list and index of the actual files that were patched, and it will only scan those as part of the package, along with the license and recipe information and other things I intend to add. There is then also a do_sourcemap_all task, so you can just run that and it traverses all of the different packages related to building a particular package. For instance, if you had a package group that you wanted to build, or a final image, it would go through all of the related dependencies of the image and run these additional stages. The way it works is you run bitbake -c with this stage, the source map all task, for a particular package, package group, or image. That generates a bunch of metadata recording where the source is for all the different packages, along with the package information.
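To give a feel for the out-of-band split, here is a minimal sketch of the indexing half: hash every file under a source tree in parallel and write a JSON index that an external scanner can consume later. The file layout and key names are assumptions for illustration, not the layer's actual format:

```python
# Minimal sketch of the indexing half of the out-of-band approach: hash
# every file under a source tree in parallel and write a JSON index that
# an external scanner can pick up later.
import hashlib
import json
import os
from multiprocessing import Pool

def hash_file(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return path, h.hexdigest()

def write_index(src_dir, index_path, workers=8):
    paths = [os.path.join(r, n)
             for r, _dirs, names in os.walk(src_dir) for n in names]
    with Pool(workers) as pool:          # parallelize the hashing itself
        entries = dict(pool.map(hash_file, paths))
    with open(index_path, "w") as out:
        json.dump({"source_dir": src_dir, "files": entries}, out, indent=2)
```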
Then, after that's all done, there's a script you can run, the source map post-process script, which works with the license identifier part; that's what hooks into LID. It generates license metadata, all in an intermediate metadata form, and then you can run a script that takes that information and generates SPDX out of it. Or it could generate, for instance, an Excel file, if that's what a customer wanted instead of SPDX.

The thing I'm going to have to change is that the source map all step is now going to have to run after, probably, the do_package_qa stage. In order to add the dependency information for something that was built, and the relationships of dynamic libraries: at the Yocto layer you have an RDEPENDS variable, which says this package run-time-depends on that other package, meaning the other package provides a library this package actually needs at run time. But that's at the package level. SPDX wants to know that this particular file in here has a link to a dynamic library satisfied by that other file in another package, and that's the level of granularity I need. The do_package_qa phase of a Yocto build, at least in Morty and later releases, will actually go through and check whether a dynamic dependency of an ELF file is satisfied by other packages in the build. So I'm hoping I can hook into that directly to generate those relationships for SPDX.

The other problem is that when you go through and generate all of this information, license information for every file, you can get thousands and thousands of lines of output, and I can show it here: here's the stack of files you could go through, which is not necessarily very useful to someone who doesn't have a team of lawyers to read through it all. Typically what's digestible is something more at the package level, I find. So I've looked at this, and in my own output (you probably can't read this on the slide) what it has is a target, in this case gettext-native, which has the information about the BitBake file in Yocto for gettext. It says what the licenses are, GPLv3 and LGPL-2.1+. It says what it provides, gettext-native, and that it satisfies a virtual dependency, which is actually not captured in Yocto. If I have packages that provide virtual dependencies, there is no way to specify that in Yocto. For instance, I could have kernel headers or a libc or something like that, which many different things could satisfy, and you could swap them out, but there's no way to specify that in Yocto today that I'm aware of. It then specifies what the other dependencies on other packages are. I summarize what the licenses of those packages are, so from a very, very small view you can see: okay, this package depends on a bunch of other packages, and those are their licenses, and I can kind of eyeball that it probably looks fine; those licenses all work together and there are no major red flags telling me I have incompatible licenses. It also tells me what the licenses of the patched files are, and then it goes on from there.
I have another entry here called download, and that's the upstream gettext package, probably from GitHub, or, actually, there it is: it's ftp.gnu.org. So it has the URI for where that package came from, where I have the source file, then an unpacked repository for the source, and then the hash of that particular thing. Then there would be all of the file-level information associated with that upstream tarball. So the top-level package has all of the license information for the dependencies and the patched files, and the other entry has just the pristine upstream project and the files associated with it.

So why do I want to make that separation? One reason is reducing scan times, so that the many, many people doing this don't all have to scan the same files over and over. It's very expensive: license scanning can actually take days if you have a very, very accurate scanner. And when you want to get a product out and make a release, you don't want a six-day wait before you can actually ship. What would make sense is to create a commons of scanned files, so that all these upstream packages being built in Yocto don't get scanned by everybody at the leaves of the entire ecosystem; everyone can leverage a commons of pre-scanned information.

But then, what data do we need to capture per file or per package? At the file level, we obviously need the file hash, but then we need to know the file type. Some people have looked at MIME types, but with MIME you still need to know whether a file is binary or not. Some people want to know if it's binary, but what is a binary file? Some of the checkers out there will say that a UTF-16 file is a binary file, and then you wouldn't actually parse it for license information. There's something Thomas Gleixner was using called Pygments, which he found was actually fairly accurate for classifying files. Then, for the license information: when you're scanning a file, say you find an SPDX license identifier, but you also find license text, and what if they don't correspond to each other? Or what if there's a license conflict? You need to know that. You want to know the confidence of the license determination based on the information you scanned. You may want to know the region in the file where you found the license. And then, what is the license name, and is there an SPDX identifier associated with it? All that metadata should be captured. Then there's the file path of that particular file within the package you're scanning, and the context of the package name. The problem is that you could have, say, package A with file A inside it, and package B with the same file A inside it. The files have the same hashes, but the two packages have different licenses, and there's no license marking inside the file itself. So where you got the file from matters: you can't just collapse everything down to one hash without the context of the package it came from, because what you're getting is the file and the license for that file in the context of the package, since the package license is what now applies to the file.
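Pulling those fields together, a per-file record might look something like the sketch below; the field names are illustrative assumptions, not an established schema:

```python
# Illustrative per-file record for the metadata discussed above; these
# field names are assumptions for the sketch, not an established schema.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class LicenseHit:
    name: str                          # e.g. "GPL-2.0"
    spdx_id: Optional[str]             # SPDX identifier, if one applies
    confidence: float                  # scanner's similarity/confidence
    region: Optional[Tuple[int, int]]  # (start_line, end_line) of the match
    conflict: bool = False             # identifier and license text disagree

@dataclass
class FileRecord:
    sha256: str                        # file hash
    file_type: str                     # MIME type / binary-vs-text class
    path_in_package: str               # where the file sits in the package
    package: str                       # package context: the same hash can
                                       # carry different licenses in
                                       # different packages
    hits: List[LicenseHit] = field(default_factory=list)
```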
So we want to make sure we're capturing that license information for the case where there isn't license information in the file itself. The other part, then, is package metadata: we need a hash of the package. Hopefully, once all the per-file discrepancies are worked out, all people are going to care about is the hash at the package level: am I using the same hash? Then I'm good to go. For the license information there are, again, similar issues, but we also need license information from the top level: the LICENSE file, the COPYING file, the NOTICE file, the README, anything else that's in there, and we need to make sure there's no conflicting information. Then the package URL, the package name, the version of the package, all the source files associated with it, and obviously the build artifacts that are there and the relationships between the build artifacts and other packages. So I will pass it on to Reshmi to talk about the LID scanner, FOSSology, and everything related to that.

Perfect, thank you, Mark. Cool, so we're now in the realm of scanning, right? Let's start with some background and the motivation that led to building LID the way it is right now. We looked at two scanners from the FOSSology offering: Nomos and Monk. A little background for anyone who's not aware. Nomos is a regular-expression-based snippet-matching tool. Think of the regular expressions as license patterns that you would find in source code files, and you get a hit against these patterns. Only snippets are matched, and occasionally, even if exact strings are not matched, Nomos does a pretty good job of saying: hey, something smells like a license here, go figure; but I can't tell you which one. We found that it's pretty accurate at detecting your common open source license types, about 80% of them, but as far as verbatim SPDX coverage goes, as of our evaluation in late December of 2015, it covered only two-thirds of the SPDX licenses verbatim. We then went ahead and created a real-world evaluation data set. This contained Qualcomm proprietary code, with some files containing open source license text, sometimes standard SPDX licenses, sometimes non-SPDX licenses. Nomos did pretty well in that case: it caught about 94% of the open source licenses in our real-world evaluation set. The challenges we found with Nomos: adding new licenses is not straightforward, in that you need to add a new regex rule and recompile the underlying C library so that you can rerun against the new rule. Then there's handling corner cases; this is a regular-expression-based tool, so any deviations from the license patterns, any unexpected characters, are not caught. And computationally, regex matching is not cheap. So, for good reason, the Nomos folks didn't account for every deviation from the standard license patterns, to keep it computationally reasonable, but that has the downside of not catching certain licenses and deviations.

On to Monk. Monk is a sequence-of-words matching tool, so it's pretty much built to catch full license text, but if there are any deviations, there's a configuration you can set that says: skip X amount of words before bailing out on the match and saying, I don't know what license this is. Like I said, it does full license matching, but as of our evaluation it had pretty low coverage: only about 20% of the SPDX licenses were caught.
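To make the two matching styles concrete, here are toy sketches of a Nomos-style regex rule and a Monk-style in-order word match with a skip budget. The patterns and parameters are invented for illustration; they are not either tool's actual rules or code:

```python
# Toy sketches of the two matching styles; patterns and parameters are
# made up for illustration, not Nomos's or Monk's actual rules.
import re

# Nomos-style: one regex rule per license. Adding a license means adding
# a rule (and, in Nomos, recompiling); wording the rule doesn't
# anticipate is simply not caught.
RULES = [
    ("GPL-2.0", re.compile(r"GNU General Public License.{0,40}version 2",
                           re.IGNORECASE | re.DOTALL)),
    ("MIT", re.compile(r"Permission is hereby granted, free of charge",
                       re.IGNORECASE)),
]

def match_snippets(text):
    """Return the names of every rule whose pattern appears in the text."""
    return [name for name, pat in RULES if pat.search(text)]

# Monk-style: require the template's words in order, spending a small
# budget of skipped words before bailing out of the match.
def sequence_match(file_words, template_words, max_skips=2):
    for start in range(len(file_words)):
        i, skips, ok = start, 0, True
        for word in template_words:
            # advance past unexpected words while budget remains
            while (i < len(file_words) and file_words[i] != word
                   and skips < max_skips):
                i, skips = i + 1, skips + 1
            if i < len(file_words) and file_words[i] == word:
                i += 1
            else:
                ok = False
                break
        if ok:
            return True
    return False
```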
If you look at this Venn diagram: out of the total of 292 SPDX licenses as of our evaluation, Nomos caught about 207, Monk caught 85, and neither caught 83 SPDX licenses.

All right, so here are the goals with which we built LID, which is a license identifier tool. We wanted to be able to catch open source licenses; we absolutely wanted to catch everything that the SPDX organization publishes as standardized licenses, headers and exceptions included. And we wanted to catch full license text in source code. We wanted to keep it easy to set up and keep updated, because to stay relevant you want to be able to bring in the latest licenses as recognized by the organization, SPDX in this case, and you want to be able to do that easily. We wanted it to cater to different applications: some applications may be sensitive, as in, I'd rather have false positives, give me everything that smells like a license; others might say, you know what, I don't want false positives, just give me the real hits. So we wanted it to be tunable to the tolerance of different applications. Of course, our goal was to aid license compliance due diligence. And finally, we wanted to be able to generate SPDX out of whatever we found in source code.

So what does LID do? It scans source code and identifies the matched licenses and the license regions within the source. Like I said, we use the standardized SPDX templates. We also support headers and exceptions as published on the spdx.org website. And we also allow you to add your own custom templates: if your organization has certain deviations, and that happens all the time, you can add custom templates so it knows to recognize those patterns.

Underneath, what's the secret sauce? We use natural language processing's bag-of-words approach. We pretty much break up your templates into unigrams, bigrams, and trigrams, and that's our training set of terms to look for. We then compute something called a Jaccard index, which is math for how similar two sets are. I have a source code file and I have my training set of templates. The Jaccard index pretty much says: if A and B are my two sets, it's A intersection B divided by A union B, so what's common to A and B, divided by the entire set of terms across A and B. That gives you a sense of similarity, and it generates what we call a score. In addition to that, we use a weighted distribution, because longer n-grams carry more weight than unigrams: in a sentence like "a rose is red", the words "rose is red" appearing together carry more weight than just "rose" or "is" or "red". Using this logic we compute a score, and this is where you can configure a threshold: if 1.0 is a perfect match, where the license text in the source code file is exactly the same as the template, then you can say, for my application, I'll set it to 0.06, and we found anything above 0.06 gets pretty accurate. But we have applications that go all the way down to 0.04, which gets you a few false positives but catches more rather than less, based on the needs of the application. In terms of detecting the license text region, that's part two of the algorithm: part one is detecting which license it is, part two is finding where it lives in the source code.
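Here is a minimal sketch of part one, the scoring step, assuming a simple whitespace tokenizer and an illustrative weighting; LID's actual tokenization and weights may differ:

```python
# Minimal sketch of the scoring step: break text into n-grams, compute a
# Jaccard index per n-gram size, and combine with weights. The tokenizer
# and the weights are illustrative, not LID's actual ones.
def ngrams(text, n):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    # len(A & B) / len(A | B), i.e. intersection over union
    return len(a & b) / len(a | b) if (a | b) else 0.0

def score(source_text, template_text, weights=(1.0, 2.0, 3.0)):
    """Weighted similarity over unigrams, bigrams, and trigrams."""
    total = 0.0
    for n, w in zip((1, 2, 3), weights):
        total += w * jaccard(ngrams(source_text, n),
                             ngrams(template_text, n))
    return total / sum(weights)

# A run then compares score(file_text, template) against a tuned
# threshold, e.g. 0.06 for accuracy-first applications or 0.04 for
# recall-first ones.
```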
For part two, we use an edit distance metric, also called the Levenshtein distance: pretty much, what does it take to transform one string into another? You can read up more on how Levenshtein distance works, but that's how we figure out an optimal start and end position within a source code file. Here's an example of how LID represents hits in a file: the green regions are pretty much where it sticks to the template, and the red are deviations from the template. In this case the match says it's a GPL 2.0 based custom template; this is again something that we built ourselves, not a standard template found on SPDX. It says what lines it caught the match on; the original score is the score I was talking about, the similarity measure; and the region score is the result of the Levenshtein distance.

All right, time for some comparative analysis. Using the real-world evaluation data set I mentioned, containing mixed Qualcomm proprietary code with some standard SPDX licenses, we tried to see how these tools compare. The first criterion was coverage: if you just look at how many SPDX licenses each of these tools detects, LID is of course built on the SPDX license templates, so it catches 100% of the licenses there. At the time of the evaluation, Nomos caught 70% and Monk caught about 29%. Next, accuracy: did it identify the right license and the right region within the source code? We found that for our real-world evaluation data set, LID does about the same as or a little better than Nomos; in this case, 94% accuracy is what we found. I have additional data in our backup slides if you're interested in the exact numbers. Again, our data primarily contained the popular SPDX licenses. In terms of the license text region, LID is built to catch the entire license text; I have some examples in the next few slides. Nomos, again, is built to find snippets, so if you're considering using something like Nomos, you might not be able to generate everything you need for your SPDX file from that. Monk, of course, is built to find full license text, but like I said, the coverage was pretty low when we evaluated it.

All right. The last criterion we used was flexibility. Mark talked a little bit about the difficulty he had using the FOSSology integration with Yocto for Dronecode. Anything you want to add, Mark, on LID in terms of setup? Yeah, super easy. I have a published Docker image that basically sets up all the dependencies for installing and running LID. And as far as integrating it, it was really simple to just call the Python wrapper around it. And you get direct feedback, because you get all the data back; with the FOSSology part, I was running it and I actually couldn't tell whether I was connecting to the database or not. Okay, cool. Yeah, so a lot of the motivations for LID were based on feedback from Mark and some of the challenges he was having with the FOSSology integration; we did have the benefit of standing on the shoulders of the FOSSology integration that way. In terms of adding new licenses, like I said, we have a feature to automatically update licenses from the SPDX license list. Just add the templates you need, add custom templates if you so choose, and you're off to the races.
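Circling back to part two, the region search: below is a crude, brute-force sketch of the edit-distance idea. The windowed scan and the normalization are purely illustrative; LID's actual start/end detection is more refined than this:

```python
# Crude sketch of edit-distance-based region location. Brute force and
# O(n*m) per window, so this only illustrates the idea, not LID's code.
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

def best_region(file_lines, template, window):
    """Slide a window of lines over the file; return (start, end, score)."""
    best = (0, 0, 0.0)
    for start in range(max(1, len(file_lines) - window + 1)):
        chunk = " ".join(file_lines[start:start + window])
        dist = levenshtein(chunk, template)
        sim = 1.0 - dist / max(len(chunk), len(template), 1)  # region score
        if sim > best[2]:
            best = (start, start + window, sim)
    return best
```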
Back on flexibility: with Nomos, again, it's a little bit difficult; you have to add a new regex rule and recompile, so you need to really understand the nitty-gritty of how Nomos works to be able to add new license rules. In the area of parameter tuning, I said that was one of our goals: you can set thresholds on the similarity scores depending on the tolerance of your application. Nomos does not offer any kind of parameter tuning beyond altering your regex patterns as needed, and Monk does allow you to configure how many words to skip before bailing out of a match, but it's not really that intuitive. Finally, scores: if that's of interest to you, our tool does return how similar the texts are, what confidence level we have in a hit. We're actually working to convert that into a rating scale so that your non-technical users can say, on a scale of one to five, how much am I feeling this hit?

Yeah, so I'll repeat the question for the purpose of the recording. The question was: is there a way to set confidence levels per license family, because MIT has a lot more flavors than your GPL family of licenses? Not at this time. Right now you set it on a per-run basis, so you can say: for this run, for this project, here's the confidence level. But that's interesting, and I want to explore it, because we've found cases where the BSD 2-Clause and 3-Clause licenses will trip it up a little bit because they're so similar. I just got a bug right before I came to this conference that I have to start looking at, so maybe setting a confidence base for a family of licenses might be the way to go for cases like that. Thank you, thank you for that.

All right, a quick example. We took an example package and used DoSOCSv2 to generate SPDX using its underlying default agent, which is Nomos at this point. It found the five licenses mentioned there, and that's the example SPDX output that DoSOCSv2 spits out. One thing you see, because it uses Nomos, is that it does not emit your entire license text; that's the nature of the Nomos logic. But if you look at the same package using LID, it spits out that output: the license text is complete, and you could choose to use it directly for a product distribution. It also caught additional licenses on top of what Nomos did in the same DoSOCSv2 integration.

All right, so we're at the point where we want to start talking about the status and the future work we see in this space. I'll start by saying this is available as a project on our CodeAurora Forum site through Qualcomm, but we are working to make it available through GitHub so that we can get a lot more collaboration going; that's coming in a few weeks. If you want to check it out, that's where it is. It's distributed under the BSD 3-Clause (New BSD) license. So yeah, please try it out and give us feedback, and I'll let Mark talk about this. There is in fact a fork of it on GitHub right now, so if somebody wanted to make a contribution, make a comment, or file an issue, they can certainly do it there as well, because one of the maintainers is the one who made the fork. And the initial source map layer stuff that I've been working on is on my GitHub account: github.com/mcharleb.
There are also Dockerfiles there that will set up the entire environment for running LID, and will pull LID from CodeAurora Forum (CAF), so you can set up a simple environment to run it, and it shows all the dependencies required. The next things to do are the source map and LID integration. I didn't go too far down that road because I was still getting input from Kate and others; we just presented this at the Open Source Leadership Summit last week, and we got some good feedback there as well. I'm looking at re-architecting it, and I've already started to put it on top of the archiver.bbclass so that I can take the stored code. Right now I'm creating a copy of the archiver.bbclass. The problem was, and I don't know how many Yocto people are here, that I can't actually figure out what the license of the archiver.bbclass is, because the license information for Yocto is very, very unclear. It gives a license for what "metadata" carries, and it says other things: things that are metadata have the MIT license and other things have a GPL license, but it doesn't clarify what metadata is. I think they mean that the .bb files are metadata and the .bbclass files are metadata, but it's very unclear, yes. The license needs to be clarified for people to actually stand on it, because the way it's written you'd have to interpret what metadata means. Okay, well, Python code in a recipe is not "metadata" in most people's definition of the word. Where's that clarified? Okay, okay, great.

Build artifacts need to be added, so I'm trying to figure out how to hook into the QA phase of the packaging, the package QA. Do you have a question? Behan? Okay. Then there's creating an optimal scan setup, being able to parallelize as much of this as possible. LID does parallelization, but not distributed computing: it will parallelize on one machine, using a Python framework to run the analysis in parallel threads when you give it a directory or a set of files. But we need a way to spawn multiple LID instances across multiple machines to be able to crunch through this much faster. And then there's using the license information that's already there in Yocto but isn't really part of any license scanner right now: Yocto will generate license information for you and put it in the deploy directories, and you can have it all there, but it's not integrated in any way into the license scanner output currently.

Yeah, Behan, go ahead. libmagic? No. So the question was about identifying files and MIME types and that kind of thing; I had listed some options, and Behan suggested looking at libmagic. Yeah, we use that for some of our MIME type determination in the other scanning we do with our code scan, so I like libmagic quite a bit. It also lets you read the source code, almost the full file if you want, to determine the MIME type beyond just extensions and things like that. That's great to hear. Yeah, yep. So I'm just going to repeat what he said so people listening to the video get the response: libmagic also uses magic numbers in addition to just MIME types. I love libmagic too; I mean, we use it to determine binary files versus source code for some of the scanning that we do, and I love it. So yeah, please try out libmagic if you want. Yeah. Yeah, Python: we use the Python mimetypes support for that, yeah. So, one of the open challenges, and I'll come back to you, I just want to make one quick point.
One of the open challenges is binary files and scanning, and what to actually do with them. If you identify a file as a binary, an executable, do you scan anything? Do you scan the strings? Do you ignore the file and say, this is a binary file, I'm not scanning it? And once you have file identification and can say, great, this is a file of type X, what do I do with file type X? Do I ignore it? Do I scan it? Do I scan the strings? There are no policies around that right now. Yeah. Correct, right. Definitely helpful; appreciate the input. You had another question? I do again. Yes. Correct.

So what the comment is getting at: if nothing has changed, you should be able to reuse the scan results; they would be stored in shared state, which ties them into the rest of the system, including the shared state mirrors, where you can actually distribute them to an entire team. Correct. So Behan's comment was that there's a shared state facility inside of Yocto today that allows teams to collaborate: when some team member has done a build or a scan or something, if that's captured in shared state, then it doesn't have to be redone by other team members. Correct. And there are ways to use the shared state; I think do_spdx and the archiver class both use shared state. Yeah. But unfortunately that doesn't scale beyond a team, and the problem is there are hundreds and hundreds of companies using Yocto that are all doing the same thing over and over and over.

All right. Oh, incredible; that's really the next thing on our status and future work list: figuring out how to get this commons to be leveraged beyond a team or a company. And then, how do we handle and share manual review of, or changes to, your automated license data? If you have 100 MB worth of SPDX data, do we have a tool, could we build a tool, to review that data?

Yes. So my thoughts on this initially had been a design that would be outside of the build process, so it wouldn't be dependent on finishing the scans in order to finish the build. We could certainly instead create something more like what exists today, where the scan would have to be completed in the context of a build; it just means your build might take six days to complete, if you're okay with that. And then you could certainly put the SPDX information into the build itself: have it be an artifact and have the packager include it. I mean, you don't necessarily even have to put it in the RPM or DEB or whatever; you could put it as another file in the deploy directory that goes along with the DEB. But if you wanted to put it in the DEB itself, you'd almost have to do two phases, because you're going to have to put it back into the image directory that's being used to create the DEB file. So you can do what's done now, which is the top-level, package-level, source-level information, and you can create the SPDX information, but I don't think you can get all of the build artifact information and the license analysis of it at build time unless you do it in a two-phase approach. Yeah. Yep. Kate, where are you? Yeah, because the SPD... oh, repeat it, sorry. Kate was saying that there should be a way to take the SPDX information that's generated and then be able to put it back into the package. Is that correct?
Yeah. Okay, so Kate was mentioning that, yes, the SPDX file should be able to be incorporated at the package level once it's been generated, and that's interesting future work we should try to collaborate on.

Okay, finally, with LID, things we think we can improve on. First, handling multiple licenses: we can do it, but with large files the performance gets pretty abysmal. What happens is, say you have a file that has both an Apache and an MIT license, just for the sake of discussion. The way it works is that the first step, detecting which license it is, will match both our Apache and MIT templates, and then step two identifies the region. During region identification, the way it works right now is: say Apache is at the top and MIT is at the bottom; it'll do the Apache region, then it'll almost pretend that never existed in the file and restart the first step, and this time it'll only get the MIT hit. We should really be working in the n-gram space instead of restarting the process this way, so that's certainly an improvement we have on our roadmap. Then, in terms of accuracy, it's really not that great at detecting short licenses, so that's another thing where I'm interested in hearing if people have ideas. Binary files, like Mark said, are an ongoing challenge: what do we do with these binary files? Do we just extract strings for matching? Let's talk about it. Finally, integration into other tools: could DoSOCSv2 offer this as an agent for generating SPDX? That's something we certainly want to explore. One quick item: a limitation of LID today is that we don't scan for SPDX identifiers. From our talk with Kate yesterday, one of the immediate actions I have is to add SPDX identifier detection to LID, so that it can detect the one-line SPDX identifier that stands for the whole license boilerplate; then you're not going off trying to understand what that boilerplate is, since everyone has already defined what the identifier means. So yeah, that's an immediate action for us.

All right, that's pretty much our time, and I think we do have time for questions if you have any. All right, thank you very much. Thanks everybody for coming.