Hi everyone. Yes, my name is Jeff Mendoza. I'll introduce myself a little bit. I work at Microsoft on open source compliance tooling, the kind of things that make sure that all of our open source use is legally compliant and security compliant. Part of my day job is also to work on the ClearlyDefined project, and that's what I'm here to talk to you about today: discovering your dependency license information, and how ClearlyDefined might be able to help you out with compliance.

So first of all, what is open source? What is the license? Of course, the license is what makes open source open source. If you can see the code, but you can't modify it, you can't play with it, you can't redistribute it, it's not open source. So if we all really appreciate the freedoms that the licenses give us, we should show that appreciation by complying with the requirements that the licenses impose. Some examples of requirements that licenses have are attribution, which just means saying "I got this open source from so-and-so"; retaining the copyright statement of the open source that you're reusing; potentially giving an offer for the source if you're distributing binaries; and, if you're making changes to the software, maintaining a change log or marking the changes that you made.

Depending on the obligations of the particular licenses that your dependencies have, you may make choices about which dependencies you take on in your project. Those choices may affect you, or they may affect users of your code, your library, your tools, and then you'll have to take compliance actions based on those licenses and the dependencies that you choose. So there are two steps to figuring out what you're going to do. The first is knowing your dependencies, which, depending on the language, is sometimes a challenge. The second is knowing the licenses of your dependencies.
This is where ClearlyDefined will help you out, but I'll dive into that a little bit later. First, you may have a question: do licenses affect me?

Let's say you're a library crate producer. You just have a library, a tool, some useful methods, you put it up on crates.io, and you don't do anything other than that. Well, you're not actually distributing your dependencies, you're not even using them locally, but the dependencies that you choose have an effect on the people that consume your crate. Let's say you want everybody to use your crate, but you have a dependency on something that maybe people don't want to use. That might be a reason why you want to know your dependencies and their licenses.

If you produce tools, again, you could just put your crate up on crates.io and have people download it, but you might also want to distribute a binary of your tool. In that case, you're also distributing all of your dependencies, so all of the clauses that apply when you distribute the code take effect, and you have to comply with all those requirements. Maybe you have a web app that you write in Rust and want people to use; you put it up on crates.io, or maybe you run it publicly. Well, now you have to comply with any network copyleft obligations that those dependencies have. Or let's say you distribute a Docker image of your web app so that people can download it and run it quickly and easily. Well, now you're distributing all the dependencies that you have as well.

Okay, so jumping back to ClearlyDefined: what is ClearlyDefined? It's a project under the OSI, and it aims to be a central pool of knowledge of license information about open source software. License information usually amounts to about three things. The first is the actual license, and there's a specification called SPDX that we use so that when you say MIT, you mean this exact version of MIT, not some other variation.
So the first thing is the actual name of the license. The second thing is the attribution, the name of the person or entity who owns the copyright. And the third thing is the source location. With some package managers, you can download the package, but there's no pointer to the actual source location, like which GitHub repository this package came from. So ClearlyDefined is a project to get all of that information into one central location.

One of the things it does is, when it downloads a source package, instead of just looking at the declared license, which in the Rust case is the one in the Cargo.toml, it actually runs scanners on all the source code inside the package, because people don't always declare all of the licenses that are actually in the source code. So it runs scanners on all of these packages, and then it shows you: here are the declared licenses, and here are the discovered licenses. If those don't match up, it gives the package a lower score, saying that there are licenses discovered that were not declared.

Another central point of the project is that running the scanning is actually pretty expensive, and there's no reason why anybody in the world has to run ScanCode, which is one of our scanners, on a particular package more than once, because that package with that hash is never changing. So once that scan is run, it's better to just store the results in a public location rather than everybody having to run these scanners all the time.

The next major thing, beyond just storing the results of scanners, is that of course scanners are not perfect. They're programs, and they can make mistakes. They can't cover every single case where every single license can be discovered.
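The declared-versus-discovered comparison described above can be sketched in a few lines. This is only an illustration of the idea, not ClearlyDefined's actual scoring logic; the function name and the sample license lists are made up for the example.

```python
def undeclared_licenses(declared, discovered):
    """Return discovered license IDs that the package metadata did not declare."""
    declared_ids = {lic.strip() for lic in declared}
    discovered_ids = {lic.strip() for lic in discovered}
    # Anything the scanners found that the author never declared is worth a look.
    return sorted(discovered_ids - declared_ids)


# Hypothetical package: the manifest says MIT, the scanners found more.
declared = ["MIT"]
discovered = ["MIT", "Apache-2.0", "BSD-3-Clause"]
print(undeclared_licenses(declared, discovered))  # ['Apache-2.0', 'BSD-3-Clause']
```

A non-empty result is the situation the talk describes as lowering a package's score: licenses were discovered in the source that the declared metadata does not mention.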
So there's a community-based curation process in the ClearlyDefined knowledge pool. If a scanner says, hey, I couldn't determine the license of this package, but somebody else can read it and see that it's in a different wording, or for some reason the scanner didn't pick it up, then a person, usually a lawyer, can say, hey, the license of this package is actually MIT. Then it can be peer reviewed by other people in the community through the typical GitHub PR process, and the whole community can agree on what the license of a particular package is. And the end goal of the project is that, if there are problems with packages where declared versus discovered doesn't match or curations are needed, the owners of the package can go and fix their own package, so that maybe in the future this project won't be necessary, because the declared license on all packages will just be very clearly defined.

So great, how do I use ClearlyDefined? Well, it's got a website. I was gonna show it to you, whoops, right here. So you can go browse, and of course ClearlyDefined supports many different package types. Here I'm looking at a few crates, and you can see some examples of the licenses that are found for these particular crates.

But I know you're thinking, hey, I'm not gonna go to a website and look up all these licenses. There must be a simpler way to use this. So I actually wrote some tooling that hits the API, based on a crate, your Rust package, to show you how easy it is to detect the dependencies you have and then query the licenses for those dependencies. I'm gonna cover that right now. The tooling's up on GitHub here, and the crate is here. Quickly, what it does: it looks at your Cargo.lock, and it has a few different command line tools that work from that. The first tool converts your Cargo.lock into a format that ClearlyDefined can understand.
Then there are two more tools. One takes that format and queries ClearlyDefined for just the license information, output as a CSV, something you might wanna store as an artifact with your build. The other tool, CD to notice, takes the output of that and creates a notice file for you. A notice file is something that helps you when you're redistributing your dependencies. The most common license obligation is that when you redistribute, you provide the notice, and this completely automates that process for you.

And I'm gonna go ahead and show you an example of how it works. Sorry. Yeah, I'll show you the code here first. So the first thing it does is, like I was saying, give you the output of all my dependencies in a format that ClearlyDefined understands. And then here's the one that goes ahead and queries ClearlyDefined for each one of those packages and versions and gets you the license. So this is something you could run.

The other thing I wanna show you is that, for my tool, I hooked it up to CircleCI. Whenever I make a git tag, I have a workflow that creates a GitHub release and attaches the binaries of my tool. And of course, since I'm attaching the binaries of my tool to a GitHub release, I'm making a distribution of all my dependencies. So I wanna generate a notice file and put that inside of my distribution. Part of the build process is the build of the tools, part of it is the notice generation, and those get merged into the publish step here. Right, so here's the notice generation, doing a cargo install of my tool and then running the notice generation. And then on the publish, I'm just tarballing it and attaching it to my GitHub release. So here's the actual tarball that I'm attaching to my GitHub release. It's the binaries of my tool and then the notice file. So let's look at the notice file.
With it, I'm correctly fulfilling all of my dependency license obligations. So for example, for CloudABI, I have the copyright, and all of this information comes directly from ClearlyDefined. So I know it's been peer reviewed and all the licenses have been detected. Here's the crate, okay, okay.

And this tooling, actually I wanna show you the source real quick. It's very simple. For example, I'll show you the, yeah. So it's only a few lines of code. Again, you don't really need these tools to do this if you don't want to. If you're in an organization which already has a CI/CD system or a build process, hitting the API and querying for things like the notice or the license is extremely easy to do. The tool is just an example of how you can do that and how you can integrate it into, in this example, CircleCI.

The REST API is linked here. We have Swagger docs. The main thing that you'd wanna be doing is querying the definition by type, provider, namespace, name, and revision. The type, provider, and namespace would be the same for all crates, and then the name and revision would just be the name and revision of your particular crate. That's very simple. And then again, the notice API here will generate the notice for you automatically if you give it the list of your dependencies.

So maybe you're thinking, this is cool, how could I help? First, if you take anything away from this talk, it's not "go use ClearlyDefined", it's that you should respect the licenses of your dependencies, and by respecting them I mean complying with the obligations that they impose. And then secondly, if you are interested in ClearlyDefined, take a look at the tool, take a look at the licenses of the crates that you know or that you use. If you see an error on the website, curate your dependencies.
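The definition query described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the speaker's tool: it assumes the public API root is `https://api.clearlydefined.io`, that definitions live under `/definitions/{type}/{provider}/{namespace}/{name}/{revision}`, and that the declared license sits at `licensed.declared` in the returned JSON. The helper names and the sample response are made up for the example, and falling back to `NOASSERTION` when the field is missing is a choice of this sketch.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

API_ROOT = "https://api.clearlydefined.io"  # assumed public API root


def definition_url(name, revision, type_="crate", provider="cratesio", namespace="-"):
    """Build the definition URL; type/provider/namespace are fixed for crates."""
    parts = [type_, provider, namespace, name, revision]
    return f"{API_ROOT}/definitions/" + "/".join(quote(p, safe="") for p in parts)


def declared_license(definition):
    """Pull the declared license out of a definition document."""
    return definition.get("licensed", {}).get("declared", "NOASSERTION")


def fetch_definition(name, revision):
    """Fetch one definition over HTTP (network access required)."""
    with urlopen(definition_url(name, revision), timeout=30) as response:
        return json.load(response)


# Offline illustration with a hand-written response of the assumed shape.
sample_definition = {"licensed": {"declared": "MIT OR Apache-2.0"}}
print(definition_url("serde", "1.0.104"))
print(declared_license(sample_definition))
```

In a CI job, calling `fetch_definition` for each coordinate from the lock file and writing the name, version, and declared license to a CSV would reproduce the kind of build artifact mentioned earlier in the talk.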
As I was mentioning, if there's an error, anybody can go submit a curation for it to be peer reviewed. So for example, I was looking at this package, and its discovered license shows NOASSERTION. NOASSERTION means that the tool found legal text, but it didn't actually know what the license was. And if you go look at the files, oh, I had found it before, but it's not here. I think it's, oh, here it is. You can go and click on edit, type in what you think it is, and click submit, and it'll open a GitHub PR to our curation repository with your curation.

The other thing is, if you see that a package is not being detected correctly: ClearlyDefined is actually just running these underlying scanners. One is ScanCode, one is FOSSology, and there's Licensee. Those scanners are not perfect, and they're really great open source projects that could use contributions to help detect licenses. That would be a cool thing to do.

Compute power: like I was mentioning, the scanning process is expensive. So there's a whole big queue of requests for things to be harvested, and there are machines that pull things off the queue, harvest them, and submit the results back. Right now we have compute resources donated by Microsoft, Google and Amazon, so any compute power donated to the project would be very much appreciated.

And then the last one, which I'll go into in a little more detail, is make your own crates clearly defined. So how do you even know if your crate is clearly defined? First, if you go to the website and you don't see your crate here, that means it hasn't been harvested yet. There's a web page, and there's also an API, to queue a harvest: you pick crate, type in the name, and that'll be queued to harvest. It might take a few hours to be harvested because there's a backlog of tools being run. And then check the results of your harvest. Are there any NOASSERTIONs?
Are there any licenses detected that you didn't know you had? Are there any licenses detected that you think are wrong? And then submit the curations like I was showing you before. And then the main thing would be to go and actually fix your package. Say there's a problem with the scanner and it's because your license text has a typo: go fix your typo. Yeah, and that's all I had. I'd like to open it up to questions now.

Yes, in the back. [inaudible] So I'm gonna repeat the question. The question is, should the scanning, or the double check of whether the declared license is correct, be a part of the package manager? I would say that would be great. I'm happy with Cargo as it is, but there are a lot of other languages that are in a way worse place than this right now. There are languages where, for example, when you have multiple licenses, it's very important whether it's an AND or an OR, but in many package managers the license is just a string or an array, and there's no way to tell. Many package managers don't have a field for source location, and that's something that has to be detected. There was actually a talk yesterday here in the dependency management room which was, package managers, go fix your stuff, and it had a lot of that kind of direction. So I don't know, I mean that's a good question and a good thought, but I think first we just need to have the right metadata, maybe as a requirement, and have the right language as far as ANDs and ORs and things like that around license type.

Over here. Yeah, I have a question. When I'm a user of such an open source package, can I trigger a harvest? And if I find something that's not correct, what would I do? Would I just send something to ClearlyDefined, or would I contact the original author? Yeah, so the question was, if I'm a user, not just if I'm an owner, can I trigger harvests?
And yes, absolutely, anybody can come over here and put in whatever they want, or use the API to trigger harvests on whatever package they want. And the second question is, what do I do when I find an error? I would say do both. The first option was, should I make a curation, like what we were doing here, which would change the information about this package in the ClearlyDefined knowledge base? Definitely do that. And the second option is, should I go and open an issue or try to fix the bug in the upstream package? And yes, absolutely; the goal is that packages don't need the scanning and don't need ClearlyDefined in the future.

The question in the back. Yeah, so the question was, how does curation work? I don't have it here. So what happens is, when you submit a curation, it opens a PR into this repo, curated-data. Once the PR is merged, ClearlyDefined will go and merge that into the central database. And if you see, oh, this might be, we have some bots that do curations too. Yeah, but anyways, you see there are curations on lots of packages that come through here. A lot of our lawyers curate packages, because we find issues and we don't see the correct information there. But there's a curation community. A little bit more about the project: if you go to docs and then get involved, you can see we have a Discord, and a lot of discussion happens there between lawyers about, hey, this legal situation is unclear, how should we curate this? We see a lot of community discussion, and it's really great, really cool.

Oh, yeah, one more question, two. No, I'm sorry, it was just the idea of having this funny conversation with a lawyer. Yeah, I really like a lot of our compliance lawyers.

Yeah? Yeah, I just had a question about the scan. Does it support a package that has multiple licenses, which I guess would be the case if you had all of your dependencies vendored?
People use multiple licenses for different reasons. One is that they put in a sub-component, they just check it in, as you would if you vendor, and they would have this piece under this license and that piece under that license, and the license would be an AND. Other people put multiple licenses just because they feel like it, and that usually is an OR. They say, I would like you to be able to use my package under this or that, it's your choice.

On license conflicts, do you find cases where one license actually has something in it that can contradict something that's in another one? And if so, how? So, I'm gonna repeat the question. The question is, what do we do about license conflicts? ClearlyDefined, the project, doesn't really give you legal advice; it's mostly just trying to tell you what the licenses of the package are. In that case, you have to know, well, GPL can't be combined with these other licenses that have different requirements, because you can't add requirements to things that are linked with GPL, for example. That's not usually part of ClearlyDefined, but again, if you have a system that's doing the detection and needs the license information, you can get the license information from ClearlyDefined, and then you can write the rules based on the policies of your organization and your legal guidance.

Yes? I'm doing some packaging for [inaudible]. Cool. And yeah, as you might assume, most of the work is researching license information. So this project is clearly a benefit for us. Is there any kind of tracked reasoning, if the file is curated through the ClearlyDefined system, that I can point to for somebody who is interested in, is this really the license for this file? How did you evaluate it? Okay, yeah, so the question is about the transparency of the curation process. All the curations are PRs, and we do have the full Git repository history of all curations that have happened on that package.
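To make the AND-versus-OR point from the multiple-license answer concrete, here is a small sketch of the kind of rule an organization might write on top of the license data. It evaluates an SPDX-style expression against an allow-list, treating AND as "every part must be acceptable" and OR as "any one choice is enough", with AND binding tighter than OR. It is an illustration only: it ignores `WITH` exceptions and the `+` operator, the function names and the allow-list are made up, and it says nothing about whether two licenses actually conflict, which remains a question for legal guidance.

```python
import re


def tokenize(expression):
    """Split an expression into parentheses, operators, and license IDs."""
    return re.findall(r"\(|\)|[^\s()]+", expression)


def is_acceptable(expression, allowed):
    """Return True if the expression can be satisfied using only allowed licenses."""
    tokens = tokenize(expression)
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos].upper() == "OR":
            pos += 1
            right = parse_and()  # always parse, then combine
            result = result or right
        return result

    def parse_and():
        nonlocal pos
        result = parse_atom()
        while pos < len(tokens) and tokens[pos].upper() == "AND":
            pos += 1
            right = parse_atom()
            result = result and right
        return result

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if token == ")" or token.upper() in ("AND", "OR"):
            raise ValueError(f"unexpected token: {token}")
        return token in allowed

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unsupported or trailing token: {tokens[pos]}")
    return result


# Hypothetical organizational allow-list.
ALLOWED = {"MIT", "Apache-2.0", "BSD-3-Clause"}
print(is_acceptable("MIT OR GPL-3.0-only", ALLOWED))   # True: MIT can be chosen
print(is_acceptable("MIT AND GPL-3.0-only", ALLOWED))  # False: both parts apply
```

The same input with the operator swapped gives opposite answers, which is why a package manager that stores licenses as a bare list, with no AND or OR, leaves this kind of rule unable to decide.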
And again, if you go to the site for a particular package, down here you can see the raw data, and of course it's available via the API. The definition is the merge of the harvested data and the curations. You can also directly query the curations and the harvested data, which is the output of the tools, and see where you're getting what from. All right, time's up. Thanks all again, I really appreciate you coming here.