Good afternoon. Thank you for joining me so late in your day. My name is Mark Gisi, and I'm with Wind River Systems. I manage all the third-party open source compliance for Wind River, and I also help with the contribution program. Today I'm going to talk about a topic that I think is closely related to compliance; it has a lot in common with license compliance, but it has its differences as well. But before I get going here, I wasn't expecting such a large room. I was hoping to have a little bit more of a dialogue. However, let me start out and just get a sense: how many people here have had to work with export clearance or classification, or had to work with an export team? At least half. Is there anybody in the room who's on the export office side who has had to work with engineering? OK. We can pick on them, then. So I needed to put this slide up. It's a little extra disclaimer, because I'm covering a topic that our legal team doesn't feel great about me talking about, but rest assured that I'm not going to touch on anything legal; we're just going to focus on basic processes and tooling. I borrowed this slide from Jim Zemlin, which makes the basic point of a growing trend: in the average company, about 30% of the product is open source, but the ones who are really ahead, the best in class, are at least around 80%. I can say our organization is very much in the 80%-or-more category. And that's not a threat to engineers, as if there's not a lot of code being written. It just means that in order to do it right, and to do it in a more modern way, you're borrowing and standing on the shoulders of the giants who came before you, and you're gaining a much bigger competitive advantage over those who don't. How many people here believe they're in this category?
Or if you don't know, obviously you don't know. But does anybody here feel like they're in that category in terms of the amount of open source in their products? And this idea that commercial offerings have a lot of open source also carries over to open source projects. We often think of BusyBox, for example, as having about 2,000 files under the GPLv2. If you ever look inside it at all the licenses that are actually in the files, you're going to see something like this, which really just says that BusyBox came from borrowing from others, standing on the shoulders of other giants. And that's how the open source movement works. It's all a great thing. We just have to be mindful of the fact that that's how it works. Similarly, the Linux kernel is generally thought to be under the GPLv2. But when you look inside the kernel, you're going to find a lot of other kinds of licenses. Obviously, the dominant license up there is GPLv2. How you configure the kernel will determine whether you have to pay attention to the other licenses or not. But again, it's a perfect example of where sharing happens. So as far as I'm aware, there's not a lot of formal discussion around this idea of export compliance. But the fact is, the same issues that we deal with on a daily basis with licensing, and trying to get all that right, carry over to export compliance. And when I say export compliance, we're talking about software. When you deal with an export team, they're going to deal with a lot of other kinds of products, possibly hardware included. But software has a very specific requirement when it comes to getting export clearance, and it largely centers around the existence of cryptography. And there's a general workflow between the engineering group and the export group, just like you have on the licensing side between legal and engineering.
The one thing I want to highlight, and we're going to cover it a little later, is that people often want to talk about export control classifications (ECC) and all this stuff. But the best way for us as an engineering group to help when working with the export team is to focus strictly on identifying cryptography, and to avoid this notion of trying to make judgment calls about classifications, because you're going to find that every country has a different set of rules. We're going to go into that a little more. So it is actually quite similar to the early days of the license compliance world. There was a lot of, I wouldn't say friction between legal and engineering, but there were different mindsets. And when you're now dealing not necessarily with legal, but possibly with an export trade group, there's still, even though there is a process, and we're going to talk about a basic workflow for export clearance, a lot of work that has to be done. And we have to work with the export team, because they're just getting started working with engineering. Again, I have to say this over and over: whenever we're dealing with export topics, and if we ever define something such as export compliance, we have to avoid giving legal advice, and we have to avoid giving export classification advice. I don't know how many people here are familiar with SPDX. SPDX is a standard way of representing the licensing of a given package in a single file. Remember I put up the BusyBox and kernel tables of licenses? That's kind of what SPDX is: a single place you can go and query to get the licensing for all the files. One of the big problems was that in the old days, when you passed software along the supply chain, one organization would do that whole analysis and go through all the files.
And then they'd pass it on to another company, who would then go through the same work, doing the same analysis. So one of the driving forces behind the SPDX standard was to come up with a record format similar to patient records. Imagine trying to share patient information between a doctor, a hospital, and an insurance company: having a standard record makes things a lot easier. So does SPDX. But there really is no standard way of sharing encryption evidence between organizations. And the other thing that's really interesting is that export rules sometimes clash with copyleft requirements. That doesn't mean they're contradictory; it just sometimes changes the classification. So you might go to your export team to get clearance. You fill out a form, and they come back and say, all right, you're only shipping binary; based on that, they give you a friendlier export classification. However, if you have copyleft code in there, that triggers your obligation to give out source code, which often forces you to go back, because they didn't realize you were giving out source code, and it changes the classification. So moving forward, I want to focus our discussion now on the basic workflow between export and engineering. But I want to highlight, again, that ultimately you may work for one company in one jurisdiction or one country, but every single country has a different set of rules on how to interpret how you handle encryption code. Again, in this discussion, I can't touch on anything legal-related, and I have to advise everybody to contact their export trade group. The other thing is that there is a common workflow I find when talking to a number of different companies. We're going to have a slide in one second, and I'll show you that flow. But what I describe as one of the most important things is this notion of a gap, or chasm, between export and engineering: a misunderstanding that creates a lot of problems.
And we're going to talk about several ways of addressing that chasm. So let's talk about that chasm. In the typical situation, engineering knows a lot about software and coding, but may not know a lot about filings with governments and all the regulations. And likewise, you have export, which knows a lot about the filings and the ways of classifying stuff, but doesn't know much about software, not even necessarily how to identify different types of cryptography. But there is a common API that most companies have between engineering and export, and it's usually a form you fill out. It can be quite lengthy. You answer a lot of questions about your software, again with regard to cryptography. And naturally, you're going to have a lot of open source in your products, and you have to answer questions about that open source. The challenge is, in the old days, when someone wrote cryptography software and the export team asked them about it, they knew very well what it was and where it was. But today, because people are grabbing so much open source software, when they ask you to fill out that form, it's often almost impossible to answer the questions if you're dealing with hundreds of packages. So what happens is you end up with the classic situation of not quite filling out that form correctly, and you have garbage in, garbage out. And I actually talk to export teams, and I ask, do you understand where this code comes from? First, they say, well, the developers are required to know what's in their code. It's a requirement. And I try to explain to them how open source works and how much developers leverage it. And they still turn around and say, fine, but they're still obligated to know what's in the code. It's probably true, because there's no one else who could possibly know. So ultimately, we need to do something to solve this problem. So there's plan A. And it's similar to back in the old days of licensing.
You can have the developers rummage through thousands of files and try to create a table of all the licensing in those files. But having developers review all that code is actually quite unrealistic. And even if they had an infinite amount of time and could review all that code, there's a good chance they may not be able to identify cryptography in all its different forms. Again, five or ten years ago, when someone wrote cryptography, they knew it because they wrote it. But now there's so much more of it, and it's so varied, that it's hard for every engineer to know all the different types of protocols, algorithms, and so forth. So there's plan B. Naturally, we want to automate. We want to write tools. And that's happening; we're going to talk about that. But there's still a problem, because when you build these tools, they do a pretty good job at finding things, but they do an even better job at finding false positives. You're often dealing with a larger number of false positives than actual positives. And again, you're going to have to have at least somebody in your organization who understands this stuff. You have to go and reach out to them, and a lot of the time they're going to get swamped, because all the different teams have the same need. And then there's plan C, the obvious solution: automate as well, but this time designate certain people, or train your engineers, to be able to identify cryptography in code in all its different forms. This is something we've done at some level with licensing, but this one gets a little more technical, because of the breadth of protocols, algorithms, and so forth. But I think this is probably the most realistic approach, which no one's really thinking about right now: we probably have to make the ability to identify cryptography part of engineering training. Otherwise, you're going to have to designate a set of people who can review those results.
But in the end, obviously, it's going to take resources and engineering time. So for now, I only see a plan C, but still, we're not there yet. We're going to talk about tooling today, and that's still in its infancy. But the missing piece in that process is the training. So next, I want to turn to some tools that are available that can help you. Has anybody here ever used FOSSology? FOSSology is a Linux Foundation open source tool. And the Crypto Detector is a tool that we've been developing that we're going to open source. When we talk about these tools, what they're looking for in the code is evidence, cryptography evidence. The types of things would be these: algorithms, crypto libraries, protocols, and some other things, such as comments in the code. Comments are actually quite revealing, too. FOSSology is a server-based solution. It was originally developed by HP, and its development is currently being led by Siemens. It was initially developed to do license analysis, and they recently extended it to do what they call ECC analysis. But I think the term ECC analysis is very dangerous. They do a good job at detecting cryptography, don't get me wrong, but a lot of legal groups and trade groups get very nervous when engineering groups start using terminology that the trade groups are expected to deliver on. But let me give you a quick demo and show you what FOSSology is about. So basically, you can upload any package you want. We'll come here and select a package; we'll do SQLite. In FOSSology, they allow you to select different kinds of analysis engines, for example the ECC analysis. Another one would be Nomos; that's their main license detector. I'm going to choose both, and I'm going to go ahead and submit that. So it's going to upload to the server.
It's pretty quick, actually. Once it's done, we'll go check the queue. Maybe the upload is not quick; we'll have to wait a little. Actually, I can probably go to the queue anyway, because I have other jobs there. Let's go there. OK. So here, I did it earlier; here's an earlier version of it. It gives you lots of different kinds of analysis, as I mentioned, and the one we're going to look at today is the ECC analysis. What it does is list out all the hits, potential evidence of things that might be considered cryptography. Actually, they have a broader definition of ECC analysis, and I think they extend their list to things one would normally not include, such as, I guess, military terms. But this is one of the reports you'll get. What it's saying here is, these are all potential hits of terms they found in the code. It allows you to look at them, and then you can delete the ones you think are false positives. So for example, click on this one. It found a hit in this file called random.c. If you go down to random.c, you're going to see some text here, which it identified using a keyword-matching approach. You look at that code, and you determine whether that was a valid hit or not. And if it wasn't a valid hit, if it was a false positive, then you can delete it. I believe some of these are false positives; they're just ECC terms. I don't know if I can even find them. OK, let me go back. And we can look at another example where it comes up empty, actually: the package bzip2. If I go over here and look at the ECC analysis, you're going to see zero hits. So in that example, based on its keyword-matching approach, it found no encryption. Is that a guarantee? No. But what I'm finding is that you get pretty good success rates when it doesn't find things. The problem is that it tends, again, to identify a lot of false positives. OK, so that's FOSSology.
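The keyword-matching approach both tools use can be sketched in a few lines. This is only an illustration of the idea, not FOSSology's actual ECC agent: the keyword list below is a tiny made-up sample (real tools ship far larger, curated lists), and it deliberately shows why false positives are so common.

```python
import re

# Illustrative keyword list -- an assumption for this sketch, not the real
# list shipped by FOSSology or the Crypto Detector.
CRYPTO_KEYWORDS = ["aes", "blowfish", "rsa", "sha256", "openssl", "crypt"]

PATTERN = re.compile("|".join(re.escape(k) for k in CRYPTO_KEYWORDS), re.IGNORECASE)

def scan_text(text):
    """Return (line_number, line, matched_keyword) for every potential hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        m = PATTERN.search(line)
        if m:
            hits.append((lineno, line.strip(), m.group(0).lower()))
    return hits

sample = (
    "static void aes_set_key(void);\n"  # genuine crypto evidence
    "char universal_id[16];\n"          # false positive: "rsa" hides inside "universal"
)
for hit in scan_text(sample):
    print(hit)
```

The second line shows the core weakness the talk describes: naive substring matching flags "universal" because it contains "rsa", so a human still has to review every hit.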
The next tool I want to talk about is a tool that we ended up developing internally. And how it came about might be similar to what happens in other organizations. Different teams all had the same need, and naturally, everybody likes to write tools. So we had several different initiatives happening around the company. This is largely due to what I call the silo effect when you follow an agile process. What we decided to do was create, internally, a kind of open initiative: in other words, bring together all this code, bring people from different teams together, and have them work on a single tool. We used an internal GitHub repo, and we planned on open sourcing the tool. Now, the irony of this talk is that I was supposed to open source it at this talk. I was going to go into GitHub and make it public. But I didn't get my export clearance done yet. And the reason why is that they got really nervous. The thing is, it doesn't actually have that much encryption. It has some hashing of files for identification purposes. But when they looked at it and understood what it was trying to do, they felt it needed further analysis. I gave it to them about two weeks ago. My hope is that we'll have it available in the next week, but I need to wait for that clearance before I can move forward. As I mentioned, we have it teed up on GitHub, and we're all ready to go; we just need that final word. But I'm going to talk about it a little so people will understand exactly what it is. It's written in Python. It's a Python script. This was largely done to make it easy to integrate into other applications, which I'll show you. Now, FOSSology is actually very powerful. It's a server-based solution. One nice thing about that is you can just bring it up and upload all your stuff. But you don't have as much flexibility trying to integrate it with other solutions. That's not completely true; it does have an API and a command-line interface.
But the Python-script approach, I think, is a lot easier. By the way, I'm not trying to pit one tool against the other. In fact, we've talked to the FOSSology team; they might try to incorporate this tool as another agent in their analysis. What we did, though, is choose to do two things. We're using keyword matching, which is the common way of doing this. But we also decided to implement API matching, which, as you can imagine, is simply identifying a whole collection of cryptography libraries and being able to match for those as well. The output file is where you get a lot of the flexibility. It's put into a JSON format. The idea was to create a really simple, well-defined structure so that other tools can pull it into their build system or wherever you need it. I'm actually not used to the big stage. I'm used to looking up there, but I do have a teleprompter; I'm just not used to it. Let me give you a quick demo. I'm going to go back, and I'm running it on this one package called curlpp. It's going to take about 50 seconds to get through. Actually, let me go ahead and pull up an example while we're doing that. So while it's doing that, here's an example of an output file in JSON. For every hit it finds, it identifies the line and the string, but more importantly, the method used, in this case keyword matching, and also the type of match. And what it reports is a hierarchy of different kinds of encryption based on the type and the algorithm. We think it's important, when working with the export team, to get that identification down to that granularity of information. Let me pull up the list, and I'll show you the keywords that we use. Hold on a second here. What you're going to see here are the different kinds of matches you can make and the regular expressions you can have. And we do this for both the keyword matching and the API matching.
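A single hit record in that JSON output might look roughly like the sketch below: the matched line, a few lines of surrounding context, the detection method, and a hierarchical match type. The exact field names here are assumptions for illustration, not necessarily the real Crypto Detector schema.

```python
import json

# Build one illustrative hit record. Field names are assumptions; the real
# tool's JSON layout may differ, but the information content described in
# the talk (line, context, method, hierarchical type) is the same.
def make_hit(path, lineno, lines, keyword, match_type):
    before = lines[max(0, lineno - 3):lineno - 1]  # up to two lines of context before
    after = lines[lineno:lineno + 2]               # up to two lines after
    return {
        "file": path,
        "line_number": lineno,
        "line_text": lines[lineno - 1],
        "lines_before": before,
        "lines_after": after,
        "method": "keyword",            # or "api" for library-call matching
        "match_type": match_type,       # hierarchical, e.g. algorithm/symmetric/aes
        "matched_text": keyword,
    }

lines = [
    "#include <stdint.h>",
    "void aes_encrypt(uint8_t *block);",
    "int main(void) { return 0; }",
]
hit = make_hit("src/cipher.c", 2, lines, "aes", "algorithm/symmetric/aes")
print(json.dumps(hit, indent=2))
```

Because each hit is a flat, self-describing JSON object, downstream tooling (spreadsheets, summary reports, databases) can consume it without knowing anything about the scanner itself, which is the flexibility the format is aiming for.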
You can actually specify the keywords at the language level, and you can easily extend this file as well. But you'll notice the different granularities of classification for the types of hits that we're getting. Our hope is that by open sourcing this tool, obviously we'll get it out there and it will be very useful to others, but we're also hoping for others to participate and contribute, and help make the keywords and the classification levels more extensive. So I'll go back to the tool. OK, hold on. Here's another example, where you can see it here. For each hit, we have the line of text that it matched, the few lines before and the few lines after, as well as specific information on where the hit took place, the type of hit, and the method used. So, back here. What's really worth highlighting is how we're integrating this and using it to analyze large sets of packages. What you have here is a spreadsheet which does an analysis of almost 1,200 packages. And what you have at the top here is the total number of hits, and then the hits that were found only in the source code; the total could have included documentation as well. We're trying to figure out: if we just focus on the source, do we get a big reduction in hits? We're trying to reduce the number of false positives. But what's really interesting in this table is that you can see there's a fair number of packages in Linux that do not come up with any hits at all, right? Out of about 1,200, until about here. What this means is, if you had to handle a large set of packages, and you were able to run the tool and load that data into a spreadsheet, you could easily streamline the process by quickly writing off the ones with zero hits. Now, of course, there's no guarantee. But we have found that a large percentage of the time, when a package contains encryption, at least one hit does occur.
The real challenge still exists: out of 1,200 packages, you still have about 700-plus with at least one hit, and someone has to go through and determine whether each is a false positive or not. Now, if I go all the way to the end of the list, not surprisingly, you're going to find certain things you expect to have a lot of hits, such as OpenSSL. The other thing we're doing: Wind River delivers a commercial-grade, Yocto-based Linux to its customers. And one of the things customers are asking us for as part of that delivery is help getting a high-level view. So from that same output file, we're able to create very simple executive summary reports as well, so someone doesn't have to dig through everything. And if you're going to deal with your export team, you can just give them a quick summary of the packages and whether a particular algorithm or protocol was found. And then finally, again using the same output file the tool gives, we're able to generate another drill-down kind of report. I think this is where you'll understand the benefit of the format. Here again, it gives you this high-level executive summary of the kinds of things it found, with percentages. But then you can easily write tooling on top of that to, hold on a second. So what this is, is the report. If I expand this, again, you see the different algorithms. And then you can click here, and again, this is all based on the output file. I can drill down, and it'll show me the particular line where it found potential encryption. This is very useful for someone who has to go through it, even if they're the expert on cryptography: the drill-down saves a lot of time. Again, you're going to see this on a file-by-file basis. And the goal, actually the next step, is to allow someone to go through all those packages in Linux, keep all the positive hits, and store them in a database.
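The triage step described above, writing off zero-hit packages quickly and queueing the rest for human review, can be sketched like this. The package names and hit counts below are made up for illustration.

```python
# Split a set of scanned packages into "no evidence found" (can be written
# off quickly, with the usual caveat that this is no guarantee) and a
# review queue ordered worst-first for the human expert.
def triage(hit_counts):
    """hit_counts: dict mapping package name -> number of crypto hits."""
    zero = sorted(p for p, n in hit_counts.items() if n == 0)
    review = sorted(((n, p) for p, n in hit_counts.items() if n > 0), reverse=True)
    return zero, review

# Illustrative counts only -- not real scan results.
counts = {"bzip2": 0, "zlib": 0, "openssl": 4182, "curl": 37, "sed": 0}
zero, review = triage(counts)
print(f"{len(zero)} packages with zero hits: {zero}")
print("review queue (worst first):", [p for _, p in review])
```

Sorting the review queue by hit count puts the obviously crypto-heavy packages (the OpenSSLs of the world) at the top, where the classification answer is usually clear, and leaves the long tail of a-few-hits packages, where most of the false-positive review work actually lives.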
So the next time we get a later version of the package, we don't have to go through this whole effort again. Of course, you always run the risk that a new version might introduce new cryptography. You could always do a diff between the two versions and look only at the diff. So our goal is to create that opportunity, where we can give good, clean reports on the level of encryption in each package that comes with Linux. Yes. What level of detail does the export team want? Yeah. Actually, I don't know; I'm not an export guy. No, in all fairness, I try not to be. What they typically have is this: they get this high-level report, and they should be able to see things that are concerning, certain kinds of protocols or algorithms. And this keyword list is actually being developed in conjunction with them. What I have found, though, interestingly enough, when I talk to some export people, is that they really don't have a strong grasp of exactly what the different cryptography algorithms are. And if they were looking at the source code, I don't think they would be able to identify them. Yet they're expecting you to do it. I think there are certain things, like OpenSSL, that they get, and it triggers certain things. I would say this approach is still kind of early, and over time, through more of an exchange, we'll probably fine-tune the list, and we'll probably educate them, and they'll educate us about what level of information they need. I would describe this higher-level executive summary as a first attempt. The other thing they do is take that high-level summary and file it along with their application for an export license, and let the government worry about what it means. Again, yeah. So, we're going to end a little early here. But in summary, export compliance is still relatively new. I don't see anybody talking about it as a more formal process, okay?
I'm hoping this kind of talk will begin that initiative. I'm actually part of what's called the OpenChain working group. How many people have heard of OpenChain? Okay, OpenChain is basically a way of having a set of requirements. It's like ISO 9000: it says, all right, our company follows a certain set of requirements that ensure a certain level of quality in manufacturing, for example. And they even apply ISO 9001 to software, which means you have certain controls in your software life-cycle process and certain guarantees that you're going to be disciplined. Well, OpenChain is there to say, here's a set of requirements for open source license compliance. If you want to be OpenChain conforming, you have to verify that you've met all those requirements. One of the things we're talking about is introducing the idea of having another program in there for export compliance. As I mentioned, it's really important for us to be careful and keep that discussion focused on cryptography. If we move away from that and start talking in the lingo used by the different governments, we're probably going to end up in a bad situation. Actually, what I'd like to see happen is for us all to agree on a certain kind of format, so we deliver what we would call encryption evidence in a way that allows everybody to apply their own rules to it. We also have to start working with the export trade groups to help them get better at process. I've actually seen one disclosure form about 24 pages long. That's just something that has to go away. I'll admit that the tools are relatively new, but the good news is that this is an opportunity for us to start developing new tools. The tool that we want to open source, the Crypto Detector, actually allows you to add new methods as well. Right now, I told you, it supports two.
It supports keyword matching and API matching, but if someone came along and wrote a binary-matching method, it's been designed to allow anybody to come along with a new method, or, obviously, to take existing methods and improve them. And that's pretty much what I have to offer you today about export compliance. I don't know if you have any particular questions. Yes. The answer is we haven't yet, but I think it's a fantastic suggestion. Quite frankly, our first step was to get two decent methods out that work. We wanted more than one because we wanted to prove this idea that you can have multiple methods. But this is exactly what we'd like to see happen. I think the real value here is when we open source this thing: with so many more eyeballs, it's just going to become a much better tool. And I think that would be a perfect example, where someone might come along and make that proposal. Yeah. So I did quickly mention it, but not at any length: there's no public database that I'm aware of. What we have done is build an internal database, what we call the tribal knowledge database. Any time we learn something about an open source package, whether it's related to a vulnerability, license issues, or, in this case, a list of positive hits, we store it in the database. A database is probably the next logical step. I think it's critical, because the problem with tools is that you scan once, and then the next time you get another version of the software, you do the whole routine again, right? And to your point, if we don't start storing this information in a smart way, we're going to be doing a lot of rework. So one of the things we'd like to take as a next step is to build on that database.
And again, as I mentioned, if you have a newer version, just do a diff against the older versions you scanned, understand the difference, do analysis only on that, and give that to a human. So this is really a combination of automation and human analysis; they both have to be brought together to solve this, and then stored in the database. I think that's the right direction to end up in. We're not there yet. Yes. Okay, so you're saying, for example, that if you configured it a certain way, you may not bring in certain files, right? Yes, that's a perfect example. So I quickly mentioned SPDX a little earlier. What it does is store all the files and all their licensing information, and what we want to do next is store the cryptography information for every file. There's been some work done on querying the licensing information in an SPDX file, and similarly, you're spot on that a very useful step would be this: once we store this in a common format, say we get it into the SPDX standard, then once someone hands you that data, yes, you can build on it, and you probably won't be as restricted by the regulations. So that's really where we want to go. We're not there, but I believe having this first pass, a tool that generates data in a standard JSON record that allows you to store it, is the next logical step to getting us there. Mm-hmm. Yes. Yeah, yeah, no. You're thinking exactly in the right way. And the most important thing to take away from here is that, one, there is a real problem. It's an okay problem, right? It's getting bigger because of the amount of open source everybody's using. But if we apply what we already know from past experience, and we start building the tooling, we're going to be able to deal with it in a very powerful way, in the ways you're recommending. But these are the steps; this is the beginning.
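The diff-and-rescan idea discussed above can be sketched like this: hash every file, compare against the hashes recorded when the previous version was reviewed, and send only the new or changed files back through the scanner and human review. The file names and data layout here are assumptions for illustration, not the actual tribal-knowledge database schema.

```python
import hashlib

# Content digest used to decide whether a file changed between versions.
def file_digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def files_needing_rescan(old_hashes, new_files):
    """old_hashes: path -> digest recorded for the last reviewed version.
    new_files: path -> file contents (bytes) of the new version.
    Returns the paths that are new or changed and so need re-analysis."""
    return sorted(
        path for path, data in new_files.items()
        if old_hashes.get(path) != file_digest(data)
    )

# Hypothetical two-file package: one file unchanged, one newly added.
old = {"src/main.c": file_digest(b"int main(void){return 0;}")}
new = {
    "src/main.c": b"int main(void){return 0;}",  # unchanged: keep prior review result
    "src/tls.c": b"/* new: wraps a TLS library */",  # new file: must be scanned
}
print(files_needing_rescan(old, new))
```

Combined with a database of previously confirmed hits, this keeps the expensive human step proportional to what actually changed between versions, rather than redoing the whole package every release.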
I'm hoping this talk is like the beginning of a more formal way of thinking about this problem. Any other questions? Okay, thank you very much.