All right, thanks everyone for coming to my talk. I'm Asankhaya Sharma, head of research and development at SourceClear. SourceClear is a software security startup focused on helping enterprises use open source safely, and that's the topic of my talk today. To give you a bit of background about myself: if I had to summarize the last ten years of my professional career in a single sentence, it would be that I've been focused on building security tools for software developers. I've done that at large companies like Microsoft, where we built CAT.NET, the code analysis tool for .NET, and at SourceClear, where we built the Lightman scanner. I've spent some time in academia, during my PhD and soon after, working at the National University of Singapore. I also have a couple of open source security tool projects that are of interest to developers. Most recently I've been involved in the design and implementation of Security Graph Language, which spans all three of these areas.

So let's set some context about why the use of open source and third-party dependencies is a big risk to the enterprise. Here is some data pulled from central repositories like Maven Central, npmjs.com, RubyGems, and so on, showing how the number of libraries in these repositories has grown over the last five or six years. As of 2017, there are over one million different libraries. If we extrapolate that growth over the coming years, by 2026 there are likely to be 400 million-plus different libraries and components; there will be more libraries than there are people on earth. Obviously that's a little tongue-in-cheek, we just extrapolated the growth, but if you think about ecosystems like npm, there are libraries for things like is-positive, is-negative, left-pad, and so on. The point is that there is an enormous number of components developers can choose from to build, or really assemble, their applications.

If you look at each individual application, there is a lot of complexity in it as well. This is data aggregated over scans done by SourceClear customers. On average, when you add one library directly to a Java application, you pull in four other libraries indirectly. Some languages are worse than others: in a Node.js project, adding one library directly ends up pulling in nine or ten other libraries transitively. I don't know what's happening with Python, which looks like an outlier in our data, but the overall pattern has been consistent across the languages we've seen our customers use.

To give you an example of a particular project, this is a sunburst diagram of the dependency graph for Apache Spark, a really popular open source project. Notice two things. First, the ring on the inside shows all the direct libraries, then their dependencies, then their dependencies, and so on. If you look at one particular library here, the Spark SQL 2.4.0 snapshot, it in turn pulls in four or five other libraries. Second, notice that this particular library, aopalliance, is pulled in two different ways: through this library, and through another dependency in the project.
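To see why that matters, here is a minimal sketch, with an invented miniature graph loosely modeled on that Spark example, of how a tool can enumerate every path by which a library is reached. This is purely illustrative and is not SourceClear's actual implementation; the package names and versions are made up.

```python
# Minimal sketch: given a resolved dependency graph (hard-coded here as a
# dict of package -> direct dependencies), list every path that reaches a
# target library. All names and versions are hypothetical.
from typing import Dict, List

GRAPH: Dict[str, List[str]] = {
    "my-app": ["spark-sql:2.4.0", "guice:4.0"],
    "spark-sql:2.4.0": ["spark-catalyst:2.4.0", "aopalliance:1.0"],
    "spark-catalyst:2.4.0": [],
    "guice:4.0": ["aopalliance:1.0"],
    "aopalliance:1.0": [],
}

def paths_to(graph: Dict[str, List[str]], root: str, target: str) -> List[List[str]]:
    """Depth-first search that collects every root-to-target path."""
    results: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        path = path + [node]
        if node == target:
            results.append(path)
            return
        for dep in graph.get(node, []):
            if dep not in path:  # guard against cycles
                walk(dep, path)

    walk(root, [])
    return results

for p in paths_to(GRAPH, "my-app", "aopalliance:1.0"):
    print(" -> ".join(p))
# Prints two paths, one via spark-sql and one via guice, which is exactly
# why the same library can show up more than once in the sunburst diagram.
```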
So the problem is complex, because each dependency pulls in multiple other dependencies, and they may be pulled in through different paths in the dependency tree. If you build a new web app today, you don't write a lot of the code yourself. You start off with the operating system, the framework, the database, the web server, things like Postgres and Apache, which most of the time are open source. Then you write a little bit of custom code, your business logic, and you use a large number of libraries that are available to you, most of them open source or freely available third-party code. So a typical application today can contain as much as 90 percent open source and third-party code, that is, code you did not write yourself.

This is a really interesting visualization: a single graph representing the entire Node.js ecosystem, with every library and all of its dependencies. I picked it up from a public site, and you can go and play with it yourself. The basic point is that control over what goes into your code has shifted from you to developer tools, package managers, build systems, open source code, and third-party developers. All these libraries are not written by you, they may have very different maintainability and quality characteristics, and they represent the entire attack surface, or over 90 percent of the attack surface, of your application.

So let's look at some of the common threats that come in through the use of these third-party components. One of the most common is that these components may have vulnerabilities in them, known vulnerabilities in popular libraries: just by using a vulnerable version of a library, you are exposed to that vulnerability. Another threat is that it is very easy to publish a library. There are no checks or balances: you can write a library, run npm publish, and it is immediately available for everybody to use. So there are cases where people upload malicious libraries, libraries that are effectively malware and actively do something bad to your system. The third thing is typosquatting of package names. This is very similar to how people typosquat domain names: you buy a domain that sounds like a popular website, people land on it by mistake, and you sell them ads. The same idea applies to package managers. There is a popular library called React, so you publish your own package with a slightly misspelled name and register it on npm, and when somebody mistypes the name they end up pulling your library, and through that you can attack their build system. Or, as happened in the case of left-pad, the developer unpublishes the library, the namespace becomes available again, somebody registers their own package under that name, and whenever you build your project you end up pulling in the other library. And every package manager or build system allows some kind of command execution during install or build, so you can do things like data exfiltration, for example extracting environment variables that might contain secrets like AWS keys, because all of this runs on developers' machines and build servers.
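To make the install-hook risk concrete: every npm package declares its lifecycle scripts in its package.json, so a very rough check is just to look at what those hooks run. This is a purely illustrative sketch with a made-up keyword heuristic, not how any particular SCA tool actually works.

```python
# Rough sketch: flag npm packages that run code at install time.
# The suspicious-keyword heuristic is illustrative only and will produce
# false positives; real tools do much deeper analysis than this.
import json
from pathlib import Path

INSTALL_HOOKS = ("preinstall", "install", "postinstall")
SUSPICIOUS = ("curl", "wget", "http://", "https://", "env", "base64")

def audit_node_modules(root: str = "node_modules") -> None:
    for manifest in Path(root).glob("**/package.json"):
        try:
            pkg = json.loads(manifest.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, OSError):
            continue
        scripts = pkg.get("scripts", {}) or {}
        for hook in INSTALL_HOOKS:
            cmd = scripts.get(hook)
            if not cmd:
                continue
            marker = "SUSPICIOUS" if any(w in cmd for w in SUSPICIOUS) else "install hook"
            print(f"{marker}: {pkg.get('name', manifest.parent.name)} "
                  f"{hook} -> {cmd}")

if __name__ == "__main__":
    audit_node_modules()
```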
So these are all threats and attacks that are possible through the use of third-party libraries. In the next few slides I'll try to convince you that this is not theoretical; all of it has actually happened in the last two years, and I'll show you a few examples.

Everybody remembers Equifax. That was one of the largest data breaches in history, and the root cause, effectively, was that they were using an old version of Apache Struts, a Java web application framework, and they either could not find it or could not patch it in time. If you think about a large enterprise like Equifax, they often don't even know what code is running in which application on which servers; it's that complex. Either they didn't actively care or they didn't find it in time, it stayed unpatched, somebody broke through it, and that led to the data breach.

Here's another example from the Python world, on PyPI. Last year somebody managed to publish ten malicious libraries on the Python Package Index, actual malware, because, like I mentioned, there are no checks and balances and nobody reviews these libraries when you publish them. Another common attack vector, like I mentioned, is typosquatting. Here's an example from the npm world, where look-alike packages resembling popular npm libraries were published through a typosquatting attack. And this is the example I was referring to earlier: one of the popular things for these malicious libraries to do is data exfiltration. There was a package called crossenv that stole environment variables on install. Why is that useful to an attacker? Because of how you deploy your applications: secrets often live in environment variables, so when you build or install, you might expose your API keys or AWS credentials. This was a real example from last year of a package trying to exfiltrate environment variables.

All of this shakes the trust people place in the whole open source model. You have all these libraries, you've democratized the development of components, it's nice and easy, anybody can publish code, but it also represents a big risk for enterprises building applications today.

So how can we solve this problem, or at least be aware of it? When we started doing this four years ago there was no term for it, but the industry has now settled on a name: software composition analysis, or SCA. The goal is to discover vulnerabilities and licensing information for the open source components you use. In an enterprise context, license information also matters; I didn't really talk about licenses earlier, but companies want to know, for example, whether they are using a GPL library or not. And the way to do it, if you're familiar with other kinds of application security testing, is that you typically need some kind of scanner, plus data about the libraries, so that you know about their known vulnerabilities. So let me describe very briefly how the scanning technology can work. First, imagine you have a dependency lock file.
Something like npm-shrinkwrap.json, package-lock.json, or Gemfile.lock, where the dependencies are all explicit: your application depends on some library at a particular version, that depends on another library, and so on. You can parse this file, build the dependency tree, and then you know all the direct and transitive dependencies in the project. Or, if you're using a build system where lock files aren't supported, like Maven, Gradle, or Go, you first have to resolve the dependencies yourself. That typically involves doing a build, because that's the only way to really know what gets pulled in. So the scanner does the build, resolves the dependencies, and then constructs the dependency graph.

All of you have probably heard that GitHub recently launched its security alerts feature for JavaScript and Ruby projects. They do something similar to the first case here: they parse the file and tell you about the dependencies. But that approach is likely to miss certain things, and it's also the reason they can't support languages that don't have lock files where the dependencies are explicit.

You can do the same thing on infrastructure, going one level below in the stack: vulnerabilities in the runtimes or platforms you run on, like containers. You can do the same analysis on containers, on clusters of containers, and so on. In fact, there's a talk by Chris from Red Hat later today about how to do security in a container environment. The principles are the same: some sort of scanner detects the packages that are installed, say through apt-get or yum, and then builds the whole dependency tree.

So that solves the problem of figuring out what libraries or dependencies are in your application. The next question is: how do you know whether they are vulnerable? For that you need data. There are publicly available sources of vulnerability-related data. The most popular one is NVD, the National Vulnerability Database, which publishes CVEs. A lot of big projects publish their own advisories; for example, Spring has its own page listing all security issues, published as they come up and get fixed. You can join mailing lists like Full Disclosure or Bugtraq, where information about disclosures and vulnerabilities gets posted. However, what we've realized is that this represents a very small percentage of the known issues in these components. In ecosystems like Go or npm, rarely does somebody take the effort to register a CVE and publish the details; they just patch it and move on. A lot of this data resides in software engineering artifacts instead: commit logs, where somebody might just commit "fixing XSS" and move on, or bug reports.
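Just to illustrate what "mining the commit log" means at its most naive, you could scan a repository's history for security-sounding words. The keyword list here is made up, and this is only a crude baseline compared with the approach described next; it is not what we actually shipped.

```python
# Naive baseline: flag commits whose messages look security-related.
# The keyword list is illustrative and noisy; the ML approach described
# next replaces this with an NLP-based classifier trained on labelled data.
import subprocess

KEYWORDS = ("security", "vulnerab", "xss", "csrf", "injection",
            "overflow", "cve-", "sanitiz", "exploit")

def security_like_commits(repo_path: str):
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=format:%h|%s"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in log.splitlines():
        sha, _, subject = line.partition("|")
        if any(k in subject.lower() for k in KEYWORDS):
            yield sha, subject

if __name__ == "__main__":
    for sha, subject in security_like_commits("."):
        print(sha, subject)
```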
Because these are all open source projects, many of them have public issue trackers or Bugzilla boards, and someone will simply file an entry saying, "hey, this looks like a potential security issue, we should fix it." The same goes for change logs, pull requests, et cetera. The key thing about the data is that security issues are often not reported or publicly mentioned, so if you're only tracking NVD or CVEs, you're likely to miss a lot of these issues in libraries.

To give you a sense of how severe this problem is: this is data we collected over the last two years, where for each CVE affecting a particular library we looked at the first reference to that vulnerability in one of these other places, whether it was mentioned in the fixing commit, an issue, a pull request, a Bugzilla entry, or a JIRA ticket. The chart shows the average number of days the vulnerability remained unidentified: it might already have been fixed in a commit 60 days before you found any public mention of it in a CVE, a GitHub issue, a pull request, and so on. This tells us that if there were a way to mine all the data about these projects, we would be able to discover these issues well before they are publicly mentioned, become CVEs, or are announced by the projects themselves.

So that's what we actually did. This is the paper we presented last year at FSE, which uses NLP and machine learning to identify security issues from commit messages and bug reports. I'd encourage you to take a look at it, but at its core it's a standard classification problem: you take a commit message or a bug report, use NLP to understand it, and classify whether or not it's related to a vulnerability or security issue. And that's what we did here.
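Stripped down to a toy, that classification idea looks roughly like this. The training messages, the labels, and the choice of scikit-learn model are all made up for illustration; the paper's actual pipeline and dataset are far larger, and this assumes scikit-learn is installed.

```python
# Toy version of commit-message classification: label a handful of messages
# as security-related or not, learn a text classifier, and score new ones.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_messages = [
    "fix xss in comment rendering",
    "sanitize user input to prevent sql injection",
    "bump version and update changelog",
    "add dark mode to settings page",
    "patch buffer overflow in parser",
    "refactor build scripts",
]
train_labels = [1, 1, 0, 0, 1, 0]  # 1 = security-related

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(train_messages, train_labels)

for msg in ["fix crash when path contains spaces",
            "escape html to avoid script injection"]:
    score = model.predict_proba([msg])[0][1]  # probability of "security-related"
    print(f"{score:.2f}  {msg}")
```

With a real labelled corpus of commits and bug reports, the same shape of pipeline scales up; the hard part is collecting and labelling the data, not the model.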
Now, if you're thinking, "this all sounds good, but if I want to do it for my own projects, do I have to build all of this myself?", of course not. There are several players in this space today, including an open source project: Dependency-Track, from OWASP. There are vendors focused on application security in general, like Veracode, and players specific to software composition analysis, like SourceClear, WhiteSource, and others. The products differ in the depth of analysis, as I just described, and in the coverage of the data they have.

So the next question is: with so much choice and so many things to consider, how do you decide which tool to use? Last year we did some work on creating a benchmark for software composition analysis tools. We call it EFDA, the Evaluation Framework for Dependency Analysis. It's open source and available through our DevSecOps community on GitHub. It consists of a set of test cases for different languages and package managers, each with expected results. The benchmark was created artificially: somebody actually wrote, say, a POM file with known dependencies and verified how many get pulled in, so there are expected results for each test case. There are also criteria you can weight in a spreadsheet according to what you care about: if you don't care about a language, you put zero, and eventually it generates a score. So you can run these tests with your favorite tool and get results to compare against.

The next question is where you should actually do this analysis. A typical modern software development pipeline looks something like this: a developer writes code on their own machine and commits it to some kind of source control management system, which could be GitHub, Bitbucket, GitLab, and so on. From there it goes into a CI/CD system, the tests run, and eventually it's deployed to one of the cloud services in production. You could do the analysis in various places: just before the code is deployed, in source control, right in the developer's IDE, and so on. But based on our experience, the right place to do this analysis is in the CI/CD system, just after you build the code. If you remember the earlier slide about scanning technology: unless you do the build, you don't really know what bits get pulled in. Some of the indirect libraries may be loaded through the classpath or may not be declared anywhere. The only way to get the full picture of what gets built into the application is after the build happens inside the CI system, so that's the best place to do this analysis. At SourceClear, for example, we integrate with all the well-known CI systems, so you just add a command and, after your build completes, it runs the scan and gives you the results. But like I said, it can be done in different places; GitHub's own security alerts feature, for example, does this analysis at the source control level. They don't know how to build your project, but they have your source code, so they can look at your dependency file and tell you about the dependencies.

So if you're thinking about a modern DevOps flow, what you should really think about is how to build security around dependencies into that workflow. The first thing, like I said, is to integrate software composition analysis scanning into your CI pipeline. The second is to create an open source usage policy. The policy simply tells developers what kinds of projects or components they can and cannot use. For example, you could require that you always stay within the most recent major release of a popular framework like Spring. If you don't have a policy, what tends to happen is that your code goes out of date very quickly. Go back to the Equifax example: they were on a really old release of Struts because they had no active policy telling people to upgrade when a new release happens. The third is to fail builds on high-severity vulnerabilities. We've talked about detecting libraries and vulnerabilities, but what action can you take? For example, a developer commits some code, you run the scan, and you see that a library has a high-severity issue. You can actually fail the build. That's even more reason to implement this inside the CI system, so that the code doesn't leave that boundary.
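Mechanically, such a gate can be very simple. Here is a minimal sketch assuming your scanner can emit a JSON report; the file name and report format (a "vulnerabilities" list with a numeric CVSS "severity" field) are invented, since every tool has its own.

```python
# Minimal CI gate: exit non-zero if the scan report contains any
# vulnerability at or above a severity threshold, which fails the build.
# "scan-report.json" and its schema are hypothetical placeholders.
import json
import sys

THRESHOLD = 7.0  # CVSS score considered "high"

def main(report_path: str = "scan-report.json") -> int:
    with open(report_path, encoding="utf-8") as fh:
        report = json.load(fh)
    high = [v for v in report.get("vulnerabilities", [])
            if float(v.get("severity", 0)) >= THRESHOLD]
    for v in high:
        print(f"HIGH: {v.get('library')} {v.get('version')} - {v.get('title')}")
    return 1 if high else 0  # non-zero exit code fails the CI job

if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:]))
```

In most CI systems, a non-zero exit code from any step is all it takes to stop the pipeline.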
So right when the code is built, you can fail the build and tell the developer: this library has a high-severity vulnerability, go upgrade it or patch it before you ship the software. The fourth thing is to gather data on open source libraries, vulnerabilities, and licenses. This is more for your entire organization or team: you want to know exactly which libraries you depend on, what vulnerabilities they have, and what licenses they use. At a higher level, you can also review what are now called bill of materials reports for what's running in your applications: at any given time you should know which libraries and components your applications are using in production. Most of these tools can generate some kind of bill of materials report covering all libraries across all your applications.

With that, I'm almost at the end of my talk. So what are the golden rules for using open source or third-party software? These are the four key takeaways; if you don't remember anything else from this talk, just remember these four things. First, know what you're using. Like I said, it's very easy for anybody to publish some code on the internet and very easy for you to use it, but you need to know what you're actually using. Second, think about where it came from: who the developer was. I just showed examples of malicious libraries uploaded to PyPI and npm, and it's notoriously easy for somebody to use a typosquatting attack to convince you to use a library that was not written by the person you think wrote it. Third, understand what it's doing, know its behavior: whether it's exfiltrating data, running commands, and so on. And fourth, which is probably the easiest: avoid using vulnerable libraries. If you know a library has a known vulnerability, make an assessment and avoid using versions with known issues in them.

That's all. I think we have four minutes for questions. That's my Twitter handle. If you found this interesting and want to learn more about how to build these kinds of tools, look out for an upcoming book I have, which is all about building security tools for software developers. Thank you for your time.