My name is Sripad Nargora. I'm a senior technical staff member and Master Inventor at IBM Research, currently leading multiple initiatives around supply chain security, and today I'm going to talk about one particular aspect of one particular project. I'm a member of multiple security working groups in the CNCF and OpenSSF, so if you want to catch me, you can find me on Slack or on their mailing lists. I'm joined here today by Caroline. Hi everyone, I'm Caroline Lee. I'm on the CSERA remediation team at IBM, and I'm also a first-time presenter, so I'm very excited to be here. Thank you. So, where do we start? That was the question I was thinking about. As we used to say, all roads lead to Rome; today, every security discussion we start leads to supply chain security. And that's for good reason, because supply chain security is the foundational security challenge we are all facing together today, and the context of today matters here. So what is the main problem? The main problem is that our reliance on open source software today is unprecedented. Some stats suggest that 92% of our applications contain open source software. If you look at a stack, starting from Linux all the way up to the code that we write, everything is made up of open source components. And this is just one dimension of supply chain security; it's not the only one. There are other aspects, like secure configuration and best practices, but today we'll focus on this particular one: open source consumption and our reliance on open source. A lot of the time I get asked: if we have so many problems with open source, can we just stop using it? Open source doesn't mean it's bad. I think open source is the core engine that is going to drive our next generation of software.
I mean, it already is today; it's driving our current infrastructure and applications, and going forward it's going to be the growth engine for our innovations. So open source doesn't mean bad. But, as we heard in the keynote, open source also doesn't mean free. We need to invest in the technology, in educating people, and in building secure infrastructure so that we can securely consume and use this open source software. That's the important thing here. And it's not that we don't know how. In our DevSecOps and CI/CD pipelines, we're already incorporating a lot of these checks. Today we typically start by identifying all the open source dependencies in what we call an SBOM, a software bill of materials, where we capture all these dependencies. Then we perform vulnerability analysis and check whether they are vulnerable. We do license auditing. We do fuzz testing. And once our open source dependencies pass through this pipeline, we build and start using them. So why is this not sufficient? Why do we need something more? I'd like to zoom in on one particular aspect: vulnerability management. If you look at how we manage vulnerabilities today, we typically start using a certain version of a package, package X, version Y. Over time, we discover it has some vulnerabilities. A patch is created and the next version is tagged. A CVE is announced. Then we rush and say, okay, we'll upgrade this particular dependency to the newer version, and we start using the newer version. Two years later, the same cycle again. I call it the vicious cycle of vulnerability management, because our vulnerability management is reactive in nature: we react after the vulnerabilities are announced. Of course, we have our DevSecOps pipeline.
We can argue that before we start using a dependency, we analyze and verify whether it is safe. But in the majority of cases, these vulnerabilities are discovered on day two; that is, once we start using them, we discover them after some time. So if you think about it, we often run our applications with potentially exposed vulnerabilities while they are in use. So is there something we can do proactively? That was the motivation. Before I start using a particular open source package, can I get some indication that there is going to be a problem with it? Of course, we don't have an oracle or a crystal ball that can tell us that two months down the line this particular package is going to be vulnerable; that's not at our disposal. So what can we do to improve our vulnerability management process? One incident that really motivated me happened earlier this year: the maintainer of the colors and faker libraries was able to push bad code into those packages and create a new version. Then everyone using them downstream, including as transitive dependencies, got affected. Major Fortune 500 companies using these libraries were affected; their applications broke. Then I started thinking: fine, we can grant that the maintainer had malicious intent and was able to push bad code, but why did I end up automatically updating my dependencies? In retrospect, we all discussed that some developers didn't have their dependencies pinned, so when they built their applications, the build automatically resolved to the next version. That's a valid point. But even if I had my dependencies pinned, and I said I'm using this particular version, and now the next version is available...
...I still don't have a controlled way to know why I should update, or whether this version is safe to update to. And what I mean by a controlled way is: I don't have the insights. I don't know if best practices were followed for every change that went into this particular release. Do we have any indication of what kind, or what size, of changes went into the release, so we can make an informed decision about whether this upgrade path is safe? That is important, and we'll dive into more detail on this. This was the one very motivating factor that started this particular project. Essentially, what I was looking into was building this controlled and informed way to update, to modernize our update framework, our release framework, so that whenever new versions become available, we know what each version contains and whether it is safe to update to. And one core thing: we don't want to just hand more information to the developer. More information, no. What we need is actionable insight that they can automate and build policies on, so it becomes embedded in their update framework. That's what was needed, and that's how I started this project, Gage: to modernize our release framework in a way that lets us measure release insights and provides recommendations when we are using open source dependencies. And as I was building this project, one very interesting use case came along, from our CISO organization. It was not specifically related to security; it was related to compliance. I'll have Caroline talk more about that. So, many companies have to comply with country-level regulations, not just us, but many corporations.
I'm just going to talk, at a high level, about the notable regulations. The first one is the Trade Agreements Act. For government procurement, it is important to report on the countries our products are manufactured in, and there's a list of U.S.-designated countries that we must adhere to. So that's the first one. The next is the OFAC regulations, from the Office of Foreign Assets Control; these are the sanctions. These regulations enforce trade restrictions for purposes of national security and foreign policy goals. So these are things we have to adhere to, and that is the background for what we help inform with Gage. All right, so how do we use Gage to solve this problem? I'm going to skip ahead one slide for some visual context; I'm a very visual person. We want Gage to give us information on the packages we're taking in, and on the SBOMs we're taking in as well: what are the dependencies in there, and where are they coming from? We're trying to answer this question with the information available out there. The way we do this: the first section here on the left is my GitHub profile. On GitHub profiles, you are free to add your own information: a little bio of who you are, your areas of interest, what company you work for, and your location. That's what we're looking at, and it's totally up to the user to add it or not show it at all. So, to inform on the overall package, we take this location and feed it into an API. The reason we feed it into an API is that oftentimes the location is just a string: it could be as specific as a city name, or something super broad; it could be a country, or just a region. So we want to normalize that using the API, and the API gives us back the country information.
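The normalization step just described can be sketched in a few lines of Python. This is a minimal illustration, not Gage's actual code: the lookup table stands in for the real geocoding API call, and the junk-value list is based on the examples mentioned in the talk.

```python
# Sketch of the location-normalization step: a free-form GitHub profile
# location string becomes a country name. In the real pipeline the string
# is sent to a geocoding API; here a tiny lookup table (hypothetical)
# stands in for that call, and obviously-junk values are filtered first.

JUNK_LOCATIONS = {"", "earth", "planet earth", "127.0.0.1", "¯\\_(ツ)_/¯"}

# Hypothetical stand-in for the geocoding API's location -> country data.
LOCATION_TO_COUNTRY = {
    "austin, tx": "United States",
    "united states": "United States",
    "berlin": "Germany",
    "bangalore": "India",
}

def normalize_location(raw):
    """Return a country name, or None for missing/unresolvable locations."""
    if raw is None:
        return None
    key = raw.strip().lower()
    if key in JUNK_LOCATIONS:
        return None
    # A real implementation would call the geocoding API here instead.
    return LOCATION_TO_COUNTRY.get(key)
```

In this sketch, `normalize_location("Austin, TX")` resolves to `"United States"`, while junk profile values like `"127.0.0.1"` come back as `None` and are excluded from the report.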
We take that information, collect it up, and generate a report for the package, and for multiple packages in an SBOM if that's the context Gage is being used in. Okay, one step back. That's the overall summary: we're trying to determine the country of origin of contributors, and this is just one aspect that builds the bigger picture. The way we start and use Gage: we input a list of countries we're interested in; it resolves contributor locations using the method I described and flags the contributors from those countries. And those results are just one aspect of a bigger picture; they do not disqualify a package or a specific user at all in our judgment. All right, so how well does it work? The accuracy is 95%: if we find a location, we're able to resolve and normalize it 95% of the time, which is pretty reliable. As for the challenges we faced, as I alluded to, oftentimes it's hard to understand what the location even is. It's not as simple as a string match on whether the location has "United States" written in it. Sometimes locations are vague on purpose. There's "Planet Earth"; I think that's actually one of the more popular ones. There's "127.0.0.1". And my favorite, I say sarcastically, is the shrug emoji, because if there are any developers here, you know how hard it is to deal with edge cases and things like special characters. That one was a real fun one to come across after hours of running it. All right, next I mentioned APIs and rate limits; I want to talk about the rate limits a little. We use an API to go from the vague location to the country. First we looked at the OpenStreetMap API; we wanted to keep things open source. However, its rate limit was one request per second, which was really slow.
And for us, sometimes we have SBOMs with 10,000 dependencies, and it's not feasible with a rate limit that low. So instead we use The Weather Channel API. The Weather Channel is actually owned by IBM, fun fact, so it was really helpful to work with that team and use their API. And this is what it looks like when you actually use it. We ran it against a very big package with roughly 6,500 contributors, with an example list of countries to compare against. For each country, there's a summary of the number of contributors from that country, and also the percentage of contributions. It's really important for us to know whether a notable share of the contributions to the package came from there; if it's over 50%, I think that's an exception. That is very important to the teams we give this to. Yeah, thanks, Caroline. And just to be clear: this is not my or IBM's point of view, and this is not something we advocate as a general practice whenever you use any open source software. It's only for special circumstances. As we discussed, if I'm a corporate vendor doing mergers and acquisitions, procurement, or federal contracting, only in those special circumstances, whether we like it or not, these are federal regulations we need to abide by, and there needs to be some automation and technical tooling here. The only intention here is to discuss the technical details of this. That's all. Now, pivoting back to the core use case of Gage: how do we modernize our release engineering, and how do we use Gage as a toolkit to modernize these releases? First, do we even need new tools? There are fantastic open source projects already out there: there is Dependabot, there is Renovate.
I really love these projects, and I use them regularly. They provide new updates whenever a new version of your dependency is available. But can we augment them? As we said, we don't just need to know when new updates are available; we also need to know whether it is safe to update to that version and what went into that particular version. We also have Scorecard, another project that I advocate to my team and personally use. It gives you a point-in-time evaluation of your repositories: it scans for best practices and gives you a score for whether your repository is safe or not. So why do we need a new tool? As I said, there are limitations with point-in-time evaluations and point-in-time scanning. Let's say, and I'm showing you just the timeline here, I enable some checks: I want branch protection and peer review enabled. They were enabled, I ran the scan at T1, and everything passed. Now, if I'm a malicious maintainer, I can disable branch protection, merge some commit, and enable it back. The next time the scan runs, it says, yes, your branch protection is on, your peer review is on. But in between, as you can see, bad commits, or bad practices, have slipped through, and we cannot detect them. And the core problem here is not really the point-in-time execution; it's that GitHub, or any SCM for that matter, does not maintain an immutable record of these facts. If you make a change to the code, you get a commit ID. We don't have the same thing for security settings: if I change some security knob, I don't get an immutable record, some commit ID, that says this security knob has been changed. That, essentially, is what's missing. I call this the temporal loss, because along the timeline we lose some of the security assessment. And the second is the spatial loss.
If I run the scan against a repository's main or HEAD branch, the results might not translate to the individual package version that I use. Say I'm a developer using a certain version of a package from a particular repository. If I run the scan on HEAD, it says, yes, vulnerabilities were discovered. But those vulnerabilities were added much later; I'm using the older version, so I'm really not affected by those scanning results. The scanning result doesn't translate to the version I'm currently using. I call that the spatial loss, because we're looking at the spatial aspect of dependency management. So that's why I said we need a way to bring new insights, new data, to the surface so we can make more sense of it. Do we have more data available? I looked into the GitHub landscape, and it provides so much information: the source code; change metadata in pull requests and issues; security reports; developer insights; configurations; compliance details; release metadata; contributors. So much data is readily available. And there is more data we can get from external sources, like Stack Overflow, that we can marry with this data to derive more insight. We started there. What I'm currently doing in Gage, essentially, is: for every release, I identify the pull requests and commits that went into that release compared to the previous version. Do those PRs and commits have labels? Do they have linked issues, and do those issues have labels? This allows us to quantify and classify these particular changes. Who are the contributors contributing to this? Are they the core maintainers, or outside collaborators? What are the roles of these contributors?
Who are the reviewers? Are these changes being reviewed? Are release notes available? Are these changes and releases signed? What are the stats and insights, like top contributors and contribution metrics? We curate all of this data. And the important thing is not to just throw this data at the developer, but to provide something actionable, something they can act on. With that, let me quickly show you the demo. Okay, let me start this. What I'm doing is running the Gage tool and telling it: I'm using one Python package, Flask, version 2.1.1, and this is the Git repository where it is hosted. We also have some logic to discover the Git repository if you don't provide it. What Gage does, essentially, is go and find out what the next available version is. In this case, it tells us there is a new version, 2.1.2, which is 28 days old, and that you're lagging by only one release. That matters because some financial customers I talked to have policies that they cannot lag behind by more than three major versions for any of their dependencies, so we can build policies around it. And then, importantly, it tells you how many unique contributors contributed to this newer version. And if you look more closely, it has something called zombie commits. These are commits that were merged directly to the main branch: no pull request, nothing; someone just merged something to main, and all the commits that went into this particular release went straight to the main branch. This is a risk indicator. I don't want to use this particular version, because these changes were not reviewed by anyone.
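A policy gate over release insights like these (release lag, zombie commits, review coverage) can be sketched as below. The field names and thresholds are made up for illustration; they are not Gage's real output schema or defaults.

```python
# Sketch of a dependency-update policy gate over release insights.
# The insight dict mimics (hypothetically) what a tool like Gage might
# report for a candidate new version of a dependency.

insight = {
    "package": "flask",
    "current": "2.1.1",
    "candidate": "2.1.2",
    "release_lag": 1,        # how many releases behind we currently are
    "zombie_commits": 3,     # commits merged to main with no pull request
    "reviewed_ratio": 0.0,   # fraction of changes that were peer reviewed
}

def allow_update(ins, max_lag=3, allow_zombies=False, min_reviewed=1.0):
    """Return (allowed, reasons) under a simple, illustrative policy."""
    reasons = []
    if ins["release_lag"] > max_lag:
        reasons.append("lagging more than policy allows")
    if ins["zombie_commits"] > 0 and not allow_zombies:
        reasons.append("unreviewed zombie commits in release")
    if ins["reviewed_ratio"] < min_reviewed:
        reasons.append("not all changes were reviewed")
    return (not reasons, reasons)

ok, why = allow_update(insight)
```

With the sample insight above, the update is rejected because the release contains zombie commits and no reviewed changes, which mirrors the policy idea in the talk: the developer never has to read the report by hand.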
And then the second thing: if I look at the change annotations, the labels I was talking about, it tells us the changes that went in were related to testing, docs, and typing. So I can make an educated guess: if someone changed only documentation or similar, they probably haven't changed the core behavior. If I'm more open to taking some risk, I can say, based on these annotations, that this particular version is fine to use. And again, the idea is to provide this so users can build policies on top of it. They don't need to look into the report every time; they can say, I have a policy that I don't want to update to any version whose changes were not reviewed, or whose changes touch core components, and this becomes automated. The other example I'll show you is TensorFlow, one of the most popular libraries in use. Same flow, and here it gives you the same insight, plus the change annotations for which components changed in TensorFlow. It's a big library; it tells you, for example, that the eager component changed, that core changed, and what size the changes were, from extra small to extra large. And now, as a consumer or developer, I can say: I'm using this particular section of the library, so I do want this update. So again, this is how we're currently using Gage, and I'll talk more about how to employ it. The other thing is, I don't see Gage as limited to open source packages.
It should be extended to other open source currencies, as I call them: any composable, versioned open source artifact you use, like an image or a Helm chart, should be able to use this on top of it. And I want to substantiate these theories with numbers, so we actually went ahead and ran a survey. I'd like Caroline to discuss that. Sure. I helped run a survey of top Docker images over the course of a year. We were looking for metrics such as: how often are Docker images released? What are the dependencies in each release, and how much do they change over time? And how often do security changes go into each release? That was the mindset we had going in. So I got to work, using a variety of open source tools, which were super helpful. Step one: we have the Docker images we're interested in. Next, we used Crane to get a list of all the release tags of each image; we wanted a history of how the releases changed within a year, from May 2021 until now. Once we had that list, we generated an SBOM for each one, using Syft. And finally, we used Grype to generate a vulnerability report from each SBOM. The very last step was to generate a table of all the information we'd gathered: the changes per release, the vulnerabilities, the different dependencies, and the dates of the releases. And this is the table. You can see that the average number of days between releases and the average number of changes vary widely; it's tough to generalize. The gap between releases can be as short as 19 days or as long as 79, and the average number of changes per release can be just a few, or an average of 53, as with Python. And this is just a very small sample.
The actual distribution, I'm sure, is even more varied than that. The last thing we looked at was the percent change in vulnerabilities between releases. Looking at the releases over time and the number of vulnerabilities reported, it was a fairly steady trend. I don't think I saw any increase in vulnerabilities, which is a very good sign, but at most it amounted to an 18% reduction in vulnerabilities between releases. Yeah, thanks. This was really eye-opening. We always take pride that we pin our images by digest: I tag the image with the SHA so I don't have to update it. But then, when the next version is available, I just blindly go get it and tag it with the new SHA. And I don't have the insights: what is changing? If you see 53 packages change, well, not all of those changes are attributable to vulnerabilities. It's not that all 53 packages changed because they were all vulnerable; there were other reasons those packages changed, and we don't know what they were. We're just blindly updating these dependencies, and that's essentially what we need to change; we need to modernize this approach. So how do you put Gage to use? Again, this is a pretty new project; I open-sourced it last month and we are still actively working on it, so I've laid out how we can use it progressively. The first-degree use: I'm a developer building my code, and I just want to use the CLI, which we have ready. I give it a package name, or a list of packages from an SBOM, along with my config, and say: go evaluate and tell me whether these dependencies are fine, or tell me which new versions I can safely upgrade to. The second-degree use case: let's say I'm a security officer and I want to apply some policies across my organization.
I can define some policies, and then whenever a developer makes a change that triggers the CI/CD pipeline, in the CI stage where I generate an SBOM, I can compare it with the previous SBOM, find out which dependencies changed, whether they were added or updated, and run them through Gage to check whether these changes are safe according to my organization's policy: were all the changes in this new release reviewed, were there no zombie commits that went into this release, and so on. I can automate this and enforce it across my organization. That's the second-degree use. The third-degree use, I imagine, is to be part of the package managers. We need to get away from commands like "package-manager upgrade -y". I see a lot of Dockerfiles where the image is pinned and then everything is upgraded with "-y". It doesn't really make sense: I'm pinning my image, but in the next step I'm blindly upgrading all my dependencies. We need to get away from this, and what we can do is extend our package managers with a few more commands: one that describes what updates are available; one that validates those updates against my policy, for example, that there cannot be zombie commits and all changes must be reviewed; and one that recommends, given all my dependencies, which ones I can safely upgrade to and what the safe path is. I think that's where we need to head in open source, so we can be more intentional and more auditable when we update our dependencies.
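The CI-stage check described above, diffing the previous and current SBOMs to find the dependencies that actually changed before running them through policy, can be sketched like this. For illustration, each SBOM is reduced to a name-to-version map; real SBOM formats carry much more detail.

```python
# Sketch: diff two SBOMs (reduced here to {name: version} maps) to find
# which dependencies a change added or updated. Only those need to be
# run through the policy checks, not the whole dependency set.

def sbom_diff(prev, curr):
    """Return (added, updated): new packages, and (old, new) version pairs."""
    added = {n: v for n, v in curr.items() if n not in prev}
    updated = {n: (prev[n], v) for n, v in curr.items()
               if n in prev and prev[n] != v}
    return added, updated

# Illustrative SBOM contents, not from a real pipeline run.
prev_sbom = {"flask": "2.1.1", "jinja2": "3.1.1"}
curr_sbom = {"flask": "2.1.2", "jinja2": "3.1.1", "click": "8.1.3"}

added, updated = sbom_diff(prev_sbom, curr_sbom)
```

Here the diff reports `click` as newly added and `flask` as updated from 2.1.1 to 2.1.2, so only those two would be evaluated against the organization's policy in the CI run.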
This project is open source, and I recently published a blog on Medium that covers the same things I talked about, so if you want to read more about it, please do. Finally, IBM is a big supporter of open source. We're involved in a lot of workstreams, and we're very serious about security, because it matters to us and we know it matters to the community as well, and we need to address it together. If you want to know more, we have our Code Cafe on the fourth floor; stop by, say hi, there's cool swag to collect. We had some Voodoo Doughnuts, I think that was yesterday. So with that: get involved. As I said, we work with a lot of projects in the open source community, so reach out on Slack, email, or GitHub, whichever way you feel comfortable. That's all, thank you. Any questions? "You talked about zombie commits; I just want to learn more about them. How do you gather that information with Gage, and how does it really help with the security concern?" So, zombie commits: again, this is not a standard term, it's just something I came up with, but it signifies commits that have no history or metadata associated with them. The commit is not linked to any PR; it was merged directly to the master branch. What we currently do is: when I look at a release, I know the commit of that release and of the previous release. I make a GraphQL query to identify all the commits that went in between that range and get that data from GitHub. Then, for each commit, I query again to see whether it is associated with any pull request. If it is not, I classify it as a zombie commit. And I think that is one genuinely serious issue with any practice: no one should be merging anything into the main branch directly
without any review or any pull request. Any other questions? Yeah, so, on that: when we assess a release with Gage, we're not looking at location; location is a separate evaluation that we do. But to your point, the only data we have is the data available on GitHub, and as Caroline mentioned, that data is not reliable, but it's the only thing we can work with. We don't have information like the age or other PII associated with the developer. And I personally think that if we're serious about identity, we need to follow a model like Twitter's: it should be the responsibility of the platform to validate and verify users when they onboard. On Twitter, you have a checkmark that says "we have verified this particular celebrity"; something similar should be done at the platform level, by GitHub, to verify identity and provide indications like "we have verified this particular developer, and here is the information you can trust." All we can do downstream is data engineering and machine learning. This is, again, a sensitive topic, identity, but I think the platform should own it. The other thing, which I haven't thought about in detail, but I was listening to a talk earlier about gitsign: if there's a certificate associated with the commits and we can access it, is there something we can do with that? Is there information in the certificate that we can use? If I'm using gitsign to sign my commits, can we explore that further? We haven't done that yet.
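The zombie-commit detection described in the Q&A, querying each commit in a release range for associated pull requests and flagging the ones with none, can be sketched against the shape of a GitHub GraphQL response. The query text is only a sketch of the kind of per-commit query mentioned, and the response data below is fabricated for illustration.

```python
# Sketch of classifying zombie commits from GitHub GraphQL data.
# A commit with no associated pull request was merged to the default
# branch directly, which is the "zombie" risk signal from the talk.

ZOMBIE_QUERY = """
query($owner: String!, $name: String!, $oid: GitObjectID!) {
  repository(owner: $owner, name: $name) {
    object(oid: $oid) {
      ... on Commit {
        associatedPullRequests(first: 1) { totalCount }
      }
    }
  }
}
"""  # sketch of the per-commit query; a real client would POST this

# Fabricated data: commit oid -> associatedPullRequests totalCount,
# for the commits between the previous and current release.
release_commits = {"a1b2c3": 1, "d4e5f6": 0, "0718ab": 0}

def zombie_commits(commit_pr_counts):
    """Commits in the release range with no linked pull request."""
    return sorted(oid for oid, prs in commit_pr_counts.items() if prs == 0)

zombies = zombie_commits(release_commits)
```

With the fabricated counts above, two of the three commits in the range have no linked pull request and would be surfaced as zombie commits in the release insight.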
Yeah, and I definitely agree with all those points. I just wanted to reiterate that the ultimate goal is to comply with the regulations, so we are just performing our due diligence: we look at what's available online, we don't probe further, we do what we can with what's available and move on. We have the one-minute-left sign. Yeah, I think that's all the time we had, so thank you everyone for attending and listening to the talk, and if you want to learn more, keep all that in mind. Thank you.