Hello, welcome to Keeping Up with the CVEs: How to Find a Needle in a Haystack. There will be Q&A at the end of this presentation. Please raise your hand if you are in person and would like to ask a question. After the session, don't forget to rate the session on schedule.com. And now, please welcome Pushkar, our senior security engineer at VMware. Thank you. Thank you, Colleen. Can everyone hear me fine? All right, cool. So thank you, first of all, for coming here instead of going to the attendee party. One question I have for all of you: how many of you are Kubernetes end users? And how many of you work for companies that ship Kubernetes or are related to Kubernetes? So the good news is, this talk is going to be helpful for both groups. So let's start. A little bit about me. I'm a senior security engineer at VMware Tanzu. I was born and brought up in India, then moved here for my studies and have been working here for a while. I worked at Visa as an end user and now work for VMware Tanzu. I've been working on Kubernetes for a few years now, and went so deep into it that I was able to write a couple of chapters for Nigel Poulton's Kubernetes book. I speak different languages, so if anyone virtual wants to ask a question in a different language, that's fine as well. I am generally more available on Twitter than LinkedIn, so if you want to reach me, that's probably your best bet. All right, let's get on with it. So the main question here is: why does this graph look the way it does? If you look, there is more or less a gradual curve of vulnerabilities going upward every year over the last 20 to 22 years. This year still has about two and a half months left, so there is a chance the count might exceed the one from 2020. Now, this is an objective metric.
The reasons why the number of vulnerabilities keeps growing are probably a discussion for another talk, but as part of our TAG Security work we also asked CNCF end users: what do you do to secure your Kubernetes and container environments? What they responded with is, we do image scanning. And I felt, okay, great, that means they're managing vulnerabilities really well. But we asked them another question, and they said managing vulnerabilities is one of their topmost concerns. That doesn't quite make sense to me: if you're doing image scanning, the obvious conclusion would be that you're doing well with managing vulnerabilities. So the question again is, what's really going on? And for that, we're going to have a completely fictional account of what might be happening in the real world. So let's say I'm the CISO of an enterprise customer, and we want to get a product into production. My security engineers scan the different images that the product ships as, and we find lots of vulnerabilities. So the CISO goes and says, why do you have 300,000 vulnerabilities? And then the platform architect and the tech lead are called into the meeting. The platform architect says, that seems an irrational number of CVEs, and then looks at the tech lead. The tech lead starts looking and is not sure: I am not really sure why this is happening, so let me circle back with you. So hopefully by now you're getting the pie jokes that are happening here. After that, the tech lead takes responsibility and says, okay, I need to talk to my customer success manager who is going to ship the product. So the tech lead goes and says, okay, you need to tell me what's going on. The customer success manager says, I'm not sure either.
So everyone is unsure, and the customer success manager then says, okay, I'm going to go and talk to my security engineer, and maybe that security engineer will know what's going on. So there is a delay. The security engineer then hears this and says, well, they should have used a better scanner. So what you're seeing here is basically a complete communication breakdown, where everyone is right in their own way, but the product is not able to go into production and the customer success manager is not able to help the customer. And the CISO cannot say this is fine to go into production, because there are so many vulnerabilities in the report. So what we will actually do now is come up with a hypothetical vulnerability scanner, which looks something like this. I have a scanner which scans container images. It has a database of all the vulnerabilities that are public, and another database with the list of patches for each of those vulnerabilities. The scanner will report CVE IDs, the severity of each, and whether a particular vulnerability is fixable or not. So with this hypothetical scanner, we'll now try to triage the vulnerabilities for those five people who are going through a lot of pain. We'll look at three vulnerabilities, and the assumption here is that I have a Golang application that runs on a Debian image as my base image. I'm scanning that image, and I find three vulnerabilities in it. There were 300,000, but we don't have enough time to go through all of them. So now let's go and look at what the Debian upstream is saying about this vulnerability. Can you all see my screen? All right, cool. So the Debian upstream has this security advisory, and if you see, it says that yes, we don't have a fix for this vulnerability. It's a vulnerability in the tar package. So that's good. And then you also see the urgency here.
It says unimportant, which is maybe good for security, and then if you look in the notes, it says it's actually a crash in a CLI tool and it has no security impact, but there is a CVE ID for it. So what does that mean for me as the owner of that application image? If I look at my threat model, I'm using a Golang app which is natively compiled, so there is just a binary with the Debian base image below it. If there is a vulnerable tar binary inside my image, the only way I as an attacker can exploit it is to exec into the container, run that binary, exploit this CLI tool, and then maybe restart the pod, because I've basically crashed it. But if I can exec into the pod and the underlying container, I can already cause damage in any number of ways, right? So it really doesn't matter whether this vulnerability exists in my container image or not. Maybe this is not a production stopper, even though it seems like a severe enough vulnerability. So let's look at another one. This is another CVE on the Debian base image, and this one has no fixes, absolutely no fixes for any of the releases, which is kind of concerning: I have a vulnerability that I've detected with some severity, I want to fix it, but I can't, because there is no fix available. Now if you look again, the urgency says unimportant, similar to the last one. And if you look in the notes, it says this is not a vulnerability in libgcrypt but in an application using it in an insecure manner. So essentially Debian is saying, it's not our problem. Now if you think about the threat model and the applications we have, my Golang application is using native libraries for crypto. The application is not using libgcrypt at runtime. So the chances of this being a problem, or even of the application using it insecurely, are almost zero.
And again, if you have to exploit this, you'll have to exec into the pod, and then it's the same scenario: you can do something wrong once you have exec'd into the pod anyway. So again, this doesn't seem like something I would want to stop a production release for. Now there is another CVE. This one seems a bit more interesting and probably has some impact, because Debian upstream has fixed it. The urgency is blank, which means maybe it's important. And in the notes, it says that a specific version of OpenSSL and above is impacted. So I go look into my image and find out my OpenSSL version is actually vulnerable in this case. Now I start thinking, okay, maybe I should fix this. How important is it? So I go back to my threat model and what I know about my application. My application, again, doesn't use OpenSSL at runtime; it only uses Golang native crypto. So if I have to exploit anything in OpenSSL, I have to do the same thing again: exec into the container and then exploit it. But because there is a fix available, I might put this into my Jira backlog or create a PR to bump the image with the fix. And if I were the owner of this image, I would probably advise that maybe this is not a really big deal, because of the threat model and the way our application is built. So if you look at all of this, you realize that just because something has a CVE ID, it doesn't mean it is always bad for your particular case. Now let's go back to the slides. So what does that mean for us as users of Kubernetes and users of container images? The usage of the term Zero Trust is almost as big today as supply chain security. But at the same time, some of the things all of us have to consider: image scanners, even though they are great tools, will have their limitations. There will be signal-to-noise-ratio issues. There will be vulnerabilities that don't make sense in our case because of the threat model and what we know about our system.
We will have to manage the risk and reward of: do I bump an image? How much time is it going to take, and what's the value I'm getting? What's the reward if I do this work and spend time on it instead of asking the engineers to ship features? And the second part is: I'm using a Debian base image and I have a Golang application. I really don't need all of what is in that base image, as you saw with those three vulnerabilities. I don't use tar, I don't use libgcrypt, I don't use OpenSSL, but they're all still in my base image. What does that tell us? Maybe we're shipping stuff that we don't need to ship. So what might be helpful, instead of looking for a completely clean, zero-CVE scan report, is really knowing the impact of a vulnerability, like we just did right now. Really understanding: what is my threat model? What is my risk posture for accepting a particular vulnerability in production, whether it's fixed or not? It's equally important to know what to remove, and if I detect an exploit happening in real time, am I able to isolate my workload quickly? Am I able to detect it quickly, so that the attack surface doesn't broaden in case an attack is actually successful? Like I mentioned earlier, this is an end user problem, but it's not only an end user problem. It is equally a problem for the Kubernetes project, or any open source project. The reason is that Kubernetes ships a lot of container images as part of its release. Any guesses how many container images are part of a Kubernetes release? 200, anyone else? 30 is actually pretty close. So the total number of images in any release can today be found by running this command. If you're on a laptop and you want to run it, you'll have the slides; copy it and try running it.
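The command itself lives on the slide rather than in the transcript, but the idea is that the published release SBOM already lists every image shipped with a given Kubernetes version. A minimal sketch of that idea in Go, using a hypothetical, heavily trimmed SPDX tag-value fragment (a real release SBOM is far larger, and a real tool would parse the full document rather than grep for registry paths):

```go
package main

import (
	"fmt"
	"strings"
)

// imagesFromSBOM extracts container image references from an SPDX
// tag-value document by looking for PackageName entries that point
// at a registry path. This is a simplification of what a real SBOM
// tool does, but it shows the idea: the release SBOM already knows
// which images ship with a given Kubernetes version.
func imagesFromSBOM(sbom string) []string {
	var images []string
	for _, line := range strings.Split(sbom, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "PackageName:") {
			name := strings.TrimSpace(strings.TrimPrefix(line, "PackageName:"))
			if strings.Contains(name, "registry.k8s.io/") {
				images = append(images, name)
			}
		}
	}
	return images
}

func main() {
	// Hypothetical SBOM fragment, trimmed for illustration only.
	sample := `
PackageName: registry.k8s.io/kube-apiserver:v1.28.0
PackageName: registry.k8s.io/kube-proxy:v1.28.0
PackageName: golang.org/x/net
PackageName: registry.k8s.io/etcd:3.5.9-0
`
	for _, img := range imagesFromSBOM(sample) {
		fmt.Println(img)
	}
}
```

Running this prints only the three registry image references, skipping the Go module dependency; the same filter, applied to the full release SBOM, is what gives you the per-release image list the talk describes.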
You will find out that there are 25 images. Shout out to the release engineering team and Adolfo, who created this SBOM utility, which allows us to find out, for any particular release, which images are actually relevant. Once we know how many images there are, we can map them to the different components of Kubernetes, and we find that these components are part of every Kubernetes release. Generally, the usual suspects in any Kubernetes cluster are the API server, scheduler, controller manager, and etcd, right? So for the purposes of this discussion we'll look at those along with kube-proxy, and skip the conformance image, to discuss how the Kubernetes project is managing vulnerabilities for all of these container images. Maybe we can learn something from it. So remember this slide? When the release engineering team started working on this, they realized we were bumping images too frequently for things that we don't use. These were the problems they had at that time. But notice that they decided to focus on two things instead of solving for everything. The vulnerability scanners are not really in the control of the Kubernetes project, or any SIG for that matter. But what Kubernetes release engineering does control is how fast we can bump images, and how much stuff we can remove that is not needed in the container images. So that's what they focused on. And the story is actually quite interesting. It started last year, and one disclaimer I want to share: in this particular story, even though I'm helping build the future of this, I'm just a storyteller, and there are a lot of existing Kubernetes community contributors who actually helped make this better. So I'm just going to tell that story, and we'll talk about what the future holds soon. If you have been around the Kubernetes project for any period of time, you will know that if you want to do something new, you have to have a KEP.
And for folks who are not familiar with the Kubernetes project as a whole and the terminology we use, a KEP is a Kubernetes Enhancement Proposal: basically a design doc explaining why we do things the way we want to do them, why this feature matters, what concerns we need to take care of, what the possible risks are, et cetera. Shout out to Yuvin Ma, who actually wrote this KEP. And if you look closely, the motivation behind it is very clear: we want to reduce the churn of continuously bumping images one after another, and see if we can remove some stuff that we don't need. And the best way to remove stuff you don't need is to change your base image to something else. That something else is distroless. For me too, maybe one or two years ago, I had just heard the name distroless and wasn't really sure what it is. Once I delved deeper into it, I realized it is essentially a very minimal base image with a set of files and directories, and it has two flavors. The first flavor, and the more relevant one for us, is distroless static; that's the one we actually use as the base for our Kubernetes images. Static is essentially the minimal set of files and directories needed for a statically compiled Go binary to run in a layer on top. And base sits on top of static: it additionally has glibc, openssl, and libssl, apart from what is already there in static. So once we understand distroless, the problems start becoming clear. We removed a lot of stuff by moving to distroless, but we also removed some good stuff. The two things without which we couldn't get our images to work were bash, and the ability to use log files for my container images. So what did we do about that?
So the first problem, about log files, was solved by writing a basic wrapper in Golang called go-runner. Instead of using shell log redirection in Linux, we use a native Golang binary which does the same thing and doesn't require the dependencies of a Debian base image. You can find more details in the PR linked there. The second problem, bash being absent in distroless, was solved by adding something called bash-static, which is essentially a standalone bash binary that gives the same experience as having bash in a Debian base image. With these two in place, we were able to move kube-apiserver, scheduler, controller manager, etcd, all of them, to distroless with these two modifications. But one of them, kube-proxy, remained. The main reason is that kube-proxy needs iptables. And when the discussion happened about whether to move it to distroless and somehow add iptables, it made sense at the time to the people working on it to keep the Debian image, but remove some unneeded stuff. So it is not the fully bloated Debian base image but a trimmed-down one, and it has iptables. Because of this, though, the side effect is that kube-proxy has remained the most frequently updated image over the last few releases: we have a Debian base image with stuff that we don't need, but we also need iptables. So if you summarize all of this, what you realize is that this is not a perfect solution. Any real-world project, any real-world end user shipping container images, is going to be complex. Your release train is not going to look like this, where everything is exactly identical and all your images are exactly the same. It's not.
It is actually going to look something like this, where you have images that look similar at first, like we saw in Kubernetes, but you also have images of different shapes, colors, and sizes, and you have to take care of each of them equally carefully, solving the problem based on how the container image is built and what its intended purpose is. Having said that, imperfect solutions have benefits. If you now run any of your favorite scanners on an unsupported Kubernetes version that used the Debian base image, from before we moved to distroless, you will find a lot of vulnerabilities, both ones that can be fixed and ones that cannot. But if you scan one of the recent releases, what you will find is essentially no vulnerabilities. What that means is that we have been able to reduce the churn of continuously updating base images to fix things. And the good part is, remember we moved a lot of stuff away, right? Most of the vulnerabilities that your scanner would detect are in stuff that we don't need anymore. Because we've moved to distroless, we haven't had to fix them at all; the problem becomes non-existent, because we've moved to something else. And this is how Kubernetes maintainers feel about it now. Previously they were battered, bruised, and couldn't really keep up with the number of vulnerabilities coming in. But now they can breathe a sigh of relief. They can work on KEPs, improve Kubernetes, improve other parts of Kubernetes security, and have some breathing space to work on stuff that really matters. Now you'll say, okay, this sounds great, but what if I can't use distroless? There is still hope, even if you can't. There are four things I would say we can take care of when we are not using distroless. First: remember the triage we did around runtime impact.
Basically, for CVEs that are not fixable, you are stuck and can't do much apart from some risk-posture or security-impact analysis. So focus on the things that can be fixed, so that at least patches are available, instead of worrying about, and getting sort of paralyzed by, the sheer number of CVEs you have. The second thing you can do is focus on whether the vulnerability is in your code execution path. If the vulnerability is over here somewhere in my base image and my code execution path is over there, then even if an attacker exploits it, the chances of my code actually being impacted are very low. That doesn't mean we wouldn't fix it, but we would probably put it at a lower priority than doing something more impactful for the threat model we have developed for our own applications. The third thing I would say is that once we do this impact analysis as security practitioners, we give some breathing space to the other engineers, who are trying to juggle so many responsibilities. They're trying to add new features. They're trying to make sure the performance and scale of the application are good. And now they have the additional responsibility of: I have to fix these vulnerabilities, because it seems like customers are complaining about them. If, as security practitioners, we can help create that breathing space and say, hey, you know what, only these four vulnerabilities matter out of these 400,000, you should just take care of them, and for the rest I have good reasoning for why they may not impact us, then the engineers can spend that time improving the velocity with which we can bump images. So the next time we have two more vulnerabilities, the time it takes will be less than it took to fix those four, and for the next two, even less. Slowly, we become able to fix things faster every single time there is a new vulnerability.
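The prioritization just described, fixable findings first, then reachability under your threat model, can be sketched as a small decision function. The struct fields and priority labels here are illustrative, not any real scanner's schema:

```go
package main

import "fmt"

// CVE captures the two questions the talk suggests asking about every
// scanner finding: is there a patch available, and does the vulnerable
// code sit in our execution path?
type CVE struct {
	ID         string
	Fixable    bool // upstream has shipped a patch
	InCodePath bool // our threat model says the code is reachable
}

// triage applies the talk's prioritization: reachable and fixable
// findings get fixed now, fixable-but-unreachable ones go to the
// backlog (like the OpenSSL example earlier), and unfixable ones get
// a documented risk assessment rather than panic.
func triage(c CVE) string {
	switch {
	case c.Fixable && c.InCodePath:
		return "fix-now"
	case c.Fixable:
		return "backlog"
	default:
		return "document-risk"
	}
}

func main() {
	findings := []CVE{
		{ID: "CVE-A (openssl, unused at runtime)", Fixable: true, InCodePath: false},
		{ID: "CVE-B (libgcrypt, no fix available)", Fixable: false, InCodePath: false},
		{ID: "CVE-C (in our execution path)", Fixable: true, InCodePath: true},
	}
	for _, c := range findings {
		fmt.Printf("%s -> %s\n", c.ID, triage(c))
	}
}
```

The point is not the code itself but the shape of the policy: only the "fix-now" bucket interrupts feature work, which is exactly the breathing space the talk is arguing for.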
The last thing I would say, in terms of what the Kubernetes project did, is: create a list of images that need special care. kube-proxy is a good example. We have a list of images today where we couldn't use a distroless base, and as a result, we know which ones to focus on, which ones are going to be updated less frequently, and which ones more frequently. Finally, like I promised, this is not the end of improving vulnerability management in Kubernetes. We have recently done a couple of things: we now have automated jobs that run every six hours and look for two things. First, do I have vulnerabilities in my build-time dependencies? Essentially, is anything in the go.mod file for Kubernetes vulnerable, with a known CVE? And second, we scan all the Kubernetes component images we just looked at, and again try to find out whether there are any vulnerabilities, either in the base image or in some other layer. The result, and the hope, is that we'll be able to triage these quicker, and essentially reduce the mean time to detect a known vulnerability and the mean time to remediate a detected vulnerability. If that interests you and you think you can contribute or give some feedback, hop onto the SIG Security tooling channel; it's hyperlinked here and will take you directly to the channel on the Kubernetes Slack workspace. Just say hi, we'll reach out to you, and we'll figure out how you can contribute. Lastly, if there is one thing you remember out of this whole talk, it is: manage vulnerabilities by being vulnerable. We have to be honest and do our homework about what base images we are using. Do we know, for all our applications, which base images we are using? Do we know if we really need those base images for some applications? Can we standardize on base images? The second thing we can do is an impact assessment: know our systems, know our threat model.
And what that is going to do is help us create that breathing space we talked about. Another thing: we have to be honest and humble. If you've been around the cloud native space for a while, you'll remember that before vulnerability scanners existed, it was very tough to figure out whether an image was vulnerable or not. Vulnerability scanners have pretty much solved that problem, far better than what came before. But even though they are fantastic tools, and there is great work done by so many people in the community, they will have their limitations, because this is non-deterministic analysis: you need context about a vulnerability and whether it impacts you. We need to acknowledge that, and we need to create some time in our sprints, if you follow Agile, to understand the impact of the vulnerabilities coming in. And lastly, be empathetic with your engineers. Understand that they are juggling many responsibilities, and if you can help them focus on just the vulnerabilities that matter, it might get you some brownie points, and it might help you ship features that actually create business value. With that, I'll open it up for questions. We have one question there. How different are images like Alpine, which are already pretty thin, from distroless? Did you compare Alpine and distroless at all? So let me repeat what I understood. The question is: how does Alpine compare to distroless, and was there some analysis done by Kubernetes on whether Alpine would be a good alternative to distroless? Is that right? Okay. Right. So my understanding is there is some historical context on Alpine, and I will dig up the PR later if you're interested. There were some attempts to use Alpine; I don't exactly know why or whether that was dropped. But in general, as a concept, if you can reduce your base image and make it as minimal as possible, that's always good.
So if you have less stuff, you have less stuff to fix, because there are fewer vulnerabilities, because you have less stuff installed in your base image. All right, anyone else? Do we have any virtual questions? Oh, there is one more question in person as well. Right there. Thank you so much, great talk. I did have a question regarding how far you recommend users go toward, for example, removing the operating system from the container entirely. For example, with Alpine: taking the base image and compiling the application so it runs almost as a micro-container, exposing only the application bits and removing anything else that could potentially be a threat. How much would you recommend application owners, or people in the community, get as close as possible to shipping only what you're going to use in production and sacrificing all the bloat that can potentially bring vulnerabilities? Where do you see that line drawn? Yeah, that's a good question. Like we discussed, we will end up with an imperfect solution, but it will be good enough for the use cases and things we are working on. My suggestion, based on my observation, would be that distroless is probably the closest to just running your binary as-is in a container image. So if you can work with that, it's probably going to save way more churn than any minimal OS base image. And full-blown OS images are a different matter altogether. So I would say, go for whatever has the least number of files. What I actually did for the distroless image was save the image itself as a tar file and then untar it. If you look inside, it's essentially a small set of directories and files. But if it's an OS image, there is a chance it will have much more, even if it's a minimal OS. So that would be my recommendation: don't look at whether it's megabytes or gigabytes, look at what's inside, and how much of that you need to continuously upgrade when you find vulnerabilities. Yes.
Are you aware of evaluation frameworks for picking base containers, either in the Kubernetes project itself or elsewhere? Sorry, I missed the first part? Evaluation frameworks, for picking base containers, either in the Kubernetes project itself or elsewhere? I would be interested to know about those. I don't know the specific term "evaluation frameworks"; maybe I know it by some other name. So whoever asked that question, the best way is probably to have a chat with me on Slack, if you are on the Kubernetes Slack, or my Twitter handle is on the slides, so you can reach out there, and I would be happy to learn more. All right, anything else? I think we're close to finishing time. Yeah, okay. Cool, all right, thank you everyone. Thank you.