Hi, Joe. Hi, Robert. Thanks for joining me today. We're here to talk about the Kubernetes Benchmark Report for 2023. I'm Danielle Cook. I'm a VP at Fairwinds, and I'm also a co-chair of the CNCF Cartografos Working Group, where we publish the Cloud Native Maturity Model. I'll let Joe and Robert introduce themselves. Yep. My name is Joe Pelletier. I head up product here at Fairwinds, and I've been in the cloud DevOps space for about 10 years. And Robert? Yep, I'm Robert Brennan. I'm VP of product development here at Fairwinds, and I lead the engineering efforts on our commercial platform, Fairwinds Insights. Awesome. So, what we're here to talk about: Fairwinds has put together a Kubernetes benchmark report. We looked at 150,000 workloads, evaluated them against different configuration settings focused on security, efficiency, and reliability, and brought you this report. Robert, do you want to talk a little more about how we actually pulled that data together? Sure. We examined, I think it's several hundred thousand workloads, across many different organizations. Some big, some small; some big clusters, some small clusters. A wide variety of deployment models. We scanned all those workloads for common configuration issues, which we've identified over time, and then aggregated those results to figure out what percentage of organizations have, say, left the majority of their workloads able to run as root, or haven't set CPU and memory settings, things like that. Awesome. And the big takeaway from this year's report, which we were able to compare and contrast against a year ago: we're not doing well as an industry. We need to do better. So we're going to dig into these findings now.
We're going to look at how you compare: see what you're doing, and see if your configurations follow these best practices. It's an interesting set of findings. So, jumping in quickly, let's go through how to read the data, so that if you're looking at this you can pause and dig into the figures. We have the blue column, which is the 2022 findings, versus the orange, which is 2023. The top row will say, okay, 46% of organizations have zero to 10% of their workloads impacted. I know that's a little bit of a mind-bender, so Robert, why don't you explain it a little better. Yeah, so basically every organization gets put into one of these ten buckets. If an organization has very few of its workloads impacted, say zero to 10%, it goes in that top bucket. That's a really good place to be; you want to be in that top bucket. If almost all of its workloads are affected, it goes in that bottom bucket, where 90 to 100% of workloads are affected. Some organizations are in that bottom bucket, and that's a place you don't want to be. It means pretty much every workload you're running is affected by the particular check we're looking at. Awesome. And if you're glancing at these and you see that the 2022 column was really big for that first zero-to-10% bucket and then it's gotten lower, that's bad. That means we are doing worse, not better. So we're going to start with our first category, reliability. In the graph we were just looking at, we can see that workloads are missing memory limits and requests, and that's a problem. So, Joe, do you want to talk about why that's a problem?
Yeah. I think this is one of the more important checks to add to your Kubernetes configuration right from the start. When you create a new configuration for a workload and go to deploy the application, you want to make sure that you at least have the memory settings set: the requests, as well as the limits, meaning the maximum memory that the workload should use. What we're noticing is that over time, organizations are adding more containers and more workloads to their Kubernetes clusters, but oftentimes forgetting to do this. So the percentage of workloads that are impacted has increased over time, and you're seeing more organizations neglect this from the start. There's actually some great open source tooling that can help you address these issues, so there are opportunities to close the gap here quickly, but we are seeing things get worse over time. So, say I'm a business person coming to this and reading it. Why should I care about missing memory limits and requests? An engineer might care, but why should I, the business owner, care? Yeah, at the end of the day you are consuming cloud because you're trying to get your applications out to customers faster and enable your teams to ship faster, and cloud provides you with that flexibility and scalability. But nowadays it can be very expensive if you don't manage cloud resources correctly. And one of the first things you want to do is make sure your applications are operating within a bound of appropriate memory use.
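As a rough sketch of what those settings look like in a manifest (the container name, image, and numbers here are illustrative assumptions, not recommendations; the right values depend on your application):

```yaml
# Fragment of a Deployment's pod spec; names and numbers are hypothetical
containers:
- name: example-app
  image: example-app:1.0.0
  resources:
    requests:
      memory: "256Mi"   # amount Kubernetes reserves when scheduling the pod
      cpu: "250m"
    limits:
      memory: "512Mi"   # hard ceiling; the container is OOM-killed if it exceeds this
```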
Because if it's unbounded, or you're not setting this configuration, Kubernetes has to guess, and sometimes it can over-allocate resources because it is guessing, and ultimately you're spending more in the cloud than you need to. Over-provisioning is a very large problem; we've seen that on average customers are over-provisioned by at least 30%. So that's one of the things we're hoping organizations do: put memory requests and limits in place to better manage their Kubernetes and container spend. Awesome. All right, let's move on to the next one, which is around missing liveness and readiness probes. Here you can see that again it's getting worse: 83% of organizations are not setting these. So, Robert, let's talk a little about why these are important. So liveness and readiness probes are basically how Kubernetes can tell whether your application is healthy and ready to serve traffic. It's super important to give Kubernetes some kind of signal as to whether it can start sending traffic to your application. If your application does experience a problem, say the database goes down, or a particular pod loses its connection to the database, Kubernetes is going to stop sending traffic there, making sure your customers don't experience that outage, and it'll restart that pod to get it back into a healthy state. And also, if you're issuing a new deployment, say you're updating your source code, then as your deployment rolls out, if it takes 30 seconds for your application to go from the binary starting to actually being ready to serve traffic, your customers are going to experience a 30-second blip unless you've specified a readiness probe, where the pod can basically say, hold off on sending me traffic, because I'm not ready yet.
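In manifest terms, the two probes Robert describes might look roughly like this (the endpoint, port, and timings are illustrative assumptions):

```yaml
# Fragment of a container spec; paths and timings are hypothetical
containers:
- name: example-app
  image: example-app:1.0.0
  readinessProbe:             # gates traffic: the pod receives no requests until this passes
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 5
    periodSeconds: 10
  livenessProbe:              # restarts the container if this starts failing
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 15
    failureThreshold: 3
```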
Well, we all know what 30 seconds feels like; my attention span is gone if something doesn't load. All right, let's move on to the next one, which is around deployments missing replicas. So, Joe, take this one away. I'll take this one away and get things started. So, missing replicas is actually a new check that we've added to this year's report, and we're finding that 25% of organizations are running over half of their workloads without replicas. One of the advantages of Kubernetes is the ability to quickly scale containers, so Kubernetes can help you respond to new inbound demand for your application if you have lots of users accessing it, with new workloads spinning up quickly to address a pipeline of work. What replicas do is help with that horizontal scaling in Kubernetes, so you can make sure there are enough copies of your application running to handle that load. And there's generally a recommendation around having more than one replica, so Robert, can you share a little more about what the general best practices are for replicas? Yeah, typically we advise folks, for a production application especially, to have at least three replicas available. That way, any two of those pods can die at any point and the third one can continue serving traffic. This particular check is looking at whether you have at least two replicas set. If you only have one replica, then you're at serious risk for downtime, because if anything happens to that one replica, Kubernetes is going to kill that pod and restart it.
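The fix itself is a single field on the Deployment. A minimal sketch, with a hypothetical name, following the three-replica advice above:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app   # hypothetical name
spec:
  replicas: 3   # any two pods can die and the third keeps serving traffic
```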
And that could even be just normal operating procedure for a Kubernetes cluster: a node might go away and get replaced by another node. Kubernetes is supposed to be fault tolerant; there are aspects of the environment that are known to be noisy, or are deliberately noisy. So if you don't have more than one replica set for a particular deployment, there's a very good chance you're going to experience blips of downtime. So those are our reliability findings. We're going to move on next to security. Here we can see our big statement, which is that 33% of organizations are running 90% or more of their workloads with insecure capabilities, and that's not good. As we dig into that data, we can see the findings on the workloads that are impacted, but I think we should start with: what is an insecure capability? Yeah, I'll let you kick that off, Robert. Well, in Kubernetes there's a lot of configuration that goes around your Docker container as you're putting it into the cluster, and there are a lot of security settings involved in basically telling Kubernetes how to run this container. There are ways you can give that container extra permissions for what it's allowed to do and how it's allowed to interact with the host node that's running it. One thing you can do is attach Linux capabilities to that container, if it does need some kind of extra access to the host node. Typically, your average API server, your average dashboard, whatever it is that your application teams are deploying, is not going to need those extra capabilities. They really just need to be able to serve traffic over the internet. Unfortunately, a lot of these capabilities are added by default, and they need to be explicitly dropped if you don't want them set for your container.
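Dropping those default capabilities is done per container in the manifest. A common hardened baseline (a sketch, with a hypothetical container) is to drop everything and add back only what a workload truly needs:

```yaml
# Container-level securityContext fragment; container name is hypothetical
containers:
- name: example-app
  image: example-app:1.0.0
  securityContext:
    capabilities:
      drop:
      - ALL   # remove all default Linux capabilities; add individual ones back only if required
```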
So we recommend that for most workloads running applications, these capabilities be dropped. And you can see here in the data there's a pretty bimodal distribution, where some organizations have done this for the vast majority of their workloads, and they're in that top bucket, and some organizations have done this for basically none of their workloads. So you can see there are some organizations that take this very seriously and are making sure to drop these capabilities wherever possible, and there are other organizations that are apparently just unaware that this is a thing and have left the default configuration for all their workloads. Well, and I think what's interesting here is that in the 2022 report we saw 42% of organizations covering this, whereas that dropped all the way down to 10%, which means people are again doing worse, not better, on this. And I think sometimes what you find in this particular area is that organizations rolling out Kubernetes across large parts of the organization may end up creating templates that allow for repeatability of configuration. But what can happen is that if those templates aren't aligned to best practices, you can actually propagate these misconfigurations, whether they're security oriented, reliability oriented, or cost oriented. So these can become widespread problems quickly if you're not making sure that that root template is following security best practices. And this is an area that's easy to miss, right? A lot of times folks are trying to just get workloads to run, and securing those workloads isn't something you notice right away. Things look like they're running smoothly, but security is sometimes an afterthought.
And so I think we're seeing that organizations are having to pay more attention now to this misconfiguration so they can make sure things are a little more hardened in the future. Yeah, totally. And I'll add: I think having some kind of policy enforcement in place matters here. This is a thing where you might get yourself to 100% compliance today, but six months from now people are onboarding new workloads and not necessarily following those best practices. So make sure you've got some kind of policy enforcement mechanism in there to say, hey, anything new coming into this cluster has to be dropping these capabilities unless it's got some kind of carve-out. Awesome. Well, the next security finding is all about writable file systems, so let's first define what that is. Robert or Joe, either of you can take it away, and then we can talk about the data. Well, I'll kick it off. With writable file systems, for containerized workloads in general the best practice is actually to have the file system as read-only. A writable file system can open up additional risk to that container, but Robert, help us understand what some of those risks are. Why is it a best practice to be read-only here? Yeah, so from a security perspective, a read-only file system really hampers any would-be attacker: they can't download an external script, they can't install new software. Their hands are really tied in terms of what they can do inside the container if they do manage to break in. So it does a great job of just cutting down on what an attacker is able to do. There's also the added reliability benefit, where you know you're not accidentally keeping some state on the disk.
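Making the file system read-only is again a single securityContext field, set explicitly per container. A sketch with a hypothetical workload:

```yaml
# Container-level securityContext fragment; names are hypothetical
containers:
- name: example-app
  image: example-app:1.0.0
  securityContext:
    readOnlyRootFilesystem: true   # must be set explicitly; the root filesystem is writable by default
```

If the application genuinely needs scratch space, the usual pattern is to mount an emptyDir volume at just that path while keeping the rest of the filesystem read-only.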
If it's read-only, then you know that every time you spin up a pod you're going to get an identical environment. It just helps you ship with a little more confidence. And it's worth noting that this is an option that is not set by default; the file system is writable by default with Kubernetes. You have to explicitly add configuration saying, I want my file system to be read-only. Again, you can see kind of a bimodal distribution, where there are some organizations that care very heavily about this and set it for the vast majority of their workloads, and some organizations that just seem to be unaware this is even an option and are leaving the vast majority of their workloads impacted. And the stat here says 56%: if you take the buckets from 70 to 100% of workloads impacted and add those together, that's 56% of organizations who are just not doing it. So that needs to change. All right, moving on to our next one: privilege escalation allowed. So, Joe, do you want to give a definition of this one and talk about why it's important? This is one that we see is pretty prolific, right, with similar statistics around a bimodal distribution. Privilege escalation is one of those areas where, fortunately, it's an easy fix; it's a matter of just switching the flag from true to false. But with privilege escalation allowed, it really lets a container escalate its privileges, as you'd expect, exposing the container to more security risk. To get into a little more detail around why that's bad: Robert, do you have any examples of why that would be a bad thing?
The big thing is, if somebody does manage to break into that container, or escape the container, it basically gives them a whole bunch of extra permissions on the host node, which would allow them to spread their attack throughout the cluster. Basically, a lot of the security configuration we're talking about here is all about limiting the blast radius of an attack. We're assuming some attacker has found a hole in one application and gotten into that container, and all this extra configuration being layered on top is keeping them in that container and restricting what they can do from within it. So this is yet another layer of protection you can put in place to make sure an attacker is really unable to do anything outside of that one application. Well, and one thing to mention on this: Joe, you said this is an easy fix, and sure, it's an easy fix if you have a few clusters. But when you have multiple clusters and multiple people using your platform, the problem isn't necessarily that it's a hard fix; it's whether it's scalable. Are you auditing your clusters to make sure this is done? Can you audit it, and can you implement controls that ensure the right best practices are in place? I think that goes back to policy enforcement, but there are also things like mutating admission controllers that can help enforce these good practices as well. Awesome. So, next one: run as root. Now, there was a vulnerability around this in the last year, a little longer than a year now, which involved this being exploited. And we can see here that there are a lot of organizations that are still allowing root access. So, you were going to chime in?
This is similar to the privilege escalation one, where fortunately it's an easy fix, but the number of applications out there that are running as root by default is way too many. In some scenarios you might need a certain workload to run as root depending on what it does, but for the most part, I think this is a general best practice you can apply. You mentioned there was a vulnerability; I think we can share the link after this webinar. Back in 2021, there was a vulnerability where, if you had a workload running as root, the vulnerability could be used to gain additional access to the system, and one of the ways to mitigate it in the near term was making sure your workloads weren't running as root. You were able to add that defense in depth to your existing system to thwart the attack. So I think you're going to find that this is an important defense-in-depth layer, right? If there are vulnerabilities that need root access to the system in order to do their nefarious things, being able to make sure your workloads don't have the ability to run as root becomes very important. Great. So, next up, and we have just two more on security, this is around image vulnerabilities. 62% of organizations have this happening on 50% or more of their workloads; that is a lot. So, Robert, talk to me about image vulnerabilities. What are they, why should I care, and why should I be scanning for this? So every container image that gets shipped into your cluster has not only your built application binary inside of it, but also a bunch of other software that might be being used. It's typically got a whole operating system. And all that software getting packaged up inside that container can potentially be vulnerable to an attack.
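Before digging into vulnerabilities, here is what the last two fixes, disallowing privilege escalation and refusing to run as root, look like together in a container's securityContext (a sketch; the container name and UID are illustrative):

```yaml
# Container-level securityContext fragment; name and UID are hypothetical
containers:
- name: example-app
  image: example-app:1.0.0
  securityContext:
    allowPrivilegeEscalation: false   # the flag to switch from true to false
    runAsNonRoot: true                # the kubelet refuses to start the container as UID 0
    runAsUser: 10001                  # an arbitrary non-root UID
```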
A lot of actual exploits in the wild involve a known vulnerability in a widely used piece of software, like PHP or WordPress or something like that. The attacker basically knows about the vulnerability, because everybody knows about the vulnerability. They see that you're using that old version, and they just jump in and exploit it. So it's super important, for every container that you're putting into the cluster, to be scanning that container and understanding what's inside of it, what version is running, and whether that version is known to be vulnerable. And if it is, you need to make sure you update the container and ship that update. This is a very hard thing to stay on top of; new vulnerabilities are getting announced every day. So you really have to be doing these scans continuously and making sure you're routinely shipping patches and upgrades. It's super important to do. All right. And so, last up on the security front, we looked at outdated Helm charts and outdated container images; the container images check is new to the report this year. It's the graph at the bottom that really shows the split: some people have it covered, and some people do not have it covered at all. And the Helm charts are all over the place. Joe, jump on it. Yeah. In cloud native, you're having to make sure you're running the correct version of a container at any given time, and you might want to upgrade containers so customers can take advantage of functionality, but you may also need to upgrade to resolve vulnerabilities. Like in the previous slide, a lot of containers contain known vulnerabilities, and upgrading to a newer version is often a way to reduce your vulnerability risk.
And I think the unique thing about the Kubernetes and cloud native landscape is that there are lots of third-party Helm charts and third-party containers that are needed to run your cluster. This is sort of the equivalent of patch management for the cloud native world: making sure that your third-party containers, the different add-ons that are necessary for running critical services in your cluster, are kept up to date. That's important not only for the reliability of your system but, going back to that security aspect, because they may contain known vulnerabilities that you, as the consumer of a third-party image, can't fix directly, and so you have to update to the newer, less vulnerable version. All right, so moving on to our cost efficiency area. This is something that is increasingly important this year; companies are looking at what they're spending on cloud and how to reduce it. So here we're looking at configuration around memory limits being set too high. Jumping into the first graph, which is around memory limits too high: Joe, do you want to take this one? Yeah. I think what we're finding with memory limits being too high is that, ultimately, this is a signal that organizations are generally over-provisioning their applications. We talked about the importance of setting memory requests and limits, the low end and the high end of memory usage, and they're being very generous on the high end, which means Kubernetes may end up consuming more cloud resources than it needs to. And while this is an important issue, there's also the issue of folks provisioning too much in memory requests, and that's where Kubernetes reserves
an oversized amount of memory. Yeah, that's the exact slide there. It's reserving an oversized amount of memory that the application doesn't actually need. Ultimately this just means more resource usage, and more resource usage in the cloud means increased cost. I think it reinforces some of the earlier comments we made around the importance of getting these settings right. Well, and oftentimes, if you have created a platform for your developers to use, they might not be thinking about this, but you're the one the finance team is going to come to and say, our cloud bill is high, what is happening? So being able to dig in and put some guardrails in place for your developers, so they can set the right memory requests and limits, is really important. And I think we found that there are actually a lot of people who are fixing this now, and that's really good. So, Joe, talk to me about this. So what we're seeing in general is organizations embracing Kubernetes more and more every year, at massive scale; they're going from one team to many teams. Kubernetes is becoming a great way to get applications running quickly and help teams ship faster, but there is still this theme of neglecting the best practices around security, cost efficiency, and reliability, which can become problematic later on, maybe not immediately. And one of the areas that folks are actually fixing is this: we're seeing a class of organizations really implement policy and guardrails intentionally, making this feedback loop to developers. And one of the first classes of issues they're fixing is around those memory and CPU requests and limits that we talked about earlier.
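As an illustration of what over-provisioning versus right-sizing can look like in a manifest (all numbers here are made up for the example):

```yaml
# Over-provisioned: Kubernetes reserves far more than the app ever uses
resources:
  requests:
    memory: "4Gi"    # actual observed usage might be closer to 300Mi
  limits:
    memory: "8Gi"

# Right-sized after observing real usage, e.g. with a recommendation tool
resources:
  requests:
    memory: "384Mi"
  limits:
    memory: "768Mi"
```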
The good news is that there are tools coming out, including an open source tool from Fairwinds called Goldilocks, that help provide those specific recommendations for CPU and memory, so developers don't have to guess; they can put the right values in place from day one. And you're seeing this class of issues being fixed faster. I think over time we're going to see more context-sensitive issues, like the security issues, where applications may or may not need certain privileges or certain security contexts enabled. Those will probably be the next wave of things to be fixed, but it's great to see that there is some progress in general happening around the CPU and memory settings. Well, and overall what we're seeing, too, is that organizations are trying to look at these issues we've gone through, from reliability to security and cost efficiency, earlier in the process, in shift-left scenarios and with admission control. They're finding and fixing them earlier, and that's really good: issues are not getting into production, and you're not having to waste time after the fact trying to fix these problems. Yep, exactly. Okay, so those are our findings from the benchmark report. Now we're going to do a plug for Fairwinds Insights. We have a free SaaS product, Kubernetes governance software that helps scan for all of the problems we were talking about. If you're creating your internal developer platform, you want to create a paved road for your developers, so they can get to the end without any issues, without any bumps in the road, and that's what Fairwinds Insights really helps you do. So, Joe, do you want to take us through a little bit of it? Yeah, I think you're absolutely right. At the end of the day, with the emerging role of platform engineering, their whole job is to enable teams to ship faster.
And in larger organizations, making sure that teams are consistently following the best practices, but are still able to ship their applications to Kubernetes and get feedback in a timely way, becomes critical. Kubernetes governance is really an important capability of an internal developer platform. And what we bring to the table is three key capabilities that organizations are adopting at scale. The first is security feedback, and that's not just security feedback after it's shipped, though we do provide that; it's also feedback right at the time of pull request, so those development teams, or the DevOps engineers helping get applications running, understand the best practices they should follow right from the start. The second, and in 2023 this is incredibly important, is being able to manage cloud costs. Again, Kubernetes has gone from one team to many teams in many companies as adoption has soared, and it's important to bring visibility into container costs, because that is not something that comes natively or naturally with the current tooling out there. So being able to measure how much containers cost, and also get recommendations on those CPU and memory settings that we talked about, becomes another critical capability for the development teams or platform engineering teams who ultimately have to fix these issues. And finally, Fairwinds also provides a suite of guardrails. Guardrails are really the core of this feedback loop. At every stage of the process, whether you're editing your configuration or infrastructure as code, deploying your app, or trying to investigate why an app is not working correctly, guardrails and policy feedback are important for helping teams understand what they need to do now, and go faster.
So what you're going to see in our platform... I don't know if we have any of this on the next slide. No. Okay, yeah, I can actually share a quick demo if that makes sense. That does make sense. Let's do that. And while Joe brings it up, I'll just say: you can get a copy of the benchmark report by visiting fairwinds.com, and you can also try the platform by visiting fairwinds.com and signing up for free. Yeah, awesome. So what you'll see is that the Insights platform first organizes all of this feedback, for developers and for platform teams, into three categories, the same ones you've seen in this presentation: security, reliability, and efficiency. And you can also narrow in on specific tools, like Fairwinds Polaris, to understand the types of issues that are affecting your workloads. So you'll see some of the recommendations we talked about here today show up. The thing we're seeing lots of organizations turn on today is this whole notion of repository scanning, or infrastructure-as-code scanning: being able to integrate right at the time of pull request to detect things like image vulnerabilities, or to help you identify workloads that might be missing labels, which are important for cost allocation. So a variety of types of checks can emerge here where feedback is critical for those developers. Fairwinds Insights really helps drive a lot of this feedback loop and helps you plug important capabilities into your internal developer platform. And within Insights there are a number of open source tools, like Polaris and Goldilocks, which we've mentioned. So there are lots of open source tools that you can be trying and using. We are huge advocates of open source, so you can check that out too.
And here are some of the links again. You can get the full benchmark report to see how you compare in more detail, and you can try out Insights. So thank you, Robert. Thank you, Joe. Thanks for chatting with me today about all of this. Any final words? No, thank you. Yeah, good discussion. Thanks.