Hi, KubeCon. Hi, Noaa. How are you? Hi, Shimon. How are you? I'm very excited to be at KubeCon. Even though it's virtual, it's still really exciting. Super exciting. Thank you all for coming and joining us. So we're going to share our presentation with you and talk about what we've learned from reading 100-plus Kubernetes postmortems. And from experiencing them, unfortunately, but it happens to everyone. But first, since it's the first time that we meet: hi, everyone. My name is Noaa, Noaa Barki. I've been a full-stack developer for more than five years. I'm also a tech writer and one of the leaders of the GitHub Israel community, which is the largest GitHub community in the whole universe. Universe. So hi, everyone. I'm Shimon. I'm one of the co-founders and the CEO of Datree. I'm also an AWS Community Hero, but we're here at CNCF, so: I'm one of the co-organizers of CNCF Tel Aviv. Out here in Israel we have a vibrant community, and if you happen to stop by, you should definitely come and see CNCF Tel Aviv. So, a little bit about us and how we ended up here talking about Kubernetes postmortems. We're a startup company, and what we actually deal with on a daily basis is helping companies prevent misconfigurations from reaching production in Kubernetes environments. This means we have an open-source CLI that can run on your laptop or in your CI/CD against Kubernetes manifests and Helm charts, and it can detect misconfigurations such as a missing CPU limit, liveness probe, readiness probe, or required labels. We'll talk about some of those things that you should definitely apply. But from working with the community and working with our customers, we were able to see a lot of the incidents that happened and to learn a lot from those postmortems.
So here at Datree, policies are what we do, and we integrate directly within the development pipeline, so we get to observe a lot of the mistakes that happen. Today, hopefully, we'll educate you a little bit about how you can prevent and avoid those misconfigurations that can lead you into a possible outage or security incident. Yes, you're very correct. And I want to add that as a developer at Datree, part of my job was not only to understand Kubernetes and how it works under the hood, but also to understand how my users can blow up their own clusters. Usually we would talk about a postmortem in a postmortem way: discuss the event and the lessons that we learned. But today, especially today, I want us to go over all the postmortems and to invite you to my very own private show. What is the mistakes game show? Oh boy. Are you ready? I hope so. It's a mistakes game show. Let's see. So let me explain what we are going to do. I'm going to show you a specific resource, and you will need to guess: where is the mistake? What is the mistake, if there is one? And let's begin. Okay, let the show begin. So this is the default cron job configuration. What is the mistake here? Well, I guess with cron jobs you always mess up the schedule timing, so I'll go with that. So you think it's the schedule? I guess it's the schedule; I don't know. You always mix it up, right? You always go to this website where you try to debug it and see if it's the right one. No, it's not the schedule. It's the concurrency policy. We always want to make sure that we set the concurrency policy to either Forbid or Replace, because when we set it to Allow, whenever a cron job run fails, it will not replace the previous one; it will create a new one, a new one, and a new one. And this is actually what happened to Target.
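To make the lesson concrete, here is a minimal sketch of a CronJob spec with the concurrency policy set explicitly. The name, schedule, and image are illustrative, not from any of the postmortems:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-job          # illustrative name
spec:
  schedule: "*/1 * * * *"
  concurrencyPolicy: Forbid  # don't start a new run while the previous is still running
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: example
              image: example:latest  # illustrative image
          restartPolicy: OnFailure
```

Forbid skips the new run if the previous one is still going; Replace cancels the running job and starts a fresh one; the default, Allow, lets runs pile up.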
So basically here, if you can go back one. If I read it correctly, they spun up this cron job every minute, and I guess what they wanted was a long-lived service. Yeah. Oh, and then it just continued spawning more and more. Yeah, you want to know how many is more? Okay, try more than 4,000 more pods that were constantly restarting. That's what happened to them: they had one failing cron job that was constantly restarting new pods. Not only did it immediately take their cluster down, it also cost them a lot of money, because their API server accumulated thousands of pods, wasting CPU and resources. Yeah. And the lesson that we learned here is to never, never, ever trust the default configuration. The lesson is not only about cron jobs. Yes, yes. Just because Kubernetes allows you to deploy a specific resource with a specific configuration, it doesn't mean that this is what you should do. And from my experience, in most cases, the default is not the configuration that you should work with. But let's move forward to the next one. This is another cron job configuration. What is the mistake here? So you taught me about the concurrency policy, and it's actually set to Forbid here, so it's kind of tricky. You're trying to trick me. No, I can see it. That's true. Tricky game. Maybe there is no mistake. I'll go with the restart policy, Never. So, restart policy Never. Is that the correct answer? No, unfortunately, it's not. It's an incorrect YAML structure. And this is actually what happened to Zalando. They used the correct configuration, but they placed it incorrectly in their YAML. Oh, so it was as if it did not exist, because it was in the wrong place. Instead of having it here, they placed it here. The concurrency policy wasn't part of the cron job spec, so they ended up with a cron job without any limits, and it kept spawning pods that were actually completed.
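As an illustration of the kind of misplacement described, not Zalando's actual file, compare where `concurrencyPolicy` lands in these two sketches. Depending on validation settings, an unknown field like the first one may simply be dropped, so the default (Allow) silently applies:

```yaml
# Wrong: concurrencyPolicy is nested one level too deep,
# under jobTemplate.spec, where it is not a valid field
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-job
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      concurrencyPolicy: Forbid  # silently ignored here
      # ...rest of the job template
---
# Right: concurrencyPolicy belongs directly under the CronJob spec
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-job
spec:
  schedule: "*/1 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec: {}  # ...rest of the job template
```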
But they were never cleaned up from the API server, which eventually took their cluster down due to out-of-memory issues. Which is weird. You think you have the right configuration, but you just put it in the wrong place. So it's not just what you configure, but where you place it within your configuration file. Yeah, yeah. And let me tell you a bit about Zalando. Zalando is an online fashion company with over six thousand employees, so again, you can only imagine how much money this cost them. But the real lesson that we learn here is to always make sure that you clearly define the policies and best practices in your organization. It's super, super important. People think that having a correct YAML structure is really basic, but it's not. It can really happen to anyone. But let's move forward to the next one. This is an ingress resource. Okay, here I think I see something that seems weird, because it's an ingress, but the host doesn't have any URL or service name or FQDN or anything. So this one seems fishy to me. Maybe you're correct, maybe not. Let's see. And you're right. That's true. We always want to make sure that we prevent users from putting a star in the host of our ingress. You know why? Usually if you use a star, either it takes nothing or it takes it all. Yes, exactly. When you put a star in your ingress resource, Kubernetes will immediately forward all the traffic to that container. So you have one container for the entire cluster. That's a lot of traffic. And that's what happened to Target too; it was actually their first incident when they started to use Kubernetes. Specifically, what happened to Target is that one of their developers put a star in the ingress, nobody was watching, and it immediately took their entire cluster down. It's not specific to them; it can really happen to anyone. And I want to say that sometimes companies will want to use the star.
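A sketch of what such a risky ingress might look like; the service names are illustrative, and exactly how a bare or wildcard host is validated depends on your ingress controller and API version:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
spec:
  rules:
    - host: "*"              # risky: effectively matches all incoming hosts
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend   # one service now receives all that traffic
                port:
                  number: 80
```

A safer rule pins the host to something specific, like `host: app.example.com` (an illustrative domain), so traffic for other hosts is never routed to this backend by accident.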
But it's really important to understand what happens when you put a star in your ingress, because it's interesting: Kubernetes gives you a lot of flexibility over simplicity. But when you have so much flexibility, it's really easy to make mistakes. You're able to do a lot, but sometimes maybe it's not what you wanted to do. Yeah, exactly. And the real lesson that we learned from Target is to delegate the knowledge. When you start to use Kubernetes in your organization, it's really important to delegate the knowledge to the entire organization. You have a developers team and you have a DevOps team and many, many departments in your organization, and you want to make sure that everyone understands how to work with Kubernetes: what you should do and what you should not do. And let's move forward to the next one. It's not that easy to tell everyone what you want them to do, but we'll talk about this later. No spoilers. Okay, so this is a pod. A simple pod. What can be the mistake here? There's like nothing in it; it's a pod with a name, frontend, and an image. Seems like there is no mistake. It's the shortest YAML I've ever seen in terms of Kubernetes; it seems plain vanilla. So we're going with no mistake. I pick no mistake. Am I right? No: there are no limits. We always want to make sure that we specify requests and limits, especially when it comes to serving third-party applications. Oh, and you can tell that to Blue Matador. Blue Matador had one pod that served a third-party application, and apparently that application's containers were memory hogs. So one day, one of the DevOps engineers noticed out-of-memory issues. He looked a little deeper and found out that this container didn't have any memory limit, which immediately took their cluster down due to the memory leak, because those containers took all the memory on the production node.
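For contrast, here is roughly what that minimal pod looks like once requests and limits are added; the values are illustrative and should be tuned per workload:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
    - name: frontend
      image: frontend:latest   # illustrative image
      resources:
        requests:
          cpu: 100m       # what the scheduler reserves for this container
          memory: 128Mi
        limits:
          cpu: 500m       # throttled above this
          memory: 256Mi   # OOM-killed above this, instead of starving the node
```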
And well, it makes sense, you know, because when you use third-party software, it's not even from your company, so you don't know how it's going to behave. I remember we were using RabbitMQ, which is a queuing service, and the default behavior of the queue is to accumulate as much memory as it can in order to be ready to serve queues. So if you don't limit that container, it will just automatically take everything. And there are some Java applications and such that are very, very memory-heavy. So if you don't put limits, especially on third-party applications that you use, you can't know what's going to happen. You can't trust the default config and you can't trust third-party applications. Having said that, we always want to make sure we set requests and limits. I think the important lesson here that Blue Matador taught me is that it can really happen to anyone, because Blue Matador is a small startup company. It's not a big company; it's not Zalando with 6,000 employees. And while I kept reading, I found out that it can really happen to anyone. And when I say anyone, I mean anyone: Google, Spotify, Airbnb, Skyscanner, Datadog, Toyota. So many companies that it happened to. It's Kubernetes. It's not simple at all, and it can really happen to anyone. Yeah, well, as we said, Kubernetes brings a lot of flexibility, but with that, sometimes you need some sort of guardrails to help you do the right thing. Because you have a lot of options and a lot of things you can do, and especially if you're trying to delegate infrastructure-as-code responsibilities to developers and engineers, and not have only a centralized ops or DevOps team babysit all the developers and be human debuggers for YAMLs, you need to educate everyone and give them the tools, the proper guardrails, to help them make the right decision.
Because no one wants to take down Spotify. I really love Spotify; it's great. You don't want to take it down, but sometimes you just don't know. I completely agree with you, and I want to add something for the fellow developers in the audience. Because as a developer, when it comes to Kubernetes, when it comes to DevOps, it's not my field. It's something that I might be afraid of, and it's not my pipeline. I'm a developer: I write my code and I ship it to production. And yes, I want to make sure that I never, never harm production, but doing stuff related to DevOps, to infrastructure as code, is not my field. And when an organization wants to adopt Kubernetes, I think it's really important to remember that Kubernetes is not about Kubernetes. It's about DevOps culture, and it's a process. It's not something that is done in one day; it's not one take. Developers and DevOps are speaking entirely different languages. I am at the development side of the pipeline and you are in production; we wake up every morning with different goals. I want to be, as I like to call it, the best feature machine: I would like to deliver my features way before my product manager has even thought about them. Production wants to be a production warrior: you want to keep production up, you want it to stay alive, and you don't want people to streamline misconfigurations and cause issues and problems. Yes. And I think it's important to put the cards on the table and to talk about it, because there's a gap here. I think that most organizations don't really put their focus on educating the developers. If you switch technologies and you go and set up Kubernetes, it's not over. It's just the beginning; you're just starting your journey. You need to educate your entire organization on how to do it. It couldn't be more true, but this all brings us to the main question. So we understand now that it's a big thing to start using Kubernetes.
So how can you prevent it in the future? How can you prevent the next misconfiguration and not become one of those Kubernetes postmortem stories? That's a great question. I think that the answer is automation. I'm a great believer in automation, and I'm a great believer in bringing people the tools so that, on the one hand, they're responsible for themselves and have guardrails that help them do their job, and on the other hand, they're empowered and have the ability to make changes by themselves, while you're with them and helping them. And I think that the first thing you need to do as an organization is to define your policies. You need to define what types of checks you want to run. Maybe I want to have memory limits and CPU limits, I want all resources to have liveness probes and readiness probes, and I want to make sure that everyone puts labels on their workloads, because otherwise it's going to be impossible to tell which workload belongs to which team. So number one is to set those policies: to have general ones, like maybe no resource should use more than 50 gigabytes of memory, and also the ability to have granular policies that say, no, no, this is an AI service and it needs to use 50 gigabytes of memory; this is what it does, this is what it's meant for. So number one: define your policies for your organization. And I want to add to "define your policies": what is a policy? Because as a developer, when I thought about policy, I asked myself what a policy is in my area, in my field. A policy is basically what I need to do to feel confident in order to ship my code to production. So what do I do? I write tests, I write integration tests, I do QA tests. Basically, it's a bunch of tests, but that's what I do.
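As a side note, the kinds of checks listed above (limits, probes, ownership labels) might translate into a workload spec that looks something like this sketch; every name and value here is illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
  labels:
    app: example-api
    team: payments            # ownership label, so the workload can be attributed to a team
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
        team: payments
    spec:
      containers:
        - name: api
          image: example-api:1.0.0
          resources:                      # requests and limits, per the policy
            requests: { cpu: 250m, memory: 256Mi }
            limits: { cpu: "1", memory: 512Mi }
          livenessProbe:                  # restart the container if it hangs
            httpGet: { path: /healthz, port: 8080 }
            initialDelaySeconds: 10
          readinessProbe:                 # only route traffic once it's ready
            httpGet: { path: /ready, port: 8080 }
            initialDelaySeconds: 5
```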
I obey clean-code standards and I have my own best practices, and I always make sure that I read up on the conventions and best practices, say, if I write in Golang, for writing Golang. This is what I do in order to feel confident with my code. So policies are something that, as developers and as DevOps, we all have; we all share policies. We all know how to use policies, but the real difference is when we use them and what exactly the policies are. I think that as a developer, my policies are my unit tests, and I add them to the CI/CD. This is part of the pipeline, we're all in the pipeline, and I add them to the CI/CD pipeline and make sure that before I push my changes, I run npm test. Everything is okay? Good, good. I feel confident with my code. And I think that policies are the key. This is how the DevOps engineers and the developers are going to communicate. So you should really put thought into how to use those policies, which leads us to the next step: integrate policies in your organization. We want to validate every change that a developer or DevOps engineer would like to make against those policies, on every change. And we want to validate that ideally through automated checks, like having it within your CI/CD pipeline, or even as a pre-commit hook, as local testing in your local dev environment. This is practically what you already do. So what's the difference between Kubernetes resources and the rest of your code?
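As one hypothetical example of wiring such a check into CI, a GitHub Actions workflow might look like this. The workflow name, paths, and the choice of policy tool are all assumptions; any manifest validator could sit in that step:

```yaml
# .github/workflows/policy-check.yaml (hypothetical example)
name: policy-check
on: [pull_request]
jobs:
  validate-manifests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run policy checks against Kubernetes manifests
        run: |
          # swap in your policy tool of choice; paths are illustrative
          conftest test k8s/ --policy policy/
```

Running the same command as a pre-commit hook or locally gives the shift-left version of the exact same gate.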
Absolutely, and I think you should seamlessly integrate within the development workflow, whether it's a VS Code extension or a CLI utility or a pre-commit hook, or just have it enforced and mandatory. It's a hard word to say, enforced, but in a way, security and stability tests are enforced in every job. Maybe you have 500 Git repositories; you make sure that in every CI/CD pipeline those checks are being run, because if you don't, as you said, your cluster might blow up. So you should definitely integrate it within your CI/CD. If you think about it, what happened to Zalando is that they placed a field incorrectly, so all they needed was to make sure that the YAML was structured correctly, that the structure was valid. That's practically all they needed to do. But let's move forward to the next step. The next step is control, review, and monitor. Yeah, so this is a very important step, which I think is absolutely crucial: to be able to dynamically adjust your policies. Because okay, let's say we sit down, we define which policies we want, and in a moment we'll show you some tools that you can use to do it. You put the policies in, but then you run for one week, two weeks, three weeks, and you might have a new policy. But then what are you going to do now?
Are you going to deploy a policy change to 500 Git repos? It becomes sort of a burden, because in order to change it I need a centralized management place where all of my repos pull the policies from one location and then execute them, so I can modify the policies in one place and they will propagate to all of my workloads everywhere else. And secondly, you want a place that gives you the ability to monitor: which tests are running, which tests are failing, what happened, which workloads have the most errors in them, so you can improve. Because if it's disconnected, and each repository is running as its own island, you're missing the big picture and it will be hard for you to see. Yes, so today we are going to talk about three tools. We will talk about Conftest and Gatekeeper, and we'll talk about our very own Datree application. The really important thing that I think you should notice about each one of them is where it is placed in the pipeline. I want you to start thinking about where would be the most suitable place for you, for your organization, to put an application that validates your Kubernetes resources in your pipeline, because where you choose to put that application might affect your entire organization.
So let's start with Gatekeeper. Gatekeeper is a policy controller for Kubernetes. It enables you to enforce policies that are executed by OPA under the hood. It's practically an admission webhook: it integrates with Kubernetes and has a customizable validating admission webhook. And using it is very, very simple: you install it on your cluster, you define your policies yourself using Rego, because we're talking about OPA under the hood, and then after you define your policies, you apply each one of them using Kubernetes CRDs. The really important thing about Gatekeeper is that it operates at the production end of the pipeline. So it will actually prevent kubectl applies that are misaligned or misconfigured; it will block them from being deployed. The way it works, basically, is that whenever you run kubectl apply, it will check your resource through the validating admission webhook, and if the resource does not meet one of the policies' criteria, it will reject it. But this is at the end; it's in production, so it's not shift-left at all. I already developed my code, it was built and everything, and it's shipped; it's the last mile. But maybe I want to check it prior to that. So if we go back and we want to work in a more shift-left way, another way to do it is by using Conftest. It's also part of the Open Policy Agent, which is a CNCF graduated project. What Conftest allows you to do is actually run automated tests against configuration files such as Kubernetes YAMLs, like manifests or Helm charts. You can write rules in Rego and then just run conftest test against those files and see whether they meet the criteria and pass the policy check or not. A very simple, very straightforward, great utility that you can run in your CI/CD, or you could even run it on your own computer, I guess. Yeah, when I first learned about Conftest, it was
really interesting, because it was the first time that I ever used local testing. This is my unit testing for Kubernetes resources, and I remember thinking: yeah, okay, I understand what unit tests are; okay, I'm starting to love Kubernetes, it's fun. And I think it's a great power when it comes to using Conftest, because as I said before, this is the key point for the communication between DevOps engineers and developers, and I think it's really important. But let's move forward to the next tool, our very own Datree application. Datree is a CLI solution that, just like Conftest, enables you to test policies against YAML files, but the difference is that it comes with built-in policies for Kubernetes best practices. So I don't need to actually go and write the tests myself; it comes pre-built with this. Yes, yes, all you need to do is just install it. In addition to that, we also provide centralized management for all your policies through a UI application: things like creating new policies, enabling or disabling parts of the policies, reviewing a full history of your invocations, and lots of other stuff. And the way it works is also very, very simple. The CLI runs automated checks on every code change, on every resource that exists in a given YAML path. After it's finished, for every violation, for every misconfiguration that it finds, Datree will display a full output, as you can see here in the GIF, just wait for it: a detailed output of the violation, which will guide you on how to fix that issue. And it's also very friendly, I think, to developers. And to use Datree, as I said before, all you need to do is basically install it. It's like brew install datree, or just curl it from anywhere. It's open source, it's on GitHub; you can go and open a pull request to Noaa. This is my code, submit a PR. Yeah, cool. So those are the three main tools that we wanted to talk about. You might ask yourself which one is for me;
maybe all of them are for me? What are the differences? So, as we're at the end of our talk, here are the main differences. First of all, step one is define your policies. Datree comes with dozens of pre-built rules, drawn from all those postmortems that we talked to you about, so you don't need to guess, and you don't need to wait for an outage to happen to figure out which test you should apply. Of course, if you do have a list of your own, you can use Conftest or Gatekeeper for it. Secondly, you want to integrate your policies inside your organization. As we said, Datree comes with a lot of plugins and webhooks, for GitHub Actions and any CI/CD provider, plus pre-commit hooks, and you can run it in your IDE and everywhere you want, and of course in your CI/CD, while the other tools require manual configuration. And third, which I think is very important for organizations, because I really experienced this in my previous role, where we had 400 engineers, and it's really, really hard to control that amount of engineers and thousands of Git repos: the final part is control, review, and monitor. With Datree we provide a centralized policy management solution, so I can go and define my policies in one place and they will dynamically adjust and run. You don't need to change anything in your CI/CD or modify the binary, because the policies are streamed into the CLI solution and then applied. So it's really, really suitable for organizations that want to use it. This is basically the difference between the three tools. And the takeaways that we want you to take from this session are: first, understand how important policies are, how you should use them, and why you should define them in your organization. And if we talk about policies, then: never trust the default configuration; define the policies in your organization, as we talked about before; delegate the knowledge to the rest of the team
across your organization; and remember, it can really happen to anyone. Oh, absolutely. Thank you very much. We'll stop sharing now, and we really want to thank you very much for attending our talk. Feel free to reach out, come talk to us, open pull requests on GitHub; we're available on LinkedIn and anywhere else, and we're looking forward to hearing from you. We're working very closely with the community: you can open issues, and we're adding support for more and more platforms. It's been a pleasure, Noaa. Thank you very much. Thank you very much, Shimon. Enjoy the rest of the conference. Bye bye. Bye.