Hi, Open Source Summit. Hi, Shimon. Hi, Noaa. How are you? I'm good, I'm good. How are you? It's great. I'm so excited, and you know the Linux Foundation always has a special place in my heart. Yeah, exactly, just all my words. So it's really, really fun to be here, and it's also really fun to talk about things from the Cloud Native Computing Foundation, like Open Policy Agent, Conftest, Gatekeeper and Kubernetes. So, you know, all worlds collide, in a way.

So what are we going to talk about today? Today we're going to talk about why policies are so important in Kubernetes. We'll talk about OPA, Gatekeeper and Conftest, why to use them and how to use them, and we're also going to talk about Datree, right? Yeah. We'll show code snippets, and we'll show you real-world examples of postmortems of production outages that happened to some of the greatest companies in the world. So stay tuned, and let's get started.

So, a little bit about us. My name is Noaa Barki. I'm here from Tel Aviv, Israel. I've been a full-stack developer for more than five years, and I'm also a tech writer and one of the leaders of the GitHub Israel community, which is the largest GitHub community in the whole world. No, in the whole universe. Amazing.

And my name is Shimon Tolts. I'm the CEO and co-founder of Datree, and I love tech conferences and I love communities. I run the largest AWS community worldwide as an AWS Community Hero. We have 8,000 members and we have held more than 100 gatherings. 8,000, really?
Yeah, that's what happened to me. And my background is in software development, backend and DevOps. Prior to starting Datree, I was general manager of the infrastructure division at ironSource. It's a nice company that IPO'd a couple of months ago for 11 billion dollars. It was really interesting, you know, joining when we were just 30 people, and when I left we were about a thousand people. A lot of the challenges that we're going to talk about today I really had to face there as well, with 400 engineers. It's really hard scaling Kubernetes and all of the development standards for 400 engineers.

It's really interesting to see that it's not just the hype, you know, the hype that began this year about Kubernetes and everything. So you felt it three or four years ago. Yeah, it's really interesting. We were born in the cloud. We started with microservices right from the get-go, so we kind of felt the pains of the future as all of the companies were migrating to the cloud and moving to those development architectures. Amazing.

Cool, so a little bit about Datree, and then we'll just get it over with. We help companies prevent misconfigurations from reaching production in their Kubernetes workloads. We have a cool open source CLI, written in Go, on GitHub. Go check it out, submit a pull request. Yeah, you know, we wrote code for this CLI. It runs against your Kubernetes manifests and Helm charts and identifies any misconfigurations, like the ones that we're going to show you soon, to make sure that you have a memory limit, a CPU limit, a liveness probe, a readiness probe. You can write policies as code as you like, and you can also build your own custom policies, so it's an entire engine for you to play with.

So, Shimon, you may have been born in the cloud, but I was born in the policies, because policies are what we do for a living here. And with this knowledge, I would like to tell you a story.
I would like to tell you the story of Unicorn Rentals. Oh, I love unicorns. I dressed as a unicorn for Halloween. It's true.

Unicorn Rentals started like every other company: two founders, two developers, and everything was great. Business was booming. Who wouldn't want to buy unicorns? Yeah, sounds like a unicorn company. Yeah, of course. So the company started to grow. They recruited more developers, and they started to deliver more and more features, as fast as they could. And what happens when you have more developers, more features, a faster delivery pace, and you run fast? I guess you start to break things. Yeah, you have more bugs, and more bugs can be a very serious problem, especially when you have more and more customers and you want your production to be more and more stable.

So what do you do? You start to recruit more DevOps engineers. You switch to Kubernetes, maybe Helm, you know, a more sophisticated infrastructure. And you want to make sure that you delegate the knowledge to the development teams, so it won't be only the DevOps. Yeah. And that is exactly what Unicorn Rentals did. I swear, they even sent an email: "Guys, don't forget to use the memory limit. It's super, super important. Don't forget it." And this is how they lived, and they lived happily ever after. But do you think that sending emails to developers asking them to do things is the right thing? It's a spoiler. It's a spoiler.

But, you know, a little while later something horrible happened. Oh, no. Oh, no. One of the developers pushed a resource and forgot one tiny little thing. He forgot the memory limit. Pretty soon after, there was a memory leak, and the cluster went down. So there was a memory leak in the application, and the application did not have a memory limit, so the blast radius was the entire node in the cluster.
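To make the Unicorn Rentals story concrete, here is a minimal Python sketch of the check that the email asked developers to do by hand: flag every container that has no memory limit. It is only an illustration, not how any of the tools in this talk implement it. The function name and the sample manifest are hypothetical, and the manifest is assumed to be already parsed from YAML into a dictionary.

```python
from typing import Any, Dict, List


def containers_missing_memory_limit(manifest: Dict[str, Any]) -> List[str]:
    """Return the names of containers in a Deployment-like manifest that have no memory limit."""
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    missing = []
    for container in pod_spec.get("containers", []):
        # "resources" or "limits" may be absent (or null in YAML), so fall back to {}.
        limits = (container.get("resources") or {}).get("limits") or {}
        if "memory" not in limits:
            missing.append(container.get("name", "<unnamed>"))
    return missing


# Hypothetical manifest: one container forgot the limit, one did not.
deployment = {
    "kind": "Deployment",
    "metadata": {"name": "unicorn-rentals"},
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {"name": "unicorn-api", "image": "unicorn-api:1.0"},
                    {
                        "name": "sidecar",
                        "image": "sidecar:1.0",
                        "resources": {"limits": {"memory": "128Mi"}},
                    },
                ]
            }
        }
    },
}

print(containers_missing_memory_limit(deployment))  # ['unicorn-api']
```

A check like this only helps if it runs automatically on every change, which is the point the rest of the talk builds toward.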
Yes, indeed, makes sense. And I think it's safe to say that it was a Friday night. Always, always, always Friday night.

So Unicorn Rentals is actually not so special. In fact, it has happened to a lot of companies: Google, Spotify, Airbnb. Really, it can happen to anyone, and it doesn't have to be complex issues or something very drastic. It can be what happened to Zalando, when they put in the correct configuration but in the wrong place in the resource. They put the right thing in the wrong place. Yeah, they had a CronJob, and they placed it not in the CronJob spec. So the concurrency policy wasn't part of the CronJob spec, and they ended up with a CronJob without any limits. Oh, so even if you put in the configuration, because it's a YAML file you can put it in the wrong section, and then it won't actually register. It's so easy to make those mistakes. Yes, yes, and it was a valid file, so when they did kubectl apply, it was okay. It was fine. But it wasn't, as you can ask them.

And it can also be putting a star in your ingress and then forwarding all the traffic to one container. One container for an entire cluster, and that's what happened to Target. It was their first accident in Kubernetes, actually. Accident and incident. No, but I agree, because these are mistakes that are easy to make. So instead of specifying the host in the ingress, you just put a star. Yeah, which is a weird default behavior. I wouldn't think that the default behavior of a star is "route all traffic to me". It's kind of weird, really. When I first saw it, I asked, is this the right way to do it? But maybe it was my hunch. No, I think that the behavior where, if you put a star, all traffic from all ingresses is routed to your container, I think it's kind of weird.
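The CronJob story above is a "right thing, wrong place" mistake, and that kind of mistake can be detected mechanically. The following Python sketch walks a parsed manifest and reports every place where concurrencyPolicy appears other than directly under spec, which is where a CronJob expects it. The helper names and the sample manifest are hypothetical, and the misplaced location shown is just one possible example, not necessarily the one from the real incident.

```python
from typing import Any, Iterator, List, Tuple

Path = Tuple[Any, ...]

# In a CronJob, concurrencyPolicy belongs directly under the top-level spec.
EXPECTED_PATH: Path = ("spec", "concurrencyPolicy")


def find_key_paths(node: Any, key: str, path: Path = ()) -> Iterator[Path]:
    """Yield the path (tuple of keys and list indexes) of every occurrence of `key`."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield path + (k,)
            yield from find_key_paths(v, key, path + (k,))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_key_paths(item, key, path + (i,))


def misplaced_concurrency_policy(cronjob: Any) -> List[Path]:
    """Return every path where concurrencyPolicy appears outside its expected location."""
    return [p for p in find_key_paths(cronjob, "concurrencyPolicy") if p != EXPECTED_PATH]


# Hypothetical manifest: the field sits one level too deep, inside the job template.
cronjob = {
    "apiVersion": "batch/v1",
    "kind": "CronJob",
    "metadata": {"name": "nightly-report"},
    "spec": {
        "schedule": "0 2 * * *",
        "jobTemplate": {
            "spec": {
                "concurrencyPolicy": "Forbid",
                "template": {"spec": {"containers": [{"name": "report"}]}},
            }
        },
    },
}

print(misplaced_concurrency_policy(cronjob))
# [('spec', 'jobTemplate', 'spec', 'concurrencyPolicy')]
```

The design choice here is to search for the key anywhere in the document rather than only checking the expected location, because the failure mode in the story is a setting that exists but is silently ignored.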
I don't know. But this is for the Kubernetes spec writers, and I'm not going to argue with them. I think another example here is the Blue Matador incident. What happened is that they brought a third-party container, a third-party service, to run in their Kubernetes cluster, like many do. You can bring a database, you can bring a cache, you can bring a lot of different things, logging and so on. And what happened is that they did not specify the limits and requests, in this case memory limits and requests. You can do the same for CPU. When you bring third-party applications to your Kubernetes cluster, you don't know how they're going to behave. Maybe they have a memory leak, maybe they have a CPU hog, maybe there is some other issue. In their case, unfortunately, because they did not put a limit, it hogged the entire cluster and took all the resources.

Yeah, actually, you know, it happened to me once, when we spun up a RabbitMQ queuing service, and the default behavior is to accumulate all the memory available so it can be used for queuing. So the default behavior is to steal all the memory from all the other instances. If you don't know it, you're immediately going to kill your cluster. Wow. This is a bit of the downside, or the danger, of not running in real VMs, of running in containers in Kubernetes: those kinds of things can happen, and this is why it's really important to put in limits. Yeah, sometimes you just use, I don't know, a third-party application and you trust that application, but you always need to make sure it's doing the right thing. It's just the default behavior, right? Yeah, to take your entire memory. The default behavior. I love the default behavior. When it comes to Kubernetes, you should never trust the default behavior.

Yeah, but everything is okay. So Google, Spotify, we talked about it, but the real question is: how can you prevent it in the future?
This is the real question, and this is what we want to talk about today. One option is to have your DevOps review everything that the developers do, every change. But that's an anti-pattern, because you have one DevOps engineer and an entire team of developers who are pretty much afraid of everything related to DevOps and Kubernetes. Yeah, so what can you do?

Yeah, I agree that it's totally an anti-pattern to turn DevOps or operations or SREs, you know, different companies call them different things, into human YAML debuggers doing code reviews. And it's not even "maybe you could write this code in a more elegant way, or a more memory-efficient way". It's basically "you put it in the wrong section" or "you're missing a memory limit". There's nothing interesting about it. As a person who did a lot of infrastructure work, you want to do the interesting things. You want to do cost optimization, you want to do performance optimization, not be a human debugger for developers. And the developers don't want to wait for you to deliver their features, and you don't want to babysit the developers to make the changes. Yeah.

And when Shimon and I talked about what the solution is supposed to be in this case, I immediately said: Shimon, let's put the cards on the table. I am a developer, you're DevOps, and we are very different people. When I get up every morning, I want to be the feature machine. I want to deliver my features way before my product manager has even thought about them. But when it comes to Kubernetes, I don't know, I'm too afraid. This is not my area. I'm there in the development phase. Yes, I understand CI/CD, everything is great, but I'm there with the code. But Shimon is different. As a DevOps engineer, his mission when he wakes up every morning is not my mission.
Yes, it's not delivering features. No, it's making sure that the cluster is healthy. It's making sure that we have cost allocation, that we do cost reductions, that we are running in a performant way, that we update the versions of our clusters, that they are secure, that they are properly configured with the cloud vendor or the on-premise solution that we have.

But you know what we do have in common? Not only are we both working on the same pipeline, but we both want to make sure that our production is stable. I want to make sure that my features are stable in production and that I never, never take down production. No one wants to take down production. No, nobody. So that implies that the ultimate solution must be somewhere along the pipeline. Maybe in the CI section, maybe in the production section, maybe. But it's got to be somewhere along the pipeline.

So today we are going to talk about three tools. We'll talk about Gatekeeper, Datree and Conftest, and every one of them sits in a specific section of the pipeline. I really want you to think about where the most suitable place would be for you, because the place you choose to deal with policies, with standards and validation in your Kubernetes, might affect your entire organization. And maybe you can use all of them. That's true. That's also true.

So let's start with OPA. But wait, you said Conftest, Gatekeeper and Datree. You didn't say OPA. Why are you talking about OPA now? I remember the first time that I heard about OPA. I went to Shimon and I said to him: okay, so I understand policies. It's like when I write tests for my code, to make sure that I write clean code and follow my own standards and conventions and best practices. This is what I need to do to make sure that my code is valid and ready to go to production. Everything is okay.
I mean my integration tests, my unit tests, everything. What are your policies? And then Shimon looked at me and he said, "OPA." And I really need to explain this joke, because in Hebrew, when you want to celebrate things, you say "opa". And I was very, very confused when he told me that, because we have a band called Opa Hey. It was very popular 30 years ago. And I looked at him and I said, really? Are you sure? Can you explain? So this was not the opa that he meant. So what is OPA?

Great. So OPA stands for Open Policy Agent. The Open Policy Agent is an open source project written in Go that is part of the Cloud Native Computing Foundation, and it is the underlying infrastructure for Conftest, Gatekeeper and Datree, which I will show you in a moment. OPA is a general-purpose policy engine, and the idea of OPA is decoupling policy decision-making from the application logic. So you can have a microservice that receives requests, and let's say Noaa is trying to add a resource to some area of my website. Now the microservice that is responsible for it needs to decide: can I authorize Noaa to perform this action?
One option is that we do all of the authorization locally in the service, and work out whether she can do it or not, and so on. Or we can have an external service. We can either build or buy an authorization service, or we can use Open Policy Agent. OPA allows you to define policies, and it can run as a policy engine, either as a service or as a library, and your service can contact it and ask: can Noaa perform this action? So you either bring it along as a library in every one of your services, and then it is accessed locally, or you talk to it over a REST API and so on, and it can be a central place to reach. Once you decide how you work with it, you run a query and you say: okay, I have this schema, and I have this user trying to perform this action.

Queries are in JSON, right? Yeah, but the language is Rego. So the language that you write all the policies in, as we can see in the next slide, is Rego. What is Rego? I had never heard of Rego before, but it's a declarative language that the Open Policy Agent uses, and it's very, very popular now. And you can push OPA policies into centralized container management like Docker registries, which is really nice. And you can see in the example here that, in a way, you write it like tests.
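The "ask OPA over REST" flow described above can be sketched in a few lines of Python. OPA's Data API accepts a POST to /v1/data/<policy path> with a JSON body of the form {"input": ...} and answers with {"result": ...}. Everything else here is an assumption for illustration: the address localhost:8181, the policy path authz/allow, and the shape of the input document are hypothetical, and the policy itself would be written separately in Rego. The sketch only builds the request and interprets a response, so it runs without an OPA server.

```python
import json
import urllib.request
from typing import Any, Dict


def build_opa_query(base_url: str, policy_path: str, input_doc: Dict[str, Any]) -> urllib.request.Request:
    """Build a POST request for OPA's Data API: /v1/data/<policy_path> with {"input": ...}."""
    url = f"{base_url.rstrip('/')}/v1/data/{policy_path.strip('/')}"
    body = json.dumps({"input": input_doc}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_allowed(response_body: bytes) -> bool:
    """Interpret an OPA response. A missing "result" (undefined rule) is treated as deny."""
    return json.loads(response_body).get("result") is True


# Hypothetical question: can Noaa add a resource to this area of the website?
request = build_opa_query(
    "http://localhost:8181",  # assumed local OPA address
    "authz/allow",            # hypothetical policy path
    {"user": "noaa", "action": "create", "resource": "listings"},
)

print(request.get_method(), request.full_url)
# POST http://localhost:8181/v1/data/authz/allow

# With a running OPA server, the decision would be fetched like this:
# with urllib.request.urlopen(request) as response:
#     print(is_allowed(response.read()))
```

Treating an undefined result as a denial is a deliberate choice in this sketch: the service fails closed if the policy is missing or does not cover the request.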
So I talked about the authorization example, but obviously you can also do a configuration test example. You download OPA and, for example, you start it as a service; in this case you see opa run --server. Then you give it, in this case, an example in Rego which says that if a file is of kind Deployment and it doesn't have a label called app, it will fail, because we want every resource to have a label, and the label that we want is app, so that we can identify which ingress and other things to attach. Then we can contact the service, sending it the lower piece of JSON, and OPA will return a result and tell us whether it is okay or not.

But if I understand correctly, OPA is not specific to Kubernetes. I mean, every example here is about Kubernetes resources, but OPA is just a policy engine. You give it JSON and you write policies in Rego, but it's not related to Kubernetes at all. At all. You could say that every user that has four bananas must have four apples, and if you don't have four apples with your four bananas, you can't eat chocolate, because you haven't eaten all of your... So yeah, you can definitely write any policy you like.

Okay, so that leads us to Conftest, because OPA is really great, OPA is a fantastic tool, but it also requires a lot of heavy lifting and a lot of configuration. For a company like Unicorn Rentals, it's not a tailored solution. They need to put in a lot of effort to write their policies, to download OPA and to, I don't know, host it as a service or something. It's a lot of heavy lifting and a lot of work, especially when it comes to Kubernetes policies. So it's like, let's take OPA and build a domain-specific solution on top of it for a specific use case. Yeah.

So, Conftest. Conftest is an open source project. It allows us to write tests for any structured file, and when I say any, I
mean YAML, XML, JSON, Dockerfile, pretty much everything. It's specifically designed to be run in the CI or as local testing, and it's built on top of OPA. So basically, when you run Conftest, you use the OPA engine, which also means that you need to write the policies in Rego, because it's OPA under the hood. Yeah, makes sense, because this is what OPA speaks.

And something that is very, very cool about Conftest, and that most people don't know, is that you can push your policies to a Docker registry, and Conftest allows you to pull those policies from the registry. So for a service, I can have the Docker container and the policies for it. Yeah, Docker registries are not only for images. Revolutionary. Yeah, it's the new OCI standard. It's not that new anymore, but it was new to me when I found out about it. Yeah, I think you were the one who told me.

So, how to use Conftest. First we'll need to download Conftest, of course. Then you'll need to start writing the policies in Rego, and here we have the same example as before. Conftest looks for rules in Rego named deny, violation or warn. To run the policies, you execute conftest test with the path of the resource that you want to check, and you need to put the policies in a folder named policy. If not, that's okay, you can pass a flag, but this is the default behavior. And now that we have our policies and our resources and we ran conftest test, we can see the results, just like Jest tests or Go tests, just like everything. It's practically unit testing for Kubernetes resources. And not only Kubernetes: any structured file. Yeah, I'd call it any infrastructure-as-code file, in a way. Yeah.
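The "unit tests for resources" idea above, with rules that either deny or warn, can be mimicked in plain Python to show the logic. This is not Rego and not how Conftest is implemented; real Conftest policies are written in Rego and evaluated by OPA. The deny rule mirrors the talk's example (a Deployment must have an app label). The warn rule, the function names and the sample resource are hypothetical.

```python
from typing import Any, Callable, Dict, List

Resource = Dict[str, Any]
Rule = Callable[[Resource], List[str]]


def deny_deployment_without_app_label(resource: Resource) -> List[str]:
    """Deny rule from the talk's example: every Deployment must carry an 'app' label."""
    if resource.get("kind") != "Deployment":
        return []
    labels = resource.get("metadata", {}).get("labels") or {}
    return [] if "app" in labels else ["Deployment must have an 'app' label"]


def warn_missing_namespace(resource: Resource) -> List[str]:
    """Hypothetical warn rule: flag resources that do not set a namespace explicitly."""
    if "namespace" in resource.get("metadata", {}):
        return []
    return ["resource does not set metadata.namespace"]


def run_rules(resource: Resource, deny_rules: List[Rule], warn_rules: List[Rule]) -> Dict[str, List[str]]:
    """Evaluate all rules: deny messages are failures, warn messages are only reported."""
    return {
        "failures": [msg for rule in deny_rules for msg in rule(resource)],
        "warnings": [msg for rule in warn_rules for msg in rule(resource)],
    }


# Hypothetical resource with no labels and no namespace.
deployment = {"kind": "Deployment", "metadata": {"name": "unicorn-api"}}

result = run_rules(deployment, [deny_deployment_without_app_label], [warn_missing_namespace])
print(result["failures"])  # ["Deployment must have an 'app' label"]
print(result["warnings"])  # ['resource does not set metadata.namespace']
```

In a CI job, a non-empty failures list would typically fail the build, while warnings would only be printed.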
Yeah. Yeah, because I think that the main difference with infrastructure-as-code files is that they're declarative, and they're trying to give you the state, the future and existing state of your resource. Yeah, and that's why Rego makes the most sense as a language. I think so too, even though you can do some loops and stuff, but it's pretty basic. Yeah, I prefer other languages, but it's nice. It's very nice.

So I think the easiest way is to integrate Conftest directly into the CI, because in my opinion that's the real power of Conftest and of any other unit-testing library for your resources, because this is my policy as a developer. This is an example with all the labels that Kubernetes itself recommends, from the official docs, for any Kubernetes resource. We created a policy just for that, and here we use a GitHub Action. As you can see here, we pull the policy from a registry with Conftest. This is the path. Can you see it? No, I think that's the container, and the one after it is the pull. No, this is it. Okay, yeah. So it's pulling the container and the policy and running the test: we run conftest test with its flags, and that's it. This is practically everything that you need to do to run Conftest. It feels like running Mocha or something with your files and your tests. Yeah, it pretty much depends on how many policies you want to write. Exactly like unit testing: if you write more tests, it's a lot of work, maybe.

But what if I don't use the CI or the CD?
What if I just push directly to the cluster and there are no policies? What if I'm a criminal?

So I know that operations and DevOps people specifically love to live in the runtime environment, because they say: okay, you have those text files that at the end of the day become binaries, and then some day they come to my kingdom, to the cluster. At the end of the day, it's really good to shift left and to tell developers before the mistakes happen, so they can fix them before they get into production. But in many cases you also want active protection running on the cluster, and for that we have solutions like Gatekeeper.

Gatekeeper is also based on OPA, and it also lives under the Open Policy Agent GitHub organization. Let's say OPA is a very general-purpose solution for policies, and Conftest is a more specific solution for infrastructure-as-code testing. Gatekeeper is really narrowed down to policies on Kubernetes. It was a project that combined CNCF, OPA and Kubernetes, and yeah, it was a big, big project, Gatekeeper. It's a really nice project.
I love it. And the idea, from the Kubernetes side, is that it connects to the admission controllers of Kubernetes. An admission controller is like hooks in operating systems, in Windows or Linux or, you know, Mac. And still, you know, Linux Foundation. Let's say you can hook onto a resource and tell the operating system: hey, if someone makes this system call, please call me and I'll inspect it or tell you what to do with it. It's exactly the same thing here. When someone applies, I want to say, instructions to the cluster, with kubectl apply, or the cluster pulls them in a GitOps way from the repository, it activates an admission controller. An admission controller has several options for how to run.

We will show in a moment what you submit to Gatekeeper. And now we're showing that you can submit code that tells it what to test. I'm not going to go over all the YAML inside of it. Of course, you can check it yourself. But the idea is that you give it a policy. In this case we gave it the same policy, to check that there are labels, and the difference is that it is checked on the cluster, at the moment that someone applies a resource change request to the cluster itself. Then Gatekeeper, using Kubernetes admission controller webhooks, can allow or deny the change. So basically you can take almost the same policies that you have in Conftest and put them in. You can see it here: this is the Rego.
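The allow-or-deny step of an admission webhook, as described above, comes down to answering an AdmissionReview request with an AdmissionReview response that echoes the request uid and sets allowed to true or false. The Python sketch below shows only that decision shape for the required-label example. It is not Gatekeeper's implementation (Gatekeeper evaluates Rego from constraint templates), and the function name, the REQUIRED_LABELS set and the sample request are hypothetical.

```python
from typing import Any, Dict

# Hypothetical policy: every admitted object must carry these labels.
REQUIRED_LABELS = {"app"}


def review(admission_review: Dict[str, Any]) -> Dict[str, Any]:
    """Answer an AdmissionReview request: allow only objects that have all required labels."""
    request = admission_review["request"]
    obj = request.get("object") or {}
    labels = obj.get("metadata", {}).get("labels") or {}
    missing = sorted(REQUIRED_LABELS - set(labels))

    # The response must echo the uid of the request it answers.
    response: Dict[str, Any] = {"uid": request["uid"], "allowed": not missing}
    if missing:
        response["status"] = {"message": "missing required labels: " + ", ".join(missing)}

    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }


# Hypothetical request, as the API server would send it when someone applies a Deployment.
incoming = {
    "apiVersion": "admission.k8s.io/v1",
    "kind": "AdmissionReview",
    "request": {
        "uid": "abc-123",
        "object": {"kind": "Deployment", "metadata": {"name": "unicorn-api"}},
    },
}

decision = review(incoming)
print(decision["response"]["allowed"])            # False
print(decision["response"]["status"]["message"])  # missing required labels: app
```

The same label rule that ran in CI now runs at the moment of the change, which is why the talk describes the cluster-side check as active protection.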
Yeah, it's the same. Exactly. And it will run all the tests that you want not only in the CI/CD pipeline but also inside the cluster. You build that constraint template, and then you apply it: you have a kind, K8sRequiredLabels, and you apply parameters for where these policies should run, on which types of resources, and then the test runs when someone tries to apply a specific change.

So, just to make sure: I write a bunch of policies and I put them in the cluster, so every time someone does kubectl apply, it will go over all the policies, like here. This is the constraint template. I'm sorry, my animations are not behaving. Okay, so this is the policy. I put all the policies in the cluster, and then for every policy I write how I want to use it. Yeah, like on a namespace, on a pod. But it's like a replica of the policies that I might have if I used Conftest. Yeah, I would have two of them. They could be the same, maybe not, because it's still production, I don't know. But I would have two copies of the policies, one in the cluster and one in Conftest, and if I wanted to change one, I would need to change it in the cluster and in Conftest, right? Yes, that is correct. Okay, good.

And the neat thing, which goes a bit deeper into Kubernetes, and you should check out admission controllers: there is a mutating webhook and there is a validating webhook, so you can even change resources as they are applied. Imagine, I'm not sure if it's currently supported in Gatekeeper, but you could say every resource should have a label, or should have a memory limit, and then you can write a mutating webhook that changes it. Yeah, that would automatically apply,
I don't know, a 10-gigabyte memory limit to this resource. Yeah, so it's also a neat way of using it. And you can also check a resource against your other resources. For example, I would like to make sure that I have unique ingress hosts. Oh, yeah. And then it needs to be aware of them. Yeah. Gatekeeper also has a lot of features. It has dry run and audit, if you just want to validate that your policies are correct, and I really encourage you to check those out. But yeah, I think I understand. Cool. I agree.

So I think the final bit that you might ask yourself is: okay, I understand why I need to run policy checks, you convinced me, you love policies, and you showed me the tools that I can use to build it, Conftest and Gatekeeper and OPA in general. But then you ask yourself: wait, am I supposed to know all the postmortems that happened? How am I supposed to know which policies I should write? What are the best practices? And then how do I make sure that all of the services are actually pulling the policies dynamically, you know, updating them? Maybe I have 400 services. How am I supposed to update all the policies for all of them, and see which ones pass and which ones fail? How can I dynamically control which types of policies should run on which types of resources and in different environments? That's right. So for that you have Datree. So if you would like, check it out.
We're an open source solution, and it basically does everything we just told you, only it comes with predefined policies out of the box. We read more than 100 postmortems and worked with dozens of companies on unfortunate events that happened to them, and we took all of this wisdom, codified it and put it inside our solution. It is open source, written in Go, and you can run those checks against your Kubernetes manifests and Helm charts and verify that the pre-built policies are passing, or go and create your own custom policies.

To sum up, I really want you to remember that sooner or later Kubernetes will become, as I always like to say, your production temple. But it's not only about adopting Kubernetes, because when you adopt Kubernetes you adopt a new culture in your organization, and as a developer I feel it way more than the DevOps engineers in your organization. You will spend time and resources on your Kubernetes whether you like it or not. But Kubernetes, and especially DevOps, is not something that happens overnight. It's a process, it takes time, and you really need to think about how you want to manage it, especially in terms of scale. It's something that is really, really important.

And first of all, if you have any questions, feel free to ask us. And we're hiring, so come join our team. You know, the regular plug. We're sorry, but we're hiring. I also saw the Datadog study recently, which says that more than 50% of all container environments run on Kubernetes, and it's going up like this. Wow, like this. So it's not a question. It's a fact.
Yeah, and I'll tell you, it's a fact. And Kubernetes, as I see it, you know, systems are designed to be either simple or flexible, and Kubernetes is designed to be flexible, which means that you can make a lot of mistakes and it's not that simple to manage. So I think that, in the same way that today every service has a CI/CD solution, and of course you're going to build it and test it in CI/CD and then ship it, the next natural stage is to have a policy engine inspect the changes before applying them to production. It has to be automated, because there's just no way for me as a developer to remember to put all those things in place and to not make a mistake. I'm also a person, I'm also a human being. I make mistakes, and I need some automated tool to go and check it for me. So make your lives easier, make your developers' lives easier, make your DevOps people's lives easier, by automating it. Yes, make the world a better place.

Okay, thank you very much. Noaa, it was great. Thank you, Shimon. We look forward to hearing from you. Go check us out at datree.io. Thank you. Bye-bye.