Hello, everyone. I'm Stefan and I'm here with Brendan. Hello, Brendan. We are going to have a nice chat about Kubernetes in 2023 and see where Kubernetes is, what its future looks like, and how you can get the best out of it and, you know, build your own platform.

Yep. Pretty excited to talk about it in the ninth year of Kubernetes. It's amazing to be thinking about that. So where are we at in 2023? You know, I think the thing is that Kubernetes is now mainstream. It's not even in the mid-adopter phase, it's in the late-adopter phase. And it's really become the industry standard for running cloud-native applications anywhere. So, you know, Stefan, what are you seeing in terms of the reasons why people are choosing Kubernetes?

Well, the reason I'd choose Kubernetes, above anything else, is the community. I'm really amazed at what the community has built around Kubernetes. The whole ecosystem is really great, and I feel very proud to be part of it. But yeah, if you want to choose Kubernetes, you have so many reasons. And I think one of the major reasons today is the fact that the managed services, the cloud vendors out there, are offering mature Kubernetes as a service. And I think that's quite critical for Kubernetes adoption, given the fact that, you know, Kubernetes is complex and it's really hard to set up. Well, if you can do it with the click of a button, then you can really focus on what's really important, which is, you know, creating this ideal continuous delivery pipeline, so all your apps get deployed instantly and all your customers are happy.

Yeah, for sure. I mean, I think the interesting thing is, at this point we might not even say customers are choosing Kubernetes; I think you'd say they're just assuming Kubernetes. And you're seeing this all over, whether it's in AI or databases; a lot of people who are building ISVs or building software are just assuming that you have managed Kubernetes available to you. And I think there's a huge lift as well, in terms of people can kind of forget about how you manage machines, how you deploy software. They can rely on this infrastructure and then innovate on top of it. And that's another big part of the ecosystem, I think, that innovation. And speaking of that sort of stuff, I think the truth is, when we look at our developers out there, what we're really seeing is that Kubernetes is a platform for building platforms. So it's the base, it's the foundation, but it's not where a lot of people stop, because raw Kubernetes by itself, the basic objects that are in there, they're a little bit like machine code. They're often too complicated for application development, or in some cases it's just that the Kubernetes community proactively decided, you know, that's not a problem we're going to solve, we're going to defer to the ecosystem out there to solve it. So, Stefan, what are you seeing in terms of ecosystem projects that complement Kubernetes?

Yeah, well, what about vendor lock-in? I mean, when Kubernetes started, right, the main idea was that it would get rid of vendor lock-in, and you'd be free of it, and you could move from any cloud to any cloud, from any data center to any data center. Did we deliver on that promise, or is it more nuanced?

Yeah, I think it's way more nuanced.
I mean, I think we see this out there with various enterprises who've said, we're going to be building an application platform no matter where we go, and our developers will just see the application platform, and it won't matter which public cloud they're going to. And I think it's a great vision, but it takes a ton of discipline, right? And what we're seeing, interestingly enough, is that really it's the value of consistent tooling, and of not having to educate your developers in different ways of deploying software into different places, that is the real win. I sort of draw a comparison to the Java promise of write once, run anywhere, which was really more like write once, debug everywhere. So I say, I don't think we want to sell people on the vision of perfect portability. But I do think that there's skills transfer, meaning you can have one set of CI/CD, one set of developer tooling and code review, all of that kind of stuff, and have that skill set be portable. And I think, also getting back to that ecosystem thing, there's a real opportunity to hire people with skills immediately valuable to your application platform, even if they come from a different company, or even a different academic environment. And I think that consistent app deployment is something that is a real value proposition.

Yeah, as a Flux maintainer, I'm interacting with Kubernetes all the time from an API perspective. So from my point of view, the real strength of Kubernetes is that no matter where it runs, if it's Azure, AWS, whatever, the API is the same, and I have the same, you know, guarantees. If I want to look at an application, is it healthy? There is a Deployment object, it has status conditions, and so on. So I have that consistency of at least saying, is it running or not, or do I want to do something with it? And I don't have to change, let's say, the operations to perform a rollout, even if maybe I have to change the ingress controller or the load balancer implementation every time I move from one cloud to another. There are so many other things around the application lifecycle which stay very consistent, even when I'm moving from one platform to another. And I think if something will live on from Kubernetes many years from now, it's the idea behind its API and how these things are composed together: how a Service relates to a Deployment, how a Deployment relates to a Pod, and so on. These kinds of concepts are, in my opinion, way better than thinking in terms of machines and processes. So that's a great step forward, in my opinion.

Yeah, for sure. I mean, I think we always thought of them as being developer-oriented APIs instead of infrastructure-oriented APIs. And I think also that ability that you talked about, for the ecosystem to build a single set of tools on top, is a huge win too. We wouldn't have a great ecosystem if Flux didn't work with Kubernetes everywhere. If you had to be like, this is the Azure Flux thing and this is the AWS Flux, it just wouldn't work. So that consistent API, even if apps aren't perfectly portable, enables these shared projects, these projects that the whole world can come and rally around.
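(As a concrete picture of the consistency Stefan describes: the status block below is what any conformant cluster reports for a healthy Deployment, whatever the cloud. The deployment name is hypothetical and the values are illustrative.)

```yaml
# Trimmed output of `kubectl get deployment my-app -o yaml`: the same
# status conditions appear on AKS, EKS, GKE, or bare metal.
status:
  replicas: 3
  availableReplicas: 3
  conditions:
    - type: Available
      status: "True"
      reason: MinimumReplicasAvailable
    - type: Progressing
      status: "True"
      reason: NewReplicaSetAvailable
```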
And I think what's really great, actually, is that in doing that, we've also enabled people to forget about problems. I don't know, Stefan, about when you started in the industry, but when I started in the industry, figuring out how you shipped your code out to a server in the data center, that was a problem you had to solve. Am I going to use a tarball? Am I going to use an MSI? How am I going to install? There were actually companies whose whole business was how you installed software. And I think now we've gotten to a place where, with the container registry API and a Dockerfile to declaratively describe your container image, there's just a consistent way of doing that, one that everybody knows and that we can rally tools around. So you can do image scanning on a specific image format; you can do all kinds of different stuff in a consistent way. And similarly, we've taken the idea of how you do a deployment, right, and we've turned it into code that everybody can use, as opposed to maybe a checklist that somebody wrote down, or just tribal knowledge where people were like, okay, this is the right way to do a zero-downtime deployment. I think that's really huge in terms of making everybody's lives better and easier: enabling most people to not think about it, and also enabling us as a community to come together and produce a single best-practices implementation, as opposed to hacky scripts left all over the place. But we've promised a lot on this slide. So do you think it actually can be this simple? Are we overselling this? Or how does the complexity come in?

Yeah, I think the complexity comes in when you're not a single developer. You don't have a single Git repo and just one app; more people have to collaborate on it. And it's not just a Kubernetes Deployment. You need, you know, policies around it, you need horizontal pod autoscalers, you need an ingress, you need all these things. And there's that idea that, okay, Kubernetes is declarative, you can get away with a Dockerfile, a deployment.yaml, and a GitHub workflow in your repo, and you can actually make this all work with Kubernetes as a service, let's say AKS. With GitHub Actions, it's a great combination to deploy a simple app. But when you get to many teams, many applications, maybe an application made out of many microservices, so many repos and so on, it's quite hard to maintain that simple workflow you started with. So one solution that me and the other Flux maintainers have been working on for, I don't know, five years now, almost six, is the idea of how you can take advantage of the declarative nature of Kubernetes and enable a more streamlined continuous delivery pipeline, something that works at scale. So, for example, you start with two clusters, staging and production, right? Then you realize, oh, I cannot have a single production cluster, because now my clients are not only in Europe, they are also in Asia, so I have to get the application closer to them. In the original, legacy pipeline, let's call it that, where you do everything from a single CI job, every time you add a new cluster you have to put the secret there, make the CI job aware of that cluster. There is a lot of work. And it's quite dangerous, because when you keep all things in a single place, well, we've seen so many issues starting a couple of years ago with attacks, especially on CI instances.
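(For reference, the simple starting point Stefan mentions really can be that small. A minimal deployment.yaml sketch, with hypothetical names, that a GitHub Actions workflow could build the image for and then apply with kubectl:)

```yaml
# deployment.yaml: the whole manifest for the single-app starting point
apiVersion: apps/v1
kind: Deployment
metadata:
  name: podinfo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: podinfo
  template:
    metadata:
      labels:
        app: podinfo
    spec:
      containers:
        - name: podinfo
          image: ghcr.io/example/podinfo:1.0.0  # pushed by the CI workflow
          ports:
            - containerPort: 8080
```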
So, to make this story better, our idea was to have the clusters look at the desired state themselves, instead of having the CI process go to the cluster and tell it, with kubectl apply or kubectl delete or helm install or whatever command you're running: hey, now deploy my app. Instead of doing that imperative thing where you connect to the cluster and push, the cluster should look somewhere, it can be a Git repo, or more than one, and see if there is a change in the desired state. Oh, the container image has changed, so I have to roll out a new version of the application. You don't have to connect to the cluster and say, hey, now run this new version; the cluster does it itself, through controllers. Flux is just an extension of the Kubernetes API, and there are other solutions out there doing the same thing, Argo CD as well, in the CNCF. So the idea is, basically, you add a new cluster, and that cluster knows from the start that it needs to reach a certain desired state: it needs to install add-ons, it needs to deploy applications and so on, without you having to tell it exactly what to do. I think it's not a solution that will all of a sudden make you scale infinitely, but it's a good step forward.

Yeah, for sure. And I think one of the other interesting things we've seen with the GitOps story is that it enables you to really have a cluster that's locked down from a security perspective. Because if you have an agent inside the cluster, reaching out to Git to find out its new desired state and then adjusting the APIs inside the cluster, the number of times you need external users to reach into that cluster and make changes is actually quite small. It's just maybe if there's a live-site issue, or you need somebody to come in and debug a problem. And so that means, for example, when you use an AKS cluster with Active Directory, our recommended production setup would be that you have a group with no members in it that has access to the production cluster. So by default, nobody has access to the cluster most of the time. And then, only in a just-in-time way, you add users into that group for a period of time, a time bound, like eight hours, and AAD will actually rip them out, automatically remove them, after that time bound. GitOps makes that possible, because if you are using traditional CI/CD, as Stefan said, you have to keep that credential in a different place, and then you also have that robot, that agent, wherever it is, continuously making API calls into your cluster. So you usually need to put the cluster endpoint on the public internet, and you need to have that inbound access available to the robot. And so I think GitOps not only makes it easier to scale and add new clusters, it actually also gives you a more secure footprint by default inside your clusters. So I think that's a cool benefit. And of course, the other thing we see, especially out at the edge: we have three clusters here, and maybe that's typical for an internet service, but we see retailers who have thousands of Kubernetes clusters in retail environments, and there's just no way they're going to manage those from CI/CD, both because of the scale and because of the intermittent connectivity.
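(A sketch of what that pull-based setup looks like with Flux: the cluster watches a Git repository and continuously reconciles itself toward the manifests it finds there. The repository URL and paths are hypothetical.)

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: fleet
  namespace: flux-system
spec:
  interval: 1m                 # how often to poll the repo for changes
  url: https://github.com/example/fleet-config
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m                # how often to re-reconcile cluster state
  sourceRef:
    kind: GitRepository
    name: fleet
  path: ./clusters/production  # which directory this cluster applies
  prune: true                  # delete objects that were removed from Git
  # suspend: true              # pauses reconciliation (what `flux suspend`
  #                            # does), e.g. while debugging an incident
```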
Reversing the connection, from being a push into the cluster to being a pull from the cluster, means that if that cluster has intermittent connectivity, it still works. The Flux agent will try for a while, it'll fail, and then eventually you'll get connectivity back; it'll talk to the Git repo and it'll adjust things, as opposed to your entire pipeline crashing to a halt because you couldn't do a kubectl apply onto a particular cluster at a particular time. So I think there are a lot of benefits from that reversal of the flow.

Yeah, one more thing I wanted to say here is around drift correction. When you adopt GitOps, you should in a way be comfortable with the idea that you shouldn't be going into a cluster and doing edits all of a sudden, right? What Flux tries really hard to do is undo all those edits: when it discovers one, it reports it, hey, something has changed the state outside of my knowledge, and it will try to revert things to the last desired state. And in this way you have something that fights you during incidents, right? That's why we have a command for pausing it for a particular time, so you can get in there and debug and resolve things. But what we're trying to encourage in users is: if you want to change something, do it declaratively, and your whole team should be aware of the change, right? Because it has a collaboration aspect to it: you should open a pull request, someone from your team should approve the change, and only then will that change actually be rolled out across the cluster fleet. And it's a hard change in terms of how you...

I'll also say that I've seen outages caused by people going in, fixing stuff manually, and then forgetting to move the fix into version control, and then the next rollout comes along and overwrites whatever the fix was. So you effectively have the same outage twice. I totally think that getting people into the mindset of everything has to be checked in, even live-site fixes. I mean, every once in a while there's an emergency, but more often than not you want to go through the standard rollout procedure even for a live-site fix. So, we've talked a lot about GitOps, which is kind of a general way of deploying and configuring your cluster. But what do you think the ideal cloud-native platform looks like? What are the pieces that people should be thinking about when they're building that cloud-native platform?

Yeah, so I used to think GitOps is the platform, right? Until I talked to many, many Flux users, who were telling me, hey, not everybody on our teams is cloud-native savvy. Not everybody is technical. Not everybody can even, you know, get into Git, do a commit, open a pull request and so on. There are so many people who are part of the delivery process who are not technical. So I think the ideal cloud-native platform should allow all the people who are part of the delivery and maintenance, of what it really means to ship a product to clients, all these people should be able to collaborate on this platform. So the platform has to, in a way, abstract things out and make them obvious. It is for everybody, and I think that should be the driving force behind building a platform. And, of course, standardization. If you have Node.js, say you have 10 Node.js apps, and every single app has a different Dockerfile, a different Helm chart, a different way of writing logs or stuff like that, it's not going to scale.
Because then you can't just create a runbook, "how I fix Node.js apps", when every single app is a little bit different, right? So standardization is great for that. And it's not only about tooling, how you build the app and so on; it's also about practices. I think standardization should start with the practices and then work back to the tooling: decide which tooling to use based on what your practices are.

I think it's always a challenge, too, because people are like, well, but this is my favorite tool over here. And sometimes you just have to tell people, the standardization is worth more than you getting to use your favorite tool. And that's a hard lesson, I think, for engineers to learn at some level. But at scale, in large companies, there's just so much value that you get out of that consistency. And even for people who know what they're doing, and I totally agree with you, a large number of people just want to get their job done. They just want to write some code, check it in, magic happens, it's deployed. And that's goodness. But even for the people who want to be down in the cloud-native weeds, you have to at some point be like, you don't get to choose which version of Node you run. There's one version of Node.js for our company. And this is, I think, how the platforms are, how you get there.

Yeah, it's also about flexibility. At some point you have to allow someone to fine-tune something around their application. But you can't do that to the detriment of security and policy. You can't allow people to say, oh, I'm disabling the policy and now my container runs as root, because I don't have time to look into the decision. There are different configurations and different levels of what you should be able to change. And I think the platform should be flexible enough, but also have a way of enforcing policies. So even if you change things, you click all the buttons and you make your deployment very insecure, something should tell you, or block you, later on. There are great tools in the ecosystem that can do that. Even before your application gets onto Kubernetes, you can do it as static analysis in your pull request, something that just scans your deployments and says, hey, you've set this deployment to run as root, that's not okay. So yeah, policy is an important aspect of any kind of platform. Whether you buy a platform or build it, you should be looking into how you can enable and enforce policies. And it's not only about security; I think policies should also address resilience.

We're seeing great use of policy for best practices. I think policy got, maybe not a bad rap, but it got associated with compliance and regulation and all that kind of stuff. But the truth is that even just having a policy that says you need to have resource requests and limits set on your pods, right? That has nothing to do with compliance, but it has a lot to do with keeping your app stable. And I think that's an area where we're seeing more and more use in the defining of these platforms. And again, it's not about trying to force developers; it's actually enabling them to think about less, right? They don't have to have a checklist in their head of, I need to do all these things. Policy helps. I say it's like the guardrails on a mountain road. You get to drive faster because there's a guardrail, not because you're going to crash into it, but because you have that extra safety factor if you do.
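(One way such a guardrail can look, using Kyverno as an example policy engine from the ecosystem; a simplified sketch of a rule that rejects pods without resource requests and limits at admission:)

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-limits
spec:
  validationFailureAction: Enforce   # block non-compliant pods, don't just warn
  rules:
    - name: validate-resources
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory requests and memory limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"      # "?*" means any non-empty value
                    memory: "?*"
                  limits:
                    memory: "?*"
```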
Yeah, I think setting good defaults is one of the most important things when you do that standardization, right? Let's say, what do most people do? They create this Helm chart, right? And in the Helm chart there are no defaults. You have to set your security context, you have to set resource requests and limits, you have to set all these things. Why not set some good defaults there, so people don't have to think about it? If their app doesn't run correctly, they'll notice in the test environment, and then they get to fine-tune these things, but you should always set good defaults. And I've seen there is a trend now with policy where you can actually set these defaults at the admission control layer of Kubernetes; there's a sketch of this after this exchange. So Kubernetes has this thing where you create a deployment, the deployment has no resource requests, no limits, but you can actually inject them based on, I don't know, some conditions, some labeling, based on what namespace they end up in. So you can apply all these good practices and defaults even when it's not baked into the standardization of how you package the app; you can also do it at admission inside the Kubernetes cluster, which is very powerful.

Yeah, I think it's sort of like the equivalent of when something like VS Code added formatting on save. I don't even think about formatting my code anymore. I mean, I write terrible code now, right? I write really ugly code, and then I hit save, and it's beautiful. I just don't even think about it. And I think that's a real benefit too: it makes me faster if I don't have to think about it, and I know that some automation, even if I'm the one who configured the automation, is going to come along and do the right things for me. All right, so what do we think about when we're designing this platform? We kind of have the components of it. What are we thinking about when we're designing it?

So I think the first step is to acknowledge that even before you start, the platform is in a way already there, built in pieces everywhere, right? Each team has their own scripts for deployment. Maybe someone builds Helm charts, someone has good Dockerfiles, whatever. All these pieces everywhere are the platform. So someone has to own the platform project, and in order to unify all that, you actually have to create a team. Name it the platform team, but someone has to own it, especially at the beginning, and be the driving force toward consolidation and all of that. The second step is, of course, identifying all the parties involved in the delivery pipeline. That's very important. You shouldn't take into account only the technical people; everybody. You should know the whole extent of it. And I think after you describe your process, how we are going to do this delivery, then you can go into Kubernetes, pick the components that you need, define infrastructure blueprints and all of that. But the human aspect, the first one, is very, very important.
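(The admission-time defaults Stefan described above, in their simplest core-Kubernetes form: a LimitRange that fills in requests and limits for containers that don't set them. The namespace and values are hypothetical; mutating policy engines can do more conditional versions of the same idea.)

```yaml
# Containers created in this namespace that omit resources get
# these values injected at admission time.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:        # used when requests are omitted
        cpu: 100m
        memory: 128Mi
      default:               # used when limits are omitted
        cpu: 500m
        memory: 256Mi
```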
I think even one of the things that we've built into our deployment platform is a big red button, which basically is, I want to stop all deployments. It's Black Friday, we're going to do a conference, whatever it is. We all know that deploying code is the easiest way to cause an outage. That is the truth. And sometimes you just want to stop every team from being able to deploy code. And you don't want to have to go to every team and be like, could you please stop, and here's the window. It's work for them, it's painful for everybody. So even just having that ability in the platform, it's a totally human thing, but to be able to hit that button and just say, we're stopping, we're not deploying any code for two weeks or whatever. Well, hopefully it's not two weeks, but whatever that time period is. It's a human-factors thing, but it's a huge value. And then on the other side, I'd say, and I don't know if you've seen this on the human side, the people who are out there selling to our customers, they want to know when a particular feature is going to get to a particular region where a particular customer is. And if you can give them a dashboard in the platform that says, here is where you're at in the release process, they don't have to ask us, and they do a better job talking to the customer. So I think there's a lot of that human-factor stuff that goes into designing the platform.

That's part of the observability side. And observability is not only about telemetry; it should also be about what software is running in your platform. You should have this unified software bill of materials, not just, I'm running these apps. Well, those apps are definitely importing hundreds and hundreds of packages from upstream, from open source projects; you should be able to look somewhere and see all these things. And of course, you should also have observability into the delivery pipeline itself: okay, this region is on this version, that other region is on this version. And this is part of the UI aspect, right? I wrote on the other slide that you should have self-service APIs. Well, it's not only about APIs. APIs are great, and they're a requirement for building a UI. But at some point, even if you say you don't need a UI, you definitely need something, maybe read-only, that gives you that overview, so you don't have to call all these APIs on your own and so on.

For sure. Yeah, or even just visualizing, right? One of the things we've done for our DRIs is just give them a timeline visualization of when each deployment happened, which they can then effectively overlay with their metrics. They're like, oh, I see there's a spike in errors here, and I see there's a deployment of this microservice here. I wouldn't necessarily have thought they were correlated, but because I can visually see that they're correlated, I'm going to go investigate that thing. And I think you're absolutely right. I'm all about declarative deployment, but there's a reason we use monitors and pictures and not just command-line tools for all the stuff that we do. There's value in the visualization.

Yeah, yeah. One thing we did in the Flux project, we integrated with Grafana annotations, and every time Flux does a pull and an apply, we annotate all the graphs in Grafana. So when you look at your business metrics or whatever, you see there, oh, this tag is now deployed. So you see, oh, this is version 2.0; maybe that's why the CPU has spiked and the memory is through the roof.

Yeah, yeah, for sure. That marker is so important. Yeah, for sure.
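(That Grafana integration goes through Flux's notification-controller: roughly, a Provider pointing at the Grafana annotations API, plus an Alert selecting which events to forward. A sketch; the Grafana address and secret name are hypothetical.)

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: grafana
  namespace: flux-system
spec:
  type: grafana
  address: http://grafana.monitoring/api/annotations
  secretRef:
    name: grafana-token        # API token used to post annotations
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: grafana
  namespace: flux-system
spec:
  providerRef:
    name: grafana
  eventSeverity: info
  eventSources:                # annotate on events from every Kustomization
    - kind: Kustomization
      name: '*'
```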
And I think the other thing that's on the bottom here is just another really, really valuable piece of advice for anybody out there who's starting to get their feet wet in producing one of these platforms. We're all in technology, we love technology, we like the cool stickers and all that kind of stuff. And so there's a tendency to throw the kitchen sink at your platform, to make sure that, for that CNCF landscape that has, I don't know how many logos on it now, you've sort of checked off each one, like it's a merit badge or something. But don't do that, right? For every component that you add into your platform, be very clear about what the benefit is and what the cost is. Can you secure it? Can you update it if there's a CVE? Can you operate it if it doesn't work? I think we have to shift from thinking that each component we add to our platform is a new cool thing, to thinking that each component we add to our platform is a potential liability. I went through that same mental shift when we talked about software dependencies earlier, right? I went from a world, five years ago, of being like, cool, NPM, whatever, to now being like, okay, what bad actors am I inviting into my code by installing this dependency? And I don't think there are bad actors, necessarily, in the components of your platform, but there is an operations liability and a security liability that comes with each one. So start simple. Know that it's going to evolve anyway. The platform is going to change; everything changes. So anticipate that you can add things as you need them, instead of having to have them all in there at the start. So when we think about that, what do we think about in terms of delivering the application? We've got our platform; now we're going to deploy code into it.

Yeah. So before we get to piecing together our platform, I think we need to put on paper the delivery process of each application, or each group of applications. It really depends on the scale you are running at, but one thing is very clear: not all apps are the same. Some apps are critical for your business needs, for whatever the product offers, and some are not in the critical path, right? And the way I think those decisions should be made is by defining some service level objectives, and you can group apps based on that and create separate delivery processes. Maybe for something that's highly critical, you want to go through dev, staging, test; you want to keep it running for a very long time with just 10% of your users on the new version, and so on. But for other components, which are not critical, maybe you don't want to wait five days to deliver a patch. So it's quite important to group them. I'm suggesting service level objectives here, but there are so many other ways you can structure them.

But revenue, for a lot of people I see, revenue is the big thing, right? Where it's like, this is my logistics app, and if it goes down, I can't sell anything. Whereas if that other thing goes down, well, maybe internal users get their laptops a day later. There's a big difference in the business criticality of what the particular apps are.

And another important part of the delivery process is humans.
We usually think of the delivery process as a totally automated thing: it runs on its own, it's continuous, humans are not there. But as you said, that red button has to be able to stop everything; or, I don't know, no deploys on Fridays, or whatever. So for every step in the process, you should consider: should I make this step optional? Can a human intervene here? Can someone do a rollback even if all the metrics are okay, even if the application is green, because maybe there is a business decision to delay some feature or whatever? Your delivery process should take human intervention into consideration at every step. I'm not saying put human gates everywhere, but allow for that possibility if you can.

Yeah, for sure. I think there's always that case where your testing doesn't catch it but you can see that there's a problem. Or, I would say, you never know when somebody's going to be out demoing your thing, and you just don't want to do a rollout while somebody's doing a demo. So, before we even think about delivering code out to production, what are the pieces people should be thinking about as they go from the code on disk to the container image that's ready to push?

Yeah, before we even get to Kubernetes, we have to build our apps and package them in a container image. And that process can be very simple: you write a Dockerfile, you do a docker build. But given the fact that your app is made out of so many external components, you should think very carefully about this initial step, which is critical to the security of your whole system, right? It really matters what kind of packages you're installing on top of the base operating system in the container. So maybe you should look into creating and attaching a software bill of materials to your container image. Recently, for example, Docker BuildKit gained such a feature. It's not perfect, it will not create a perfect SBOM, but it's a start, and if you have nothing today, you should enable it, right? Another important part is provenance. When you look at a container image, you should be able to tell when it was built, on which machine, and what tools were involved in creating it. And this helps you discover CVEs: maybe there was no public CVE when you built the image, but two days later there's a major one, and through the provenance file you can see, oh, I used this software to build this image, and that software was compromised, right? Maybe I should revert it, get it out of the production system.

Yeah, and I think there's even stuff you should shift left. One of the things we've done is shifting left on things like the FROM tags in your Dockerfile, right? I think people are like, I've got a private registry, I built my image, I pushed it to my private registry, I'm good to go. And you're like, well, that's great, but what was the base image that you built from? And if that base image is random-user-at-Docker-Hub, you should probably be thinking very carefully about whether you want to support crypto mining, basically.
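(The BuildKit attestations Stefan refers to can be switched on per build; for example, in a GitHub Actions workflow using docker/build-push-action. A sketch with a hypothetical image name, assuming Buildx setup and registry login happen in earlier steps:)

```yaml
# One step from a GitHub Actions workflow: build and push the image
# with SBOM and build-provenance attestations attached.
- name: Build and push with attestations
  uses: docker/build-push-action@v5
  with:
    push: true
    tags: ghcr.io/example/app:1.0.0
    sbom: true          # attach an SBOM describing the image contents
    provenance: true    # attach provenance describing how it was built
```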
You know, and I think that's the other piece of it: because we're using these declarative formats, not just for our applications but for defining how the application is packaged, we can start applying a bunch of the shift-left stuff that we typically thought of as being associated with code to our application packaging as well. I can now say, look, you have to have a FROM image from this list; I've got these six blessed base images, you can start there, and that's it, right? And I'm not going to let your Dockerfile even check in if you're coming from a different FROM image. And I can even go further. It's great that Docker has the SBOM stuff in it, but I can also look for things like apt: I can see that you're running a script that installs with apt, or that you're running NPM, and I can actually extract the packages you're installing from that Dockerfile. So even before we start building the image, there's a bunch we can do to keep track of, and keep a handle on, exactly what's going in. Because I do think that somewhere along the way in the container, cloud-native world, we forgot that downloading random binaries from the internet is a bad idea. Somehow docker pull made us forget that. But I think we're getting it back; we're starting to realize, oh yeah, this is actually something we need to pay attention to.

Yeah, in the last two years I've seen a lot of going back to basics on building images, and there are many projects and organizations out there trying to make this process as safe as possible.

And I think the other thing we've seen as well is going toward things like reduced-package images, right? You start from, let's say, Ubuntu or any other standard, traditional Linux, and there are a lot of packages in that image that you might not be using but that might trigger CVEs in your scanning. And what we found is that the more noise in your image scanning, where someone's like, oh yeah, that's a vulnerability, but we don't ever use that tool, the harder it is to figure out if you're actually secure. So reducing the number of packages, going to a distroless image or other kinds of reduced images, or, in a statically compiled language like Go, maybe even just a pure scratch image, helps make sure that when there is a CVE scan, the CVEs are really about your application, and not about some ImageMagick binary that happens to be sitting in your image but is never actually used by your application.

Yeah, it also depends on which programming language you are using. Go, Rust, things that can be built statically, are way more suitable for FROM scratch, or FROM Alpine with no packages, nothing, not even a shell installed. But for others, for interpreted languages, you have to rely on OS packages; you have to have OpenSSL installed there, or libssl, and all of that. But instead of installing the whole dev suite and getting every single tool, install only the certs and only the OpenSSL client; you don't need the whole suite, right? Otherwise there's a lot of unneeded stuff in there that's just sitting around waiting to be exploited, in a way.
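(What that looks like in practice for a statically compiled language: a multi-stage Dockerfile sketch, with hypothetical module and binary names, that compiles in a full toolchain image but ships only the binary on a distroless base:)

```dockerfile
# Build stage: full Go toolchain, never shipped to production.
FROM golang:1.21 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./cmd/app

# Runtime stage: the static binary plus CA certificates and nothing else.
# No shell, no package manager, nothing extra to light up a CVE scan.
FROM gcr.io/distroless/static:nonroot
COPY --from=build /app /app
ENTRYPOINT ["/app"]
```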
Yeah, and that's a great bridge to thinking about security in general, right? Because when people are thinking about images, there's been a lot of focus lately in the cloud-native space on image signing. But I think there's just so much more to the broader problem. Because the truth is, if you sign an image that has bad software in it, your signature is not doing much, right? And so one of the things I've been really excited about lately, that we've put on a lot of our projects, is Dependabot, which will come along and open a pull request to update your dependencies in GitHub. And I think in general, getting things into that PR flow kind of comes back to the GitOps thing: the more you can get stuff to just be, I pressed a button, it merged a pull request, and then magic happened and it deployed, the faster you're going to get people patched and secure. For whatever reason, pressing a button to merge a pull request is just a way lower barrier to entry than somebody editing the equivalent files by hand.

Yeah, we adopted Dependabot and CodeQL in Flux, and especially CodeQL was a really, really great addition.

CodeQL is super cool. I was a little bit skeptical when I turned it on, and it immediately found a security bug in my code. And I was like, okay, we're going to turn this on everywhere. Yeah, it's amazing. And it's free for any public repo; it's all free. And a lot of the rules in there are actually community-sourced, which is kind of cool as well. I think, again, it comes back to the value of having people come together in these ecosystems to produce a unified set of best practices. So I think the next thing I was really thinking about was, what are the pieces of Kubernetes that we should be thinking about in this platform? What does Kubernetes offer? What do we need to add? I joke about the CNCF landscape being an eye chart, and there's, I don't know how many projects on that slide now. So how do we figure out which ones are the right ones?

It's a tough process, right? So you define your delivery processes, and from there you can extract the features that you want. We serve applications to end users, and that has to go over the internet, right? So I have to expose my application on the internet. That requirement materializes itself into: I need an ingress controller on Kubernetes. So first you have to identify the features. But for every single feature, you have a gazillion options in the CNCF landscape, right? There are so many ingress controllers out there, so many CNIs, service meshes, everything. There is no way around it: if you are building your own platform, you have to put effort into it, you have to understand these tools, do your own benchmarks, make your own decisions. That's one way. Another way is to trust someone else with that recommendation. If you are an Azure customer, I bet Azure has opinions, and it supports some of the CNIs, it can offer support for some of them. And in a way that makes choosing easier, because you will be choosing something that you can get support on, and you don't have to maintain it on your own.
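(An aside on the Dependabot point from a moment ago: enabling it is just a small file checked into the repo. A sketch, with hypothetical ecosystems to watch:)

```yaml
# .github/dependabot.yml: weekly pull requests for dependency updates
version: 2
updates:
  - package-ecosystem: gomod     # Go module dependencies
    directory: "/"
    schedule:
      interval: weekly
  - package-ecosystem: docker    # base images in Dockerfiles
    directory: "/"
    schedule:
      interval: weekly
```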
But not all things are supported, right? There are so many components out there, feature sets of Kubernetes that you want, that no cloud will support; I don't know, weird things around networking and all that stuff. So at some point you have to put something in there yourself, and you have to understand what you are installing in Kubernetes. Kubernetes add-ons, what we call CRD controllers, or operators: these things extend Kubernetes, and you have to consider that once you've deployed such a controller, you have to take care of its lifecycle the same way you do with Kubernetes upgrades and so on. Being a controller, something at the heart of Kubernetes, it's very tied to Kubernetes versioning, to how Kubernetes functions. The more components you add to your cluster, the more features you have, but the greater the maintenance burden. So think very carefully before you install a CNI, a service mesh, an ingress, a load balancer, all of that, just to expose a simple app to the outside. You may not need an advanced CNI or a service mesh for simple things.

Yeah, for sure. I've always said the best thing you can choose is the thing that someone else will run for you. Again, sometimes we as engineers don't love making decisions that way. We want to use the thing that's the coolest, or the thing we find the most intriguing, or that's written in a language we like. But the truth is that the ability to call someone up and say, it's broken, please fix it, is just worth way more than anything else. So, when we're thinking about all of this: I think we've talked a lot in the context of a single cluster, but the truth is that there are actually a lot of clusters. We talked about staging and production earlier, but how should we be thinking about the various environments that we might be deploying to?

So I think one of the most important things about your platform is that it should abstract away all the infrastructure complexity of setting up an environment. You should be able to say, I want a new test environment, and not have to install all the things, create a cluster from scratch, and so on. So I think it's important for a platform to offer environments as a service. I don't know if it's feasible to do that for production environments, but you should definitely be able to onboard a new member of your dev team and have them get their own test cluster or dev cluster or whatever, without putting them through all the burden of creating one from scratch.

At the extreme, I mean, you said maybe it's not possible for production, but at the extreme we actually have customers who do blue-green Kubernetes. They do a blue-green deployment, but they turn up an entirely new cluster, install all the software, test it, and then tear down last week's cluster, and they do that on a weekly cadence. And that's, I think, the power of the cloud at some level: you can do that with an API. And the great thing about it is you're practicing your failover, you're practicing your disaster recovery, every single week.

Yeah, and I think GitOps helps a lot here in creating almost identical clusters, because basically it's a directory in a Git repo where you have all the Kubernetes add-ons; everything is in there. So if you make small changes, with, I don't know, a Kustomize overlay or some templating or whatever, you can change the load balancer name or the DNS name and so on, and you can easily create an almost identical cluster next to the one you have. And yeah, that's a great first step toward achieving blue-green deployments in production.
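(A sketch of the Kustomize-overlay approach Stefan mentions: a per-cluster directory reuses the shared base and patches only the environment-specific bits. Paths and names are hypothetical.)

```yaml
# clusters/production-asia/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../base                 # everything the clusters have in common
patches:
  - target:
      kind: Ingress
      name: app
    patch: |-
      # only the environment-specific detail changes per cluster
      - op: replace
        path: /spec/rules/0/host
        value: app.asia.example.com
```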
But I think, from a platform perspective, you should start with being able to spin up test environments or dev environments fast, and work from there toward being able to replace production in a week.

Yeah, and I think that brings up another related discussion, which is, how do you then perform the rollout across these various environments? Whether it's into multiple dev environments or into staging environments, or you need to upgrade your monitoring, whatever it happens to be. Clearly, blasting changes out to the entire world all at once is a bad idea. We've all been there and caused some outages, but we've learned our lessons. So what is the GitOps way of doing staged rollouts? There are many ways of skinning that cat, again.

Yeah, so in the GitOps way, because everything is declarative, you basically have to have some automation that moves changes from one declaration to another. In a really simplistic version, you have two deployment files: one for your staging environment, one for your production environment. You change only the staging one. Then Flux, or whatever is there, picks it up, deploys it, runs the tests for it, and only when the tests are okay does it open a pull request and say, I'm moving this change now to the next file, and I'm bumping the version there. After someone approves it, only then does it go to production. But that's a simplistic approach, right? Because maybe production is not a single cluster; maybe there's more than one, one per region, and so on. And at some point you need some kind of orchestrator on top of the promotion pipeline that understands that a rollout to production may fail, but that maybe you shouldn't roll back, because if out of three regions only one region fails, it's better to freeze it there, fix it, and go forward, than to roll everything back. So you can't have one single solution that works for everything. It's very tailor-made to how business-critical the app you are deploying is, and to how many clusters you are deploying it on. But with Flux and Argo and the other GitOps solutions, there are basically two ways of orchestrating this: you either have a management cluster, from where everything goes out to all the other clusters, or you install the GitOps agent on every single cluster and then you control everything only through the Git repo history. Those are the two approaches, and neither one is perfect. I don't like the idea of having a management cluster, because in my mind that becomes a single point of failure; you still have another cluster that you have to maintain, and so on. But for many organizations that's the solution, because they want a single entity that can actually drive changes.

Yeah, with GitOps on AKS, we've seen a lot of people use tag-based workflows, where they manipulate tags in the repo. And when you said that in the intro, it scared me a little bit to have multiple copies of a file, because I've seen so many cut-and-paste errors, weird cases of, I remembered to fix it in one place but didn't remember to fix it in the other place. So I think it was good in Flux 2, actually, to see the Kustomize work come in, so that you can really have one file, and then, if you need to change the database name or some other small piece of it, adjust just that. So then, the characteristics of application rollout, I think, are all designed around this idea that we're delivering global applications out to the world.
The Azure Kubernetes Service is present in something like 60 different regions around the world, and more every quarter. But we've made this promise that failures have to be local, because otherwise people can't rely on you. And you just mentioned deciding when to roll back, and also the fact that every single change you ever make could break something. We went through a long list of the various ways things have broken, but it feels to me like every single time we think we've found every possible way to break an application, we find a new one. So, given all of that, how do we do the actual orchestration? We've got the mechanism, whether it's tags or a management cluster, but what do you think people should be thinking about as they put their rollouts together?

I think they should look really closely at the app and observe how it behaves when it gets deployed. For example, can you run multiple versions of the same app on the same cluster, using the same database? Can you roll back an upgrade? There are all sorts of dependencies in the deployment pipeline. The most common one is database migrations: say your new version of the app runs a migration and it renames some columns, and after the app runs for a little while, it fails. It has some bugs, and you want to roll back. Well, surprise, the old version does not work, because you have renamed the columns, or removed columns altogether. So you are not able to roll back; you have to manually do a database restore, you lose data, it gets so, so complicated. So I think it's critical to think about all these dependencies and to make at least two consecutive versions backwards-compatible with their data stores, caching, and all the other dependencies. If you can't do that, you'll really have a hard time; it doesn't matter whether you're using Kubernetes or serverless, you'll still hit this issue.

And practice it too, I would say, because even if you think you can roll back, we do things like roll forward in canary, then roll back, then roll forward again. Just to practice it, just to make sure that you haven't introduced some new database, some new file format, some new whatever, where you thought you had it handled but you're now broken, right?

Yeah, another aspect is configuration. Most people think that only code changes can produce bad effects, that that's where the bugs are, because, I'm working on my code, and so on. I've been working on Flagger, which is a continuous delivery tool, part of the Flux project, that does canary deployments, A/B testing, and so on. And early on, people noticed, hey, it's okay that Flagger can detect changes in my app version, but I changed something in a ConfigMap, and that value in the ConfigMap is mounted as an environment variable for my app, and my app went insane because that variable is wrong; it couldn't understand it, crash loop, and so on. And now Flagger treats config changes, be it in Secrets or ConfigMaps, as code changes: it does the canary analysis and all of that. And I think it's important to treat code and config changes as a whole.
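(For context, a Flagger canary is itself just another declarative object. A simplified sketch, with hypothetical names, of an analysis that shifts traffic to the new version in steps and rolls back if the success rate drops:)

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: app
  namespace: prod
spec:
  targetRef:                    # the workload whose changes get analyzed
    apiVersion: apps/v1
    kind: Deployment
    name: app
  service:
    port: 8080
  analysis:
    interval: 1m                # run the checks every minute
    threshold: 5                # roll back after 5 failed checks
    maxWeight: 50               # stop shifting at 50% canary traffic
    stepWeight: 10              # shift traffic 10% at a time
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99               # require at least 99% successful requests
        interval: 1m
```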
And there are some ideas in this direction in the CNCF and in the Open Container Initiative, where what we really want to achieve is to be able to create OCI artifacts that hold your code, as in the container image, but also your configuration. So when you deploy something, you have the configuration along with the code, and they can be tested together, delivered as a single package, and versioned in a single way. You don't have, like, my config version and my code version, and now you have to match them and so on. It's a tough problem to solve, but I think with OCI artifacts we are getting close to solving it, at the packaging level at least.

Yeah, I'm really excited about that, actually. A lot of the work on OCI artifacts has been driven out of one of my teams, and we're really excited about the things you can do with them. I think you're exactly right. I feel like one of the things we didn't quite get right in the early design of Kubernetes was integrating ConfigMap into Deployment more tightly. ConfigMap is kind of a by-product of the fact that you have a pod template in there, but it's not a first-order thing. And I do think that a tighter association between config and code is the right way to go. Additionally, I think we're going to see this a lot with ML as well, where people are going to want to ship images and models, and they're probably going to want to rev them independently, but you're going to run into exactly the same problems: my code expects a certain AI model, and my AI model is a different model than what my code expects, and either it doesn't work correctly and it crashes, or maybe it even just gives bad data back. So I think this is going to be an area where we'll have to do a bunch more in the future.

One second before you go ahead, I want to conclude here about OCI artifacts. So there is a new API specification, called Referrers, which has landed in OCI 1.1, where you now have a way to say: this image references this other artifact in the registry, which is the config, which references this other thing, which is the machine learning model, and so on. So we are getting closer to having, this year, next year, tools that will address this aspect, and we in the Flux project integrate a lot with OCI; it's our focus for the future.

Yeah, for sure. And my guess is it will actually even extend to some of the base image stuff we were talking about, where you could imagine saying, I'm going to have a base Java JDK image, and I might actually attach my jar as an artifact, not necessarily as a layer that I put right on top. Because at some level, the JDK image is something that someone else produced, and my jar file, or my war file, or whatever, is something that I produced. It becomes a little bit more like a buildpack, or like a PaaS kind of environment, and I think that's a really good step forward. I think container images have been great, but they kind of assume that the developer owns the entire container image, all the layers. And what we see in practice is that it's actually an app team or a platform team that owns a lot of the layers, and the developer owns just the last layer, and that's really unclear in the container image as it stands. It makes it harder for a central team to push an update: if you know there's a vulnerability in your JDK image, you suddenly have to talk to all the other teams and get them to rebuild their images, and it's very painful.
You can set up CI/CD to do it, but it's painful, right? And I think it's because we don't have that ability to say, here's one thing, and here's a different thing, and they come together, but they're not smashed together.

Yeah, so you could patch only the base image, replace only that.

Exactly. You patch the base image but keep the jar, or replace the jar but keep the base image; you could do either one. And I think that's going to really help people manage a lot of this stuff in production a little bit better, where you might even be able to centrally patch something like Log4j without talking to every single application team, maybe. That's sort of the goal.

Yeah, first you need to find it.

First you need to find it, so you need the SBOM to find it, and then... yeah, exactly, exactly. So, all right, well, we're pretty much out of time, I think. What do we think in closing?

If you build your platform, if you go on that journey, I think you should be really, really careful about what you're choosing, and if you can choose a managed service, you should be doing that. Of course, it's also a business decision, right? You also have to take into consideration costs and everything. But in the long run, I think the more managed services you get into your platform, the easier it will be for you to maintain it. If you go, you know, bare metal and everything is on you, then you have to be very conscious that you'll need a team for the long run, and a lot of maintenance, and platforms evolve; everything in the CNCF landscape changes all the time, so you have to be prepared for that. The CNI you deploy today will not be the CNI you deploy tomorrow, for sure.

And I think, you know, it's easy when you get a price sheet to know what a managed service costs; it's a lot harder to understand the maintenance and engineering costs of owning it yourself. And I think it's incumbent on us, people who are developing platforms, to be clear with everybody that it's going to cost you, right? Open source is not free as in beer; it's much more free as in puppy. So you should expect that you're going to have to do some work, or, in many cases, it's actually the more efficient choice to use a managed service in a public cloud or with an ISV like Weaveworks. And I think the other piece we said, maybe the corollary, is that every component you pull into your platform is adding cost: it's either adding literal cost, or it's adding time that you're going to have to spend patching it, maintaining it, and operating it, and potentially reliability issues too. So be very careful, embrace the change that Stefan was mentioning, and realize that if you need a component, you can always add it later. So, I hope that was useful for everybody. You can reach me on Twitter or GitHub at brendandburns.

I'm stefanprodan on Twitter and on GitHub. It was great talking to you, Brendan. I had fun.

Absolutely. People will get something out of it, for sure. Take care, everybody. Bye.