Next up, we have Julio speaking about the GitOps Survival Guide, Kubernetes Edition. I'm really curious about this one. The stage is yours.

Thank you. It's very nice to be here; this is my first time at this conference. I work on the OpenShift team at Red Hat. Before that I was at AWS, and before that I was at Red Hat, so I'm a boomerang, as they call me: I went out and came back. Over the course of this decade I've been working on automating software delivery, or GitOps, as it's usually called these days. We're going to talk about why. Well, that's what I have the most experience and practice with, so I thought I should talk about that.

I don't really believe in best practices, so I'm not here to preach the correct way to do things. If anything, I'm going to say the opposite: if you're not doing anything too bad, perhaps the okay practice is the right thing for you at the moment, or even a practice that's temporarily not so good. But you certainly can't do insecure things or unreliable things. That's why I call it the survival guide: what are the things you absolutely must watch out for, what can you perhaps get away with, and how do you get away with some of it? My name is Julio, and I'm @faermanj on Twitter. If you're watching this afterwards or want to send comments or feedback, I'd truly appreciate it. So let's go ahead.

As I mentioned, this is about the continuous delivery process: the process of getting changes from the hands of developers and transforming them into value for customers. Ideally that would be very simple and automatic. We want to make it as fast as possible, because the faster those changes and that value get to customers, the faster we can get feedback, see whether it sells more or less, whether customers are happier or not, and loop on changes. That's pretty much the process behind many services and products at Amazon.

Well, how do we do this these days? How do we automate it? We know that at the speed we release software changes today, it's not possible to do it manually for most companies. Developers usually build on Git and its many flavors and services, like GitHub or GitLab, and that usually triggers events. Of course, that's not the only way; we can also have scheduled events such as nightly builds and things like that. But all in all, a pipeline of some sort, and we're going to talk about possible implementations of that, will build and test the artifacts that are finally going to be delivered to customers. That usually involves provisioning resources on the cloud, like AWS services and things like that, rolling out the changes, making them public and available at least to some users, and then monitoring and cleaning up unused resources.

This sounds very familiar and straightforward to many of us, and I believe that in the conceptual sense it makes a lot of sense to any developer, but as we do it in practice, it's not so simple, right? We have to think about how many clouds we are using, and even what we are calling a cloud, because we say things like multi-cloud, but what do we mean by that? Is it two clouds at the same time that are integrated, or is it the possibility of changing clouds if I ever change my mind? What does that actually mean? Am I locked in by any API or any service that I use on this cloud? And how many clusters or resources am I going to need?
How am I going to build it, how am I going to keep it secure, is it well-architected, how do I keep it safe? This ends up being very challenging depending on the size of your architecture, the application components you use, and so on. But I usually try to keep it simple and sane by just thinking about what the next step in this automation is: how can you make it better, even if you can't make it fully automated or perfect, what's the next thing you can automate and grow from there? So that's the usual process; it's not an all-or-nothing situation. I think it's very important to see the steps in the middle and keep things in a way that you can evolve and progress. We're going to talk a lot about this idea of evolving and progressing these automations.

As I said, a lot of this is going to be provisioning resources on the cloud, so the way we perceive and use cloud services matters a lot in these automations. One thing we clearly see as cloud computing, or mean by cloud computing, is things like the AWS compute services: EC2, ECS, EKS, and so on, you name it. That's clearly what's known as cloud computing. But what about all those servers that we have out there, in corporate data centers, in startups, in office buildings, in antennas? Is that all cloud? Can we make that part of the cloud? Is it the same kind of thing? Can we use the same technology, or is this different?

And what about going even further, into what is now discussed as the edge? For example, cell phone antennas, like 5G antennas, have a lot of software and hardware right in the installation itself. It may sound strange to talk about cloud 5G antennas, but there is actually a service for that, announced at the last re:Invent by AWS; still, operators using physical antennas actually need that hardware on site. There are even projects like satellites, which are becoming more common in startups, in space exploration, and even in education, not to mention self-driving vehicles and a lot of other things that have computing on board. Is that something we can bridge to cloud computing, and can we manage those resources with the same kind of tools?

That's the idea behind Kubernetes as a compute platform, and the flavor at Red Hat, the Red Hat OpenShift Container Platform, is very much focused on allowing installations that are different everywhere but can be operated the same way, from your cloud provider to the antenna, to the car, to whatever has sufficient RAM, down to a Raspberry Pi these days; you can actually run a full-stack OpenShift cluster on a Raspberry Pi 4. Of course, that extends to a lot of more traditional customers, in many different scenarios that we would call well-architected. There's a very famous case in Brazil, the nationwide payment network Pix, that's built on top of Red Hat technologies, and things like the confidential containers we just saw in the previous talk still make sense in that context, right? The idea I'm trying to sell here is that you may want to run this literally everywhere. The service I use the most personally is Red Hat OpenShift on AWS, as I'm more familiar with AWS security and operations, but you could be using it on Azure, if you prefer the Microsoft side of things or have a business relationship with that cloud provider, or even on your VMware infrastructure, for example.
And I have to say, even without touching Red Hat at all, we have OKD, the community distribution of Kubernetes that powers OpenShift, where you can run with a pretty similar set of features on your own hardware, your own virtualization, things like that. I'm not going to dive too much into which is better in each case. If you would like an in-depth comparison, I recommend you check learnk8s.io; they have excellent researchers and training professionals doing and updating that kind of comparison and discussion. But I hope that's pretty much the value proposition of Kubernetes and of delivering applications in this way, right?

The problem is that after you get started and launch your first cluster, when you jump in the water, you quickly face ConfigMaps, security, authentication, authorization, delivery, and many things that will keep you thinking: is this well-architected? Is this secure, reliable, performant? Can I go to production with this? Especially if it's your first application, or if you're just getting familiar with OpenShift and Kubernetes in general, this is very much true for all of us. In the end you get into this balance between being evolutionary, starting from scratch and just building things up, and trying to make everything well-polished and demonstrably good, let's say with tests and so on.

So what I want to say is: think about what are the things you really should care about and just can't leave for later, the things that usually cause problems in these automation scenarios, and the things that I suggest we absolutely don't sacrifice. For us, security is a shared responsibility and job zero, so the security of those automations is very important; that's part of what I'm going to share some thoughts about in this talk. Then, how to roll out changes reliably and demonstrate that reliability. And, more importantly, how to keep collaboration, because what I see is that when we get afraid to make changes, collaboration suffers a lot. That's true in many senses, and I'm going to dive deeper into each of those topics and show you how to automate them, so we don't end up in the situation of "if it's working, don't move it, don't change it," right? We get this tension: on one side we want to change things as fast as possible and roll out new virtual machines with API calls, and we know how easy that is; on the other side, we must keep changes stable, compatible, and tested. So how can we keep this without too much complexity? That's the whole point: striking that balance.

The first thing you have to think about is whether you want just auto-deployment, or whether you also want to auto-provision things. The first step in your automation journey is usually to run the commands that deploy your code on infrastructure that already exists; for example, you already have a cluster and you just have to push those changes. But as automation matures, we probably want to start provisioning new resources automatically: perhaps not change the existing server, but create a new server with the image already baked in. This is a fundamental change in automation brought by cloud computing, because before the cloud you had the server and you didn't have an option; you had to change its contents. That's the way people used to do things.
Right now, you can use another server, flip the traffic to it, and just kill the old one. Perhaps for a little while you have two servers, but the end cost is pretty much the same, so why not provision a new one, right? That's where we're trying to go: of course, the first step is just getting the deployment working, and then provisioning resources with the application already deployed.

That brings us to the idea of immutable infrastructure, and we have many projects around that at Red Hat. The idea is that instead of changing resources, you provision a copy, you provision new resources, and you don't change them. That's very good because it allows you more freedom in many senses, so let me tell you a few benefits. If you can do this approach, and I'm not saying everybody should or can, it certainly makes things a lot easier.

First, you have a reduced need for authentication, authorization, auditing, and things like that. What I mean is, imagine that your server already boots up from an image with everything configured and ready to run a web server, say for a web application. Why would you need SSH, or logins, or a shell, or anything that's not there? If anything fails, you should debug it in the automation, and perhaps in a debug environment you might have logins and access, identity management, identity providers, OpenID Connect, all that stuff; but perhaps you can cut that out just by automating the delivery of those artifacts, right?

It also makes things more reliable. There's no more "who changed this? why did it stop working? did someone deploy yesterday?" You don't change resources, whether containers or servers or even functions, whatever they are; you publish new ones. And that makes it very simple to roll back. As long as you keep the old version, it's perhaps just a matter of flipping the DNS traffic from one name to the other, a CNAME swap on the DNS; that's one way to do it. Or you change the targets on the load balancers, or anything like that. But you're safe: if the new version has a bug or an issue, you can roll back to the previous one, right?

And that means you can push changes from day one. A problem we've seen with many customers is that people take a long time to feel comfortable pushing changes to production, whereas if you have this safety net of simple rollbacks, you can just let people push code to master on the first day. Why not? If you have automation in this sense, it's pretty much safe. It also gives you those fast feedback loops: feedback about features, about new metrics, about whatever it is we're launching.

And it's not necessarily more expensive, because if your application is using auto scaling, and cloud computing in general, and not wasting resources, just keeping to the resources it uses, then as you change and roll out versions, one environment is drained as the other grows, so the sum stays pretty much the same in the end. You do have an overhead, because for a while you may have both environments at the same time, but it's not too much more expensive. And it brings in the discussion of infrastructure as code, right? You don't roll this out manually. You can't go to the web console, click twice, run this shell script; that kind of thing doesn't work. We have to automate it through Git and through infrastructure as code, right?
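To make that blue-green flip concrete, here is a minimal sketch in plain Kubernetes terms, assuming two Deployments already exist with labels app: myapp and version: blue / version: green; the names and ports are illustrative assumptions, not from the talk. The Service selects one color, and switching traffic is just editing the selector and re-applying.

```yaml
# Minimal blue-green sketch (illustrative names): the Service currently
# routes to the "green" Deployment; changing version: green -> blue and
# re-applying flips all traffic back, which is the rollback path.
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: green   # flip to "blue" to roll back
  ports:
    - port: 80
      targetPort: 8080
```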
And again, it's not one single thing. It seems like it's just "build your infrastructure this way and you're done," but it's not; it's a matter of evolving the software delivery process. First, you learn how to make immutable deployments: new resources with everything deployed correctly, using your favorite infrastructure as code tool. Then you acquire that blue-green capability, let's say, to flip from one environment to the other and release changes without the risk of not being able to quickly roll back. As I said, that's the first step.

Then we get to the extensions of blue-green. First, you can have blue and other colors: blue, green, red, yellow, or canary releases, as they're usually called, which is the first idea that comes to mind. Roll out and test the new environment, make sure it works for, say, 1% of the users, then scale it to 5%, then to 10%. Usually that's done with weighted DNS, but again, it could be done with load balancing or other mechanisms. Slowly rolling out while ensuring the safety of operations: that's canary deployments. Further than that, you might opt in classes of users, such as internal users, beta users, or just one city, and route those circles of users to different application versions or flags; that's a way towards ring deployments.

And again, this is not a matter of best or better. It's about thinking which of those approaches is more suitable for your company. Perhaps just having blue-green deployments, with two environments that you can flip back and forth, is all right, and I see many customers being successful with that. For example, in the AWS Elastic Beanstalk service that's just one click, and in Kubernetes it's also very easy, right? Well, I hope the connection is all right; let me know and I can repeat anything. My internet is not too bad, it's a good connection, but sometimes it may flip, or no internet, weather, let's see. Feel free to drop in any question. Thanks for the heads up, yeah.

Let's go towards the more specific advice. Again, you can automate this pretty much any way you want. The one thing I usually say is: don't move data around; try to avoid it. That's the most expensive thing. Perhaps when you do it on your local development machine, copying a database is really quick and no problem, but as the application grows and grows and your database becomes a data lake, moving these things is extremely expensive. That's the problem, for example, with some Helm charts: sometimes you don't want a very direct dependency between your application and your database, because perhaps you want to update your application ten times a day and change where it runs, but your database and data are probably a bit more difficult to move. So that's something we usually keep separate.

That's why I say, when you automate, when you build those scripts, don't try to build a single deploy.sh that deploys everything, or a single GitHub Action; do this tier by tier. The database is something you don't want to redeploy every day; let's say you have a weekly or monthly maintenance window for that, because there are so many applications that depend on the data model. For the API tier, perhaps releases happen every hour; whenever they want, they can deploy. For the app, perhaps you want a different release cycle; for example, if you are published on the iOS App Store or the Google Play store or any app store, they have restrictions on the release frequency you can sustain. Perhaps you want to handle each of these differently.
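As one concrete way to express that gradual canary shift on OpenShift, since the talk mentions weighted traffic, here is a minimal sketch of a Route with weighted backends; the Service names, host, and percentages are illustrative assumptions, not from the talk. Raising the canary weight step by step (1%, 5%, 10%, ...) is the rollout.

```yaml
# Illustrative canary split on an OpenShift Route: roughly 95% of traffic
# goes to the stable Service, 5% to the canary. Increase the canary weight
# as confidence grows, or set it back to 0 to abort the rollout.
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: myapp
spec:
  host: myapp.apps.example.com
  to:
    kind: Service
    name: myapp-stable
    weight: 95
  alternateBackends:
    - kind: Service
      name: myapp-canary
      weight: 5
  port:
    targetPort: 8080
```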
So for all those automations, if you want to have one master Helm chart or one master script that does it all, I think it's very hard to make it actually work, because it will have so many failure modes and conditions. It's probably better to have a very simple automation for each of those cases. And some you may not even automate at all. For example, the CDN tier: if you use a CDN, for example CloudFront from AWS, you can roll that out automatically with infrastructure as code, but most customers just don't.

And again, when you design for this, especially in the Kubernetes world, it means that multi-cluster is a necessity, right? You don't want to run everything in the same cluster. Just one problem that arises, among many, is updating the Kubernetes version: perhaps something prevents you from upgrading the Kubernetes version of the cluster your database runs on, but your application is in another cluster, so you can update that one. It doesn't make sense to lock everything and everyone to the constraints of every system; that's how things get frozen up. So I suggest breaking down first by tier or purpose, as I mentioned: database, network, storage, compute. Also by grade of data, such as development, staging, pre-production, production; this is about the data realm, the data that the application will see. And perhaps only on the application tier, on the UI, for example, you may want different circles of users to have different versions; but again, that's mostly for product tests, for startups, not so much for enterprises. This depends a bit on your context.

And here, thinking about it, there are so many different technologies and possibilities for building this that we usually end up in the paradox of choice. If you don't know that talk by Barry Schwartz, it's brilliant, and the idea is very simple. Anyone who has been to the Cloud Native Computing Foundation website and seen the list of projects knows it: choice is a good thing, it's nice to have choices, but sometimes we have so many that we just get paralyzed by not being able to make the optimal choice. I think that's a very critical problem in automation and GitOps: we end up with "oh, I need this, I can improve with that," trying to make it better and better and better, and the customer is there waiting for whatever new cloud native thing we're trying to bring into the project.

So what I want to say is that for many, many projects, I've seen it work with just a small shell script with YAML processing. If all you need is a simple GitHub Action that changes YAML and calls kubectl or oc, by all means, that's perfect. In this sense, there is the awesome-kubernetes project; it's in the references and the links, don't worry, but it's a very nice one, with tips on the different tools and processors you can use to automate this. But very soon you're going to need infrastructure as code and something like Terraform, my favorite, or CloudFormation, or even CDK: something that you can store and keep building on Git, and that's declarative, so you don't have to try/catch exceptions in each of your commands, because the infrastructure interpreter is going to do that for you; something that lets you repeat those deployments as often as you need. So if you want to roll out red, blue, green, yellow, and magenta environments, that's okay.
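As a tiny illustration of what "declarative and repeatable" means here, this is a minimal CloudFormation-style template sketch: the environment name is a parameter, so the same file stored in Git can be deployed as many colored environments as you like. The resource, bucket name, and parameter are illustrative assumptions, not from the talk.

```yaml
# Minimal declarative IaC sketch (CloudFormation syntax): one template,
# many environments. Deploying it twice, with EnvColor=blue and
# EnvColor=green, gives two independent stacks you can flip between.
AWSTemplateFormatVersion: "2010-09-09"
Description: Per-environment artifact bucket (illustrative example)
Parameters:
  EnvColor:
    Type: String
    AllowedValues: [blue, green, red, yellow, magenta]
Resources:
  ArtifactBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub "myapp-artifacts-${EnvColor}"
```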
Preferably, it should be something extensible, something that gives you a way to plug in your own resources when you want. In Terraform, that would be plugins and providers; in CloudFormation, it would be modules and custom resources; in CDK, it would be constructs. Every platform should have this, so you're not left in the woods when you want to call your own API, or something that only exists inside your company or your context. Something composable, where you can build blocks and, preferably, share them, so the application team can reuse the database module the database team built, and things like that.

And I see many options that are partially managed. If you use Terraform, Terraform stores its state in a backend that can be anywhere you want, so you can use a simple S3 bucket and store your infrastructure state there. That's something you still have to manage, creating an S3 bucket, but that's okay, because the bucket itself is managed by AWS, and that's fine. Of course, if you want it fully managed, you have Terraform Cloud, or the CloudFormation-based services on AWS, which can manage all those resources for you. And finally, these are not things where you can only use one or the other; always keep an eye out for integration opportunities. For example, there's the resource mentioned here in this snippet, aws_cloudformation_stack, where you can call a CloudFormation stack from your Terraform code. So if for any reason, such as there being no Terraform provider yet for a given service, you need to bridge to CloudFormation, you can do that kind of thing. And again, you have the full power of getting things like the branch name, putting in your own tags, and setting up the infrastructure the way you need it.

The most important part of having that as code is being able to collaborate with branches, PRs, and threads, the same way we do on GitHub, and to make that part of our architecture. If I want to add a new server, I add a line in that declarative file; if I need approval, I can ask for it in the PR and use the same mechanisms and techniques we've been using for software, now for infrastructure as well.

And here is GitHub Actions; that's what I use the most, but you have the same on any other platform. These are the different events that can trigger actions. You can have something that runs on a schedule every night; something that runs on a discussion comment, say if someone who is a manager writes "approved" and that triggers the pipeline; something that runs when someone adds a label, like "released" or "tested"; or on a push to a new branch. One I do often: when a new branch with a given name comes in, such as live/ plus an environment name, that will actually deploy a new environment and push that version for testing, things like that. Or on a release, when there's a new release in your GitHub repository. Or you can even run those workflows manually. If you have automation, this is a good one, because remember the classic "works on my machine" problem: you have your version of Python, your dependencies, and you forget to tell the customer they need all that, but you want your automation to run anywhere. So being able to run those manually and say, "well, just open the Actions tab, execute this, and it's fixed," that works; you can test that it works every time, which is convenient, I believe.
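Here is a rough sketch of what such a workflow could look like in GitHub Actions, combining the "push a live/ branch" trigger and the manual run with a minimal "change the YAML and call oc" deploy job. The secret names, the live/** branch convention, the file paths, and the assumption that oc is available on the runner are all illustrative assumptions, not from the talk.

```yaml
# Illustrative GitHub Actions workflow: deploy an environment when a
# branch like live/prod is pushed, or when triggered manually.
name: deploy-environment
on:
  push:
    branches: ["live/**"]   # pushing a branch like live/prod deploys it
  workflow_dispatch: {}      # the manual "just open the Actions tab" case

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Log in to the cluster (assumes oc is installed on the runner)
        run: |
          oc login --token="${{ secrets.OPENSHIFT_TOKEN }}" \
                   --server="${{ secrets.OPENSHIFT_SERVER }}"
      - name: Update the image tag in the manifest and apply it
        run: |
          # derive the environment from the branch name, e.g. live/prod -> prod
          ENV_NAME="${GITHUB_REF_NAME#live/}"
          sed -i "s|image: .*$|image: quay.io/example/myapp:${GITHUB_SHA}|" k8s/deployment.yaml
          oc apply -n "myapp-${ENV_NAME}" -f k8s/
```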
And a very important part, and I think this is the one most people struggle with the most, is where to manage configuration. How do I know the database URL, which environment variables to set, those values, and so on? This is the order I think about it. First, by naming conventions: if the branch has a name like prod/ plus the name of something, I'll provision an environment in the prod circle, bind it to the prod namespaces or DNS names that route to what we refer to as prod, and give the new environment that name. Then environment variables; most tools support them, so when we can't infer the configuration from the naming conventions, check the environment variables. If it's something long, such as a key file, perhaps, or a data file of some sort that you wouldn't put in an environment variable, check the repository content itself, where there will be a YAML file, or whatever file, under an agreed name.

That's when you have a single delivery of your software, just your app, let's say. But when you have multiple deliveries of your software, say like WordPress, where you have one source but a million deployments, you might want to host that deployment content in a separate repo. That's how tools like Tekton and OpenShift CI/CD in general work: you have your source code repository, from which your container image is usually built, and a separate repository with your YAML descriptors for Kubernetes and things like that, which are going to be pushed to production.

And just remember to take special care with secrets. GitHub has its secrets mechanism, HashiCorp has Vault, and Vault has integrations with everything; AWS has Systems Manager Parameter Store, and, if you're rich enough to pay for it, a service called Secrets Manager, which is a bit overpriced in my opinion. All of that is there to set and store credentials and secrets in a way that's secure and won't cause problems such as committing a key to a file on GitHub and ending up with it exposed. That's something you really don't want to do, especially if you're in a regulated situation, where it can have severe legal implications.

And after that, it's just a matter of integrating with observability. You're not done: you still have to pick up metrics, be it CPU, memory, HTTP counts, and things like that, and the logs from the applications, to see if anything is throwing an exception or the like, and raise alarms according to conditions captured by those logs, metrics, events, and traces. You can do this with CloudWatch or Prometheus; there are many different tools for observability. I just added them here as a reminder to do this and to mention these tools.
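Before the failover demo, and just to tie the configuration and secrets points together, here is a minimal sketch in plain Kubernetes terms of keeping ordinary configuration in a ConfigMap and the credential in a Secret, with both injected into the container as environment variables. All of the names and values are illustrative assumptions, not from the talk; in practice the Secret would be populated from Vault, Parameter Store, or a similar service rather than committed to Git.

```yaml
# Illustrative split between plain configuration and secrets.
apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-config
data:
  DATABASE_URL: "postgres://db.prod.svc.cluster.local:5432/myapp"
---
apiVersion: v1
kind: Secret
metadata:
  name: myapp-db-credentials
type: Opaque
stringData:
  DATABASE_PASSWORD: "change-me"   # placeholder: inject from a secrets manager, never commit
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels: {app: myapp}
  template:
    metadata:
      labels: {app: myapp}
    spec:
      containers:
        - name: myapp
          image: quay.io/example/myapp:latest
          envFrom:
            - configMapRef: {name: myapp-config}
            - secretRef: {name: myapp-db-credentials}
```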
But to get at the importance of observability, here is a fault tolerance demonstration from an actual tool called Vizceral, by Netflix, which draws these traffic visualizations, and it shows the kind of approach I'm trying to describe. It starts with three environments, say the Netflix application in three different US regions, and you can see that in us-east-1 the region starts increasing its error rates. When it reaches a high threshold of errors, traffic starts being sent to the other regions, so that they can scale and prepare to take over while this one recovers. And just remember, these days, in prime hours, Netflix occupies around 40% of total US bandwidth. That's a lot of data flowing around, and you're watching your favorite series and you don't even see a flicker, because of this kind of redundancy mechanism. Then things get fixed and rolled back, again because of that capability of rolling back and forth. I'm not saying everyone should have this kind of visualization or this capability of failing over and failing back; again, this is huge. I'm just showing what kind of automation we can build towards, right?

In the Kubernetes world, this kind of automation is usually done by operators, right? You're not going to build dozens and dozens of scripts to roll things back and forth, send these messages, and control those things. Operators are Kubernetes applications that can basically define resources of whatever resource type you name and trigger actions to create, read, update, and delete them. Nothing too hard; there's an SDK for building them. And this is important because OpenShift itself is essentially, at its core, a set of operators for Kubernetes, with additional security, CI/CD, rollout, and all these benefits I've been mentioning; that's why it's cool. When you build these operators, there is a maturity model that guides you through the steps: starting with the basic install, just getting code deployed and running in the cluster, then how to upgrade versions, how to influence the lifecycle of resources, up to the autopilot level where things happen magically. Operators can get really advanced, and there are certainly good starting points, such as the Helm, Ansible, and Go operator types.

So, all in all, it's basically using infrastructure as code, managing operations on resources declaratively, and keeping collaboration through Git; and more than keeping it through Git, keeping it alive, in a way that people are not afraid to touch infrastructure just because it's risky, or it's not automated, or we can only do it every six months, and so on.

A final thought, and this is important for continuous delivery, is to separate the idea of a release from a deployment, right? You can deploy a hundred times a day, but you may only release on Christmas. It's like when your favorite application shows a nice Christmas message: that code was perhaps already there since spring, but it has a little check, or a feature flag, as we say, so that it becomes available only for prime users on Christmas. A long while ago, people used to create new branches for new features, create hundreds of branches, and merge them in very complicated ways, usually known as Git flow and things like that. But since, I think, 2016, the State of DevOps report by Puppet Labs demonstrated pretty clearly that basically committing to master and using feature flags is way more sane and productive. And because of this, you can create those feature flags and deploy that code hidden.
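As a minimal illustration of that deploy-versus-release separation, assuming nothing fancier than Kubernetes itself (real feature-flag systems do much more), the dormant code could read a flag from a ConfigMap like this; flipping the value at release time, with no new deployment, is the release. The names and flag key are illustrative assumptions.

```yaml
# Illustrative feature-flag ConfigMap: the Christmas banner code is already
# deployed, but stays hidden until this value is flipped to "true".
apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-feature-flags
data:
  christmas-banner: "false"   # flip to "true" on Christmas: release without redeploy
```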
And of course there are tools for that, from open source projects to services like LaunchDarkly, that allow you to deploy and release in separate steps and ceremonies, right? If you want to practice the concepts I just mentioned, for all of that we have learn.openshift.com, where you get a cluster and a tutorial and you can build and practice those automations.

Here are the references for this talk. First, I'm just starting to write, so it's very small, just a couple of posts at the moment, but it includes these slides. I've also been building the awesome-kubernetes reference, with lots of commands and cool stuff about automation. The learn.openshift.com GitOps track is specifically about our way of building those concepts, and Learnk8s is a separate company, but with cool courses and material. And finally, my friend Mauricio Salatino is publishing a book, Continuous Delivery for Kubernetes, where you can also find great insights on bringing all of this to reality.

And that's it for my talk. I think we have five minutes if you have any questions, and now or later it would be a great pleasure to keep in touch with you. Thank you so much for attending this talk, and enjoy the conference.