Let me go back. I don't want to be on the About Me slide yet; I want to be on the title slide. All right, thank you everybody for coming today. I am Alex Corvin, and I'm here to talk to you a little bit about handling chaos in containerized environments. Specifically, I'm going to focus on Kubernetes and OpenShift containerized environments. I feel like I'm cutting out, but I don't know.

All right, so real quick, just a little bit about me. I'm a software engineer at Red Hat. I work right now on the Open Data Hub project. You may have heard about the Open Data Hub at some of the other talks; we were giving several this weekend. I focus mainly on the internal Red Hat instance of the Open Data Hub. We create a platform for other teams at Red Hat to perform their data science and analytics experiments and work, and give them a platform for storing all their data. One of the things we really focus on is enabling teams at Red Hat to become, as we say, more data-centric: teach them to do data analysis, teach them to work with big data, that kind of thing. My job primarily is keeping the internal Data Hub stable, making sure it's highly available, making sure we have good uptime, so that the teams that rely on us actually have a data platform they can rely on and use. So in general, I'm super passionate about site reliability, site reliability engineering, DevOps, that kind of thing. I don't really think of myself as a developer; I'm kind of a systems guy, but I like to code, so that's what I do. And then just a fun thing about me: I'm a beekeeper. I get really bored if I don't have a lot of projects, and beekeeping is one of them. It's usually an interesting thing to talk about, so if you want to talk about beekeeping, hit me up.

So, to dive in, here's what we're going to be talking about today. I'm going to cover first what I mean by chaos (we'll get to that on the next slide), and how we would have handled that chaos in more traditional environments that weren't containerized. Then I'll talk about how OpenShift can help you cope with this chaos; I think it has a lot of ways to manage it that are a lot better than how it worked in the traditional world. I'll do, I think, three hands-on demos of that, and talk about two other ways that I won't be able to demo. And finally, I'll make some recommendations on ways I think you can build cultures and teams that plan for this chaos, expect the chaos, and learn to manage and control it. I'm hoping I'll have some time at the end for questions, so we'll see.

So first of all, what do I mean by chaos? I made this kind of cool word cloud; I had fun putting gremlins on there. Basically, what I mean by chaos is that if you've ever run a production system at scale, you probably know that eventually weird things just happen, right? Maybe you know somebody, or know a story about somebody, who accidentally dropped your production database. Or maybe you rolled out a software patch, and it turned out that patch let anybody log in with a blank password. Or maybe somebody thought they were in the staging pre-production environment and scheduled a failover of your edge load balancer, and it turns out they were in production, and it killed all your connections. Weird things happen.
Those three examples are all things that have happened on teams that I'm on, or have been on; not anymore. It wasn't me, I promise. So anyways, weird things happen, right? We all know that we should follow best practices for testing our code. But deadlines are a real thing, and you've got to rush to get software out, and maybe you don't test all the edge cases, and then it turns out some weird non-Unicode character comes in and it just breaks everything, right?

Maybe it's not human error at all. Maybe... oh, I went to the wrong slide, hold on. I was trying to scroll; go back. This is where I wanted to be. Where's my mouse? Okay, I found my mouse. Maybe it's not human error at all, right? Maybe your network is a little flaky. Maybe an AWS region goes down, takes down half the internet, and you can't watch Netflix anymore. Maybe your hardware fails. Maybe Russian hackers DDoS you, because they want to steal your money. And maybe the chaos is not a bad thing. Maybe your system has grown and become really popular, and now you're a millionaire. Maybe you're a billionaire, right? You bought an island, and you're just hanging out on your island in your swimming pool filled with gold, and you forget to scale your system. You're so busy counting your money that you forget about capacity planning, and then you have a weird spike in traffic and it brings down your system. So all this stuff can happen.

And to account for it, I think you can do one of two things. Maybe you hire a team of, like, 500 full-time engineers, and they maintain the system night and day, and everybody's really busy and it kind of sucks. Or maybe you're a team of one, and your phone is constantly buzzing, and you're on call every night and weekend, and you don't get to sleep at all, and your family hates you, and you're just miserable. Neither of these solutions is very good long-term; neither one is going to scale. But if you don't do anything, then your app is just going to succumb to the chaos, right? It's going to fail, and you're going to go out of business.

So what do we do about this? I'm going to tell a story of how this worked in the traditional world, before containers and before OpenShift. Let's say you have an app. You're a cool new startup, you develop the cool new app; maybe it's Instagram, or whatever's going to replace Instagram. So you deploy this app on a server. You're trying to get up and running really fast, so you don't take time for proper configuration management or deployment automation or tests, because those are for losers. And who wants to monitor? Monitoring is for dummies, right? You just get it out there really fast. But it turns out people like your app, so you decide you need to scale it. You deploy a few more instances; you spin up a few more AWS servers or whatever. Now you're highly available, right? You've got four instances, and maybe you buy a load balancer to put in front of them, so now you're redundant, and that's cool. And then it turns out people really like your app, people all over the world are using it, and now you have to spin up multiple data centers. So you're in London and China and Australia and everywhere, and you're really awesome.
But remember, we still haven't taken the time for configuration management; we're super busy doing all this, right? Now you're global, though, so maybe GDPR comes out, or CCPA is coming up if you're in California. So now you've got to start worrying about where your data is located and who can get access to it, and it just piles on more and more process. How do we cope with all this? You still don't have any configuration management; you're still doing everything manually, and it's just really hard to do. So you think, oh man, this is not scaling, I need more people, I can't do this all alone. So you start throwing more people at the problem; you hire a couple of engineers. Now you have more and more people working on the system, and they're starting to commit conflicting changes and roll out different things at different times, and you can't keep track of everything, and it's not working. So you think, oh man, this is not good. Maybe I should run on AWS or something; I'm running everything locally in my own data center right now, so maybe it'll help if I switch to VMs. So you do that. And it's taking more time to manage, it's taking more money to manage, more time and more money, and the chaos continues.

I don't know if anybody's ever dealt with this, but I think it sucks. And I think that luckily there is a better way, and OpenShift can help. That's my little OpenShift hero guy swooping in. I was looking for a hero with a cape, but my icon library didn't have one. So if whoever maintains that icon library is listening: give me a hero with a cape.

Anyways, OpenShift has some really cool features just out of the box that you can use to help with this chaos. As I said, we're going to explore five of them today, three of them kind of hands-on. The first one I want to talk about is monitoring. In the past we had things like Zabbix and Nagios and all this other stuff that worked reasonably well. But there's a new player on the scene called Prometheus. If anybody's used Prometheus, it's really cool, really swanky; I'd say it's becoming the new standard for monitoring applications. What's really nice is that Prometheus integrates just beautifully with Kubernetes, and therefore OpenShift. You can configure it so that you get automatic monitoring, scraping, of your services. If you have a custom Flask app or whatever, you can very easily add custom metrics to it. If you're running a MySQL database or whatever, there are plenty of options for telling Prometheus to just scrape your MySQL server, and you don't have to worry about writing custom metrics or anything; they're just out there, you pull them. Prometheus has those native integrations with all sorts of other things. And there's a tool called Grafana that you may or may not know about; it's been around a bit longer than Prometheus, and it's a really nice visualization layer on top of Prometheus, so you can get a really swanky monitoring suite. Prometheus also has an Alertmanager component that you can configure to generate alerts based on the metrics it scrapes. It can send you an email, it can integrate with Slack or whatever, and you can have it send to PagerDuty if you're using that.
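To make that concrete: the annotation-driven auto-discovery described here is typically wired up with a Prometheus scrape job along these lines. This is a minimal sketch, not the exact Open Data Hub configuration:

    # Minimal sketch of a Prometheus scrape job that auto-discovers
    # Kubernetes services. Job name is illustrative.
    scrape_configs:
      - job_name: kubernetes-services
        kubernetes_sd_configs:
          - role: service    # discover every Service in the cluster
        relabel_configs:
          # Only keep services that opt in via the annotation
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: "true"

With a job like that in place, any service carrying the right annotation gets picked up automatically, which is exactly what the demo that follows relies on.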
It works really well, right? And so real quick, I'm going to try to demo just this, so bear with me. I'm going to see what happens if I drag this over. Okay, I have to escape out of this, I think. Escape, cool. I'm going to do everything over here. Is that big enough? Can you guys see that? Hopefully. I'm going to switch to screen mirroring because I feel like that'll be easier. Does anybody see my mouse? There it is. Displays, mirror. All right.

So here's what we're going to do. This is my command line. If I run oc whoami, I'm logged into an OpenShift server. That OpenShift server is running in a few VMs in AWS; it's OpenShift 4, provisioned a couple of days ago, and it should just work fine. The first thing I'm going to do: remember, I mentioned I work on the Open Data Hub team. One of the cool things the Open Data Hub project can do is deploy Prometheus and Grafana for you and configure it all to scrape metrics from any service deployed in your OpenShift project and display them in Grafana, which is kind of cool. I'm going to open up tmux, because I like tmux, and go into our internal Data Hub repo. This is not yet a public repo, but something like it probably will be at some point. And I'll run the deploy-ODH script. What this does is run an Ansible playbook that deploys the Open Data Hub operator to my OpenShift cluster. It actually creates the OpenShift project for me, which is nice; it's just called devconf-demo. And then it applies a few OpenShift custom resources to it.

So if I open up this guy over here, OpenShift, and go to Projects, I'm already in devconf-demo, that's cool. Oh man, I didn't delete the namespace. That's okay, it's going to work out all right. Now I'm going to delete the namespace: oc delete. I didn't clean up after myself last night, sorry guys; I want you to get the full experience. That's going to take a couple of seconds, so I'm just going to talk through this. What's going to happen is I'm going to reinstall the Open Data Hub operator, which gives you a pod running the operator. Then I'll run another command that creates an OpenShift custom resource for the Open Data Hub operator and tells it that I want to enable monitoring. I can show you that real quick. If I look at the ODH custom resource under my OpenShift object templates, I can see that down here I'm setting monitoring to true; that means I want monitoring. Notice I've disabled JupyterHub and Spark and Seldon and, I guess, JupyterHub again; don't ask me what the difference between those two JupyterHub entries is. But if you want a Spark deployment or a Jupyter deployment, or, coming soon, Kafka, and then Seldon, and ultimately the Thrift server and all that kind of stuff, explore Open Data Hub, because it's really cool and it can give you all that really easily. Today we're just going to do the monitoring piece.

Let's see what's happening with my project deletion. I think that's done. Cool. Let's try that again; now we can see all this from scratch. My devconf-demo project got created again. This time there's nothing in it but an Open Data Hub operator pod. And the next thing I'm going to do is deploy that custom resource we were talking about.
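For a rough idea of what that custom resource looks like, here is a hypothetical sketch. The field names are illustrative, not the operator's actual schema, so check the Open Data Hub operator repo for the real thing:

    # Hypothetical sketch of the ODH custom resource; field names are
    # illustrative only.
    apiVersion: opendatahub.io/v1alpha1
    kind: OpenDataHub
    metadata:
      name: example-odh
      namespace: devconf-demo
    spec:
      monitoring:
        odh_deploy: true       # deploy Prometheus and Grafana
      jupyterhub:
        odh_deploy: false      # disabled for this demo
      spark-operator:
        odh_deploy: false
      seldon:
        odh_deploy: false

The operator watches for this resource and deploys whichever components are switched on.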
So, full disclosure, this is going to fail the first time, because it doesn't properly wait for the Grafana deployment to get created. I'm going to deploy it, and the Open Data Hub operator is going to deploy Prometheus. The operator does correctly wait for Prometheus to come up, because what happens is Prometheus starts up, and then Grafana gets configured to connect to Prometheus as a data source. Obviously, for that to work, Prometheus has to be working first. So the Open Data Hub operator does wait; my crappily-thrown-together Ansible does not, and I didn't take the time to fix it. At some point this is going to become ready. We'll watch oc get pods; I'm waiting for Grafana to switch to ready. This takes a few seconds. All right, there we go. Still just one out of two ready, so I'm going to wait a little longer for two out of two. There we go, all right. So I'm going to run that script again. It's all idempotent, so it will skip some things or report no changes on others. What it will do is create a deployment of a really basic Flask app, plus a Prometheus blackbox exporter, which we can use to run availability checks against any arbitrary HTTP endpoint. And I think that's it.

So now, if I go into OpenShift and open my Grafana route... oh, the other thing my Ansible does is give me a Grafana dashboard, which I just import via YAML. This takes a little while to set up, but here's what's going to happen: I have this other deployment, a demo app, which gives me a route that I can curl, and it just says hello world, right? One thing that's going to be interesting in a second is this hostname: it's the name of the pod. If I curl this over here, it's that, right? Same thing. And hopefully, at some point... yeah. So now you can see that this service is green: it's available. What's happening is that Prometheus is configured to run checks at a regular interval against the HTTP endpoint, and if it's green, it means it's up. And over here, this is the number of page hits: the index route increments a Prometheus counter every time it gets called. The availability checks run every five seconds, and the count is going up.

What's nice, though, is this. The availability checks involved a little bit of hardcoding for these routes; I'm sure you could do it dynamically, I just didn't take the time to. But the metrics being pulled up here for the counts just work, because I told Prometheus to scrape any Kubernetes service that exists. And you can enable and disable that really easily. I'm going to show that, because it's cool. So: oc get service, this guy, then oc describe service... that's not right, stand by. There's supposed to be a... oh yeah, it's an annotation; I was looking under labels, and it's under annotations. By setting an annotation of prometheus.io/scrape: true, I'm telling Prometheus to scrape this. So if you spin up a new application, put an OpenShift or Kubernetes service in front of it, and set this annotation, you automatically get Prometheus scraping, and it's really nice. I think that before Prometheus, with Nagios or whatever, this was a lot more work to set up. Now you deploy Prometheus once, configure it correctly, and it's really easy to get monitoring, really easy to start playing with Grafana graphs and getting metrics for your service, and it's awesome.
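So opting a service in looks roughly like this. A minimal sketch, with illustrative names and ports:

    apiVersion: v1
    kind: Service
    metadata:
      name: devconf-demo-app            # illustrative name
      annotations:
        prometheus.io/scrape: "true"    # opt in to Prometheus scraping
        prometheus.io/port: "8080"      # common companion: which port to scrape
    spec:
      selector:
        app: devconf-demo-app
      ports:
        - port: 8080
          targetPort: 8080

Whether the port annotation is honored depends on how the scrape job's relabeling is configured, but the scrape opt-in is the piece shown in the demo.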
So the next thing: I'm going to go back to my presentation and step forward to high availability. I'm not going to spend much time on the slides, because I'm just going to demo this again. Let me make this smaller, and we're going to watch in the background; that's green right now. I'm going to run oc get pods. This is the one instance of my API running, right? It's in this pod here. And let's say some sort of chaos happens and your pod dies. I'm just going to oc delete pod this, and it kills it. What we should see is that this goes red, and if I run a curl command, I get some gibberish now instead of that nice hello world. So my API died, right?

And before Kubernetes, before OpenShift, if you wanted to avoid this... what's happening here is that I don't have any redundancy, I don't have any high availability. Remember what we talked about: if you want to fix this, you've got to buy a few more servers, buy a load balancer, configure all of that. It's a lot of work. With OpenShift, this is really easy. If you use OpenShift at all, this kind of thing becomes second nature, so it's not an exciting thing to put in a talk; but if you haven't used OpenShift, this is so cool. If I want to fix this, I can run a scale command. In production I would do this in YAML and put it in Git, but this is just easier here. I run an oc scale command, then oc get pods, and I'm going to watch that. Where there was previously just one of these demo app pods, there are now two: one, two. So I scaled up to two, and now I have two pods.

And what I can do now is delete one pod, the same thing I did before: oc delete pod that. One of my pods dies, but this is never going to go red. I can curl it, and it just keeps working. Once that pod dies, I'm going to run my curl command. If I do this a bunch of times, you'll see I keep getting that same pod. Let me watch this, actually. What we should see eventually is this hostname switching to a different pod. There it goes. So what happened was: I deleted one pod, all the requests went to the remaining pod, and at some point OpenShift realized, hey, there's only one pod here, there are supposed to be two, and it spun up a different one. Now OpenShift is round-robining me between them. So with one command, scaling up from one to two, you have more high availability. Usually you'd want at least three. But it's just super easy in OpenShift, super easy in Kubernetes, and it was really hard in a traditional environment. So this is a really cool thing.
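For reference, the declarative version of that fix, the YAML you'd put in Git, is just the replica count on the deployment config. A rough sketch, with an illustrative image name:

    apiVersion: apps.openshift.io/v1
    kind: DeploymentConfig
    metadata:
      name: devconf-demo-app
    spec:
      replicas: 3                 # three pods behind the same service
      selector:
        app: devconf-demo-app
      template:
        metadata:
          labels:
            app: devconf-demo-app
        spec:
          containers:
            - name: app
              image: quay.io/example/demo-app:latest   # illustrative image
              ports:
                - containerPort: 8080

The imperative equivalent is a one-liner like oc scale dc/devconf-demo-app --replicas=3.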
Back to my presentation; I think the next thing is hardware placement. I was going to switch back to extended display, but I guess I won't. I can just talk about this for a second, because this is one I don't have a demo for. It's about where your pods run. Again, with traditional applications, you want high availability, and you want to survive chaos like network failure or hardware failure. So you deploy each instance of your app keeping in mind which rack it's on, which top-of-rack switch it's behind, which row it's in, which data center room it's in, which availability zone it's in. It's a lot to keep track of, and just keeping track of it is a big job. Then you have to manage where your applications are deployed so that they're spread across all of that; so if an AWS region goes down, you're still running in another one. It's a lot of work, and it's hard to do.

With OpenShift, you still kind of have to keep track of that underlying hardware topology, but the placement is really easy. There are two features I want to talk about. I think it's two; I'm not looking at my notes, we're doing it live. One feature you can take advantage of is pod affinity, or anti-affinity. Let's say, in our API example, we had two pods, and if one node goes down, you want your API to stay up. What you can do is tell OpenShift to apply an anti-affinity rule to those pods so that they never get placed on the same node. Then you can lose a node, and maybe you lose one pod of your API, but you don't lose both; OpenShift automatically spins up another one, and you're good. Or maybe you have an API and a database and you want really low latency between them, so you want your API pods to run on the same node as your database. You can do that too, with an affinity rule rather than anti-affinity, and tell the pod to always run on that node. So that's one thing: affinity and anti-affinity. If you're interested, check it out, Google it.

The other thing you can do is node-aware placement. OpenShift gives you the ability to assign arbitrary labels to your underlying hardware nodes, so you can do whatever you want: a label for the availability zone, a label for the hardware version, a label for the top-of-rack switch, whatever. And then you tell OpenShift to use those labels when deploying your pods. Again, you can do affinity to node labels, so you can say things like: don't run these two pods on the same underlying node with these labels, or whatever you want. It becomes really easy to manage the placement of applications relative to your underlying hardware topology. Doing that without a tool like Kubernetes or OpenShift was a lot of work; it was just hard. If it was bare hardware, without a VM, you had to pick the machine, and you had to have these massive maps of your data centers and know where everything was deployed. When you have thousands of servers, that's hard. If they're VMs, that's another layer of complexity, because you have to keep track of where the hypervisor is. It's just a lot of work, and Kubernetes and OpenShift make it really easy. I couldn't really demo that, because it's hard to get a test environment with multiple nodes.
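To make the placement idea concrete anyway, here's a rough sketch of a pod spec combining an anti-affinity rule with node-label placement. The zone label and image are illustrative, and in practice this would live in a deployment's pod template:

    apiVersion: v1
    kind: Pod
    metadata:
      name: api-replica
      labels:
        app: devconf-demo-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: devconf-demo-app           # never co-locate two API pods...
              topologyKey: kubernetes.io/hostname # ...on the same node
      nodeSelector:
        zone: rack-a    # only run on nodes labeled zone=rack-a
      containers:
        - name: app
          image: quay.io/example/demo-app:latest  # illustrative image

The node labels themselves are whatever you assign, for example with oc label node <node-name> zone=rack-a.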
Anyways, scaling is something that I can demonstrate; specifically, some of the Kubernetes and OpenShift auto-scaling features. I don't remember what's on this slide, but I think the exciting thing here is going to be the demo, so that's what I'm going to do. I have another script here called deploy-autoscaler. OpenShift has the ability to automatically scale your applications up or down based on memory consumption or CPU consumption. You have to have the metrics component or whatever enabled in your OpenShift cluster, but I think that's becoming pretty standard with OpenShift 4, so it's on in mine.

So I run deploy-autoscaler. What this does is deploy, I think, the same basic Flask app deployment I had before, but it also deploys what's called a horizontal pod autoscaler. If I run oc get hpa, you can see I have this devconf-demo autoscaler, and it's going to watch my devconf-demo-app deployment config, with a minimum of three pods and a maximum of ten pods. So if I run oc get dc devconf-demo-app (I'm not going to do describe, I'm going to do get), notice that desired is three now. Remember, we only scaled it up to two before, but I deployed a horizontal pod autoscaler, told it to keep a minimum of three pods, and it just did it: it scaled up to three. So that's kind of cool.

One thing I can do is go over to this Monitoring tab in OpenShift, see some dashboards, and try to get the memory consumption for my app. This is probably what I want; I've got to select my namespace. If you haven't played with this, it's really cool. This is, again, dashboards on top of the Prometheus built into an OpenShift 4 cluster, automatically pulling metrics for your pods. But I want to find my devconf-demo namespace. Stop clicking and scroll up. All right, in here I can find my pods' memory usage. You can see devconf-demo has a request of 500 megs, a limit of 500 megs, and it's only using like 35, 37 megs right now; not very much.

What I'm going to do, though, is hit the load endpoint of my really fancy API. What this does is generate about 100 megs of random binary data in Python and store it in a Flask global variable list. Don't ever do that, it's terrible. But it was the easiest way I could come up with to simulate memory usage. And what we're going to see, if I run this a few times, is this memory usage start to trickle up; it should spread across multiple pods. I'm going to run it a couple of times. Normally, what would happen is that ultimately your pods run out of memory, the OOM killer runs and kills your pods, and, OOM kills being hard to debug, you'd have instability, you'd have downtime, right? But let's see. Yeah, now it's using 133 megs, and there's some latency here, so it takes some time to catch up. If I run oc get hpa again, it's saying that I'm using 13% out of my target of 25%. What I'm telling this horizontal pod autoscaler to do is: if this deployment config starts using more than 25% of its memory target, start scaling it up, to a max of ten. So if I curl this guy a few more times... I should be doing this with curl and not in the browser, because of sticky sessions. What was happening a second ago is that all the requests were probably going to one pod, and I want them spread evenly. Look, now we're up at 32%, so in a second this should go from three to a higher number. I'm going to do this a couple more times. Look, now it's at four. So, describe, no, just get. Again, you can see there are now four pods, and that would just keep happening automatically, up to ten pods.
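The autoscaler from this demo boils down to a few lines of YAML. Roughly this sketch, matching the minimum of three, maximum of ten, and 25% memory target shown above (the exact API version depends on your cluster):

    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    metadata:
      name: devconf-demo-autoscaler
    spec:
      scaleTargetRef:
        apiVersion: apps.openshift.io/v1
        kind: DeploymentConfig
        name: devconf-demo-app
      minReplicas: 3
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: memory
            target:
              type: Utilization
              averageUtilization: 25   # scale up past 25% of the memory request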
And comparing this to before OpenShift, before Kubernetes: I ran a production environment on a bunch of VMs, and this was my dream to get to. We had really common usage cycles, and I'm sure a lot of you are familiar with this: during the day traffic peaks, at night it goes down, maybe it follows a global pattern of time zones. And the kind of lazy way to handle this, which I think a lot of us do, is to just scale your application up really high. You account for the peak, and then maybe you double your peak, and that's capacity planning, right? You're just super over-allocated, and your AWS budget gets massive. What you really want is auto-scaling that takes into account what your actual usage and load are. That's the dream, and it is hard; it is hard to do with VMs, and it is really easy to do with OpenShift and Kubernetes. That took two minutes. That was super easy. The limitation, at least as far as I'm aware, is that it's currently based only on CPU usage or memory usage. But if your workload fits within that box, this is so easy to do, and you should play with it. It's called horizontal pod autoscaling. Play with it.

All right, we have 14 more minutes. I'm going to go back to the slides now. Bear with me, because I do want my notes for the rest of this; I'm going to unplug for a second. All right, we're back in action. So that was scaling, and I think I talked about everything I wanted to. Come on, mouse, where are you? I locked my screen. All right, I'm good now.

So the last specific OpenShift feature I want to talk about is complex rollout strategies. This is really designed to address... when I was talking about examples of chaos before, remember I mentioned upgrading your software and maybe introducing a bug. A really common way of addressing that is to leverage something called a blue-green deployment. Basically, with a blue-green deployment you maintain two instances of your application: you have the old application that you know works, traffic's going to it today and everything's good, and you roll out the new version. The easy way to do this is to just send all of your traffic to the new version. But that can be dangerous, because what if you have a bug? Then you break all of your users, and that's no fun. So with a blue-green deployment, what you should do is run both instances in parallel and migrate some subset of your users to the new one. With vanilla OpenShift, you can deploy multiple instances of your deployment config and put a service in front of each. You have your main route; you create a second route, point your dev team or your internal users or whoever at that new route, test the application, and verify it works. When you're good, you point your official route at the new service. Done.

There's a thing called Istio that I would really have liked to have time to actually demo here. Istio is a service mesh that works with Kubernetes. Vanilla OpenShift is kind of manual for this, a little bit clunky; with Istio, it's just built in. You can do really cool things: spin up your new application and send a specific percentage of your traffic to the new deployment, or, if you have different global regions, send this region over here and that region over there. You can orchestrate the flow of traffic to make sure you're evenly or fully utilizing your resources. You can do some really cool stuff.
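The Istio percentage-based split looks roughly like this. A sketch with illustrative host and service names:

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: demo-app
    spec:
      hosts:
        - demo-app.example.com          # illustrative external host
      http:
        - route:
            - destination:
                host: demo-app-blue     # current, known-good version
              weight: 90
            - destination:
                host: demo-app-green    # new version under test
              weight: 10                # send 10% of traffic to green

Shifting traffic is then just a matter of editing the weights, rather than re-pointing routes by hand.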
So, closing out this slide: consider doing blue-green deployments if you're not, because they're a good way to make sure you're upgrading safely, things like that. And if you really want to be cool, play with Istio, because it's kind of cool and I want to play with it more.

So that's what OpenShift can do to help. I do think, though, that while knowing your tools can really help, a lot of this starts with the team. In order to really take this chaos head-on and address it, you need to change your team culture, and there are two or three things I would specifically recommend. One is that you have to change what it means for work to be done. I harp on this with people on my team: an app is not production unless it's monitored, unless you have a plan for scaling, unless you can deploy it repeatably and easily, unless it's documented, and unless you have alerting and the team knows what all of that is. My team is probably really tired of me saying that. But the alternative is: you do a prototype, you get it working on somebody's laptop, and then you say, okay, let's deploy this to production. Just don't do that. It doesn't work, right? There's all this extra stuff that goes into running in production, and you have to have a plan for it. I think that as teams, we have to really adopt that mindset and hold ourselves to a higher standard. And this can be hard, right? You have deadlines; you have management saying, this was really easy to get working on your laptop, why isn't it in production yet? I think this takes buy-in from multiple levels: buy-in from the team, work from the team, buy-in from leadership. It's a culture we just have to fix. So that's the first thing.

The second thing is to embrace the DevOps model. In the old world, a developer would write the code, wanted only to write the code, and didn't want to care about what happens in production; they'd just throw it over the wall to the ops team. And this is DevOps, right? If you've read any DevOps book (The Phoenix Project is a good one, by the way), this is it: you have to get the people writing the code to at least know what's going on in production. Then they can start thinking about monitoring, and maybe instead of writing some behemoth application that's impossible to monitor, they start thinking: how do I build Prometheus metrics into my app? How do I architect this in a way that's really easy to scale or monitor? You can still have ops teams and dev teams, but get them together and talking, and you'll build better apps that are more fit for production as a result.

And the last thing, and I do have a little bit of a demo for this, it's kind of cool, is fire drills. We do them in school, we do them in our offices, and they can be really annoying because you have to leave in the middle of a meeting. But I think they're important, because if you simulate this chaos in your world, you get really good at handling it. You know what to do, everyone on the team knows what to do, and it exposes the areas of your applications that are not yet robust or resilient.
And there's a cool tool for this that I'm going to demo, called kube-monkey. There's a tool out there, I think Netflix invented it, called Chaos Monkey. It's an app that you can unleash on your production environment, and it will just break stuff. Somebody wrote a version of it for Kubernetes called kube-monkey, and what it does is randomly kill pods. So it's kind of cool. I'm going to go back; I think the next slide is just questions. So let me mirror my displays again, and we won't need to go back and forth anymore.

I'm going to go over here, and I have another script called deploy-kube-monkey. What this does is deploy a kube-monkey pod configured in debug mode, and it runs every, I guess, 15 seconds. Normally, you configure kube-monkey to run at a set time every day, you give it a window of time during which it's okay to kill pods, and you give it parameters for how many pods in a deployment to kill. In debug mode, you tell it to run at a regular interval and kill a pod every time it runs; normally there's a scheduling and weighting element to it. So it's designed to be deployed in your production environment and just turned on, and applications opt in to it: it doesn't kill a pod unless the application has specifically opted in. I think you can really leverage this on your teams. Deploy it, let every team use it, and tell teams they can opt in, with parameters they set themselves. Say you have a really important app that's not ready to be brought down; on my team, that's probably Elasticsearch, and I don't want those pods randomly getting killed. But if you have an API that is, in theory, resilient and fault-tolerant, enable it. Pods will just get randomly killed, and you can test your monitoring, test your alerting, test what happens to users. You get that for free; it becomes the norm, and your teams just come to accept it, right?

So what this is going to do is deploy kube-monkey configured in debug mode, and deploy another instance of the basic Flask app we've been showing off. It deploys another instance because the one you've seen so far used an OpenShift deployment config, and from what I could tell, kube-monkey requires a Kubernetes Deployment, which is slightly different from a deployment config. If I go over here, you'll see I now have this kube-monkey pod starting up, and then this kube-monkey-victim pod. If I look at this... I think I can do this in the UI. Yes, I have these labels. Can I make this bigger? Maybe I can expand the window; this is my first time using OpenShift 4. All right, I guess this works. So kube-monkey/enabled is set to enabled; that means I'm opting in. You give it an identifier. You give it a kill mode, which is fixed or random: how many of the pods kube-monkey should kill. I'm telling it just one at a time. Mean time between failures is how often pods should get killed, but in debug mode that's kind of ignored. Anyways, if I go over to my kube-monkey pod, here it is, I can look at the logs. Man, this is really small. I could do this in the command line, but I'm already here. You can see it's running every 15 seconds, identifying this kube-monkey-victim deployment, and it's going to kill it, right? So every 15 seconds, my pod's getting killed.
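For reference, the opt-in is just labels on the Deployment, roughly like the sketch below. These label names are from memory of kube-monkey's conventions, so double-check them against its docs:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: kube-monkey-victim
      labels:
        kube-monkey/enabled: enabled            # opt in to pod killing
        kube-monkey/identifier: kube-monkey-victim
        kube-monkey/mtbf: "1"                   # mean time between failures, in days
        kube-monkey/kill-mode: fixed            # kill a fixed number of pods
        kube-monkey/kill-value: "1"             # one at a time
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: kube-monkey-victim
      template:
        metadata:
          labels:
            app: kube-monkey-victim
            kube-monkey/identifier: kube-monkey-victim
        spec:
          containers:
            - name: app
              image: quay.io/example/demo-app:latest   # illustrative image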
Again, in production you would probably not run it every 15 seconds, but you could. And this kube-monkey-victim... actually, I don't want to do that. Let's go back to pods. What we're going to see is that every 15 seconds this pod gets killed and automatically recreated. And as an example of the kind of thing you can put in place to watch this: this other line is green now, Prometheus is doing availability checks on my app, and every 15 seconds or whatever, it's going to go red. So I could watch that and see: oh, my service is intermittently available, I should fix that. What can I do? Oh, maybe I should scale it up; there's just one pod now, and we already know how to scale up to two, that kind of thing. Nobody enjoys doing availability testing on their service, it's kind of boring, and this just does it for you, in a way. I think it can be a really good tool in your tool chest for building more resilient, more chaos-resilient apps, right?

So I think that's all I have, and as promised, there's time for questions. I'm going to go to the resources slide, though; I have to find the resources slide. I'm going to work on this, but I guess I can open up the questions. If anybody has questions for me, I'm all yours. Yeah, here are the resources. This has links to the code that I used for this, to the Open Data Hub project, and to kube-monkey and Istio, which I talked about a little bit. Questions? Yeah?

When you delete the pod, what happens to the process running in the pod?

So the question was, I showed deleting the pod, right: what happens under the covers? Is it a hard or a graceful kill, that kind of thing? I actually don't 100% know the answer to that. I think it's something you can configure with OpenShift; you can do startup and shutdown commands and that kind of thing, and you can kind of build it into the app. But I'm really glad you asked that question, because it reminded me of something I wanted to talk about. There are multiple deployment strategies you can specify: recreate versus rolling. The difference there is that with recreate, it will spin down your old pod and then spin up the new one. Maybe you have a database or something like that that you can't really run multiple instances of, because it's talking to the same underlying files; you'd want to delete the old pod before you spin up the new one, so there's a little bit of downtime, right? With rolling, it spins up the new pod, then spins down the old pod, so you don't have that downtime. So that's another thing you can tweak to handle this kind of stuff. And yes, there are ways to specify that kill mode.

Okay, and I guess the same thing with kube-monkey, where I would assume it should be catastrophic: just turn off the lights, the pod's gone.

Right, that's my assumption. I'd have to look into it more to figure out exactly what's happening, because I agree, that's what you want. Maybe there are different ways. kube-monkey's open source, and that would be an interesting thing to contribute: a way to specify how it kills the pod, because then you could test whether your app is configured to die gracefully, that kind of thing.
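Concretely, both of the knobs from that answer live in the deployment spec. A rough sketch with illustrative names and values:

    # Sketch of the rollout strategy plus a graceful-shutdown hook.
    apiVersion: apps.openshift.io/v1
    kind: DeploymentConfig
    metadata:
      name: demo-db
    spec:
      strategy:
        type: Recreate        # old pod dies before the new one starts
        # type: Rolling       # new pod starts first, then the old one drains
      selector:
        app: demo-db
      template:
        metadata:
          labels:
            app: demo-db
        spec:
          terminationGracePeriodSeconds: 30     # time allowed to shut down cleanly
          containers:
            - name: db
              image: quay.io/example/demo-db:latest   # illustrative image
              lifecycle:
                preStop:
                  exec:
                    command: ["sh", "-c", "sleep 5"]  # illustrative drain step

On pod deletion, Kubernetes runs the preStop hook and sends SIGTERM, then force-kills with SIGKILL only after the grace period expires, which is the window an app has to die gracefully.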
And I'm really glad you introduced this, because I wasn't aware of kube-monkey for Kubernetes. This is massive, because it enables anybody moving to a Kubernetes or OpenShift environment to start adopting some of the things that are commonplace at Netflix and that nobody else, that I know of, is really doing. I've run into a few companies that are supposedly doing the Simian Army stuff, the Chaos Monkey stuff, but nobody else.

Yeah, and it was really easy to get running on OpenShift. The one thing that took me a while to figure out was that I was trying to deploy the app as a deployment config, and it wouldn't work; finally I changed it to a Deployment, and it did work. But once you're running, it's so easy to opt in. You just set those labels on your deployment, and now that service is going to get killed regularly, right? Super easy. So yeah, we used these tools for traditional deployments; we should be using them for OpenShift and Kubernetes too. Yeah?

So I have a question about introducing kube-monkey. Do you think it's also a good way to use kube-monkey to kill your pods based on a certain condition? I know you said it will kill something every 15 seconds or whatever, but apart from the time condition, could it be based on, say, a health check or anything like that?

Yeah, so what I know kube-monkey can do right now is let you specify the mean time between failures (or maybe it's minimum, I don't know) on your pod deployment configuration. So you say, kill this every one day, two days, whatever you want. And when kube-monkey runs, when you're not in debug mode, it decides whether this is eligible for being killed, and then there's a random coin-flip thing it does to decide whether or not to kill it now. So there's a little bit of that logic there. I don't know if you can do anything more than that, like you described, but again, kube-monkey is relatively new, and I think that'd be a really cool thing to contribute to it. Yeah?

Hi, Louis. Hey, Alex. Do you know what the plans are for extending the auto-scaling capabilities? Are operators the answer to that?

So the question was: what are the plans for extending the auto-scaling capabilities, and will they use operators or whatever for that? Again, I don't really know the answer. I don't have a lot of insight into the development of this; I'm just somebody who's excited to use it. I mentioned that the auto-scaling right now is limited to scaling based on CPU usage or memory usage. I would really love to be able to run a custom script that decides whether or not to scale up, and I could do it based on message queue depth or something like that. I don't know what the plans are for that, but I hope it becomes a thing, because yeah, it'd be awesome.

All right, well, I think this went until 10:35, so I'm technically two minutes over. All right, thank you, everybody. Hope you liked it. Thank you, Alex, and before we disperse, I'd like to.