Is this mic working? Yes, it is. I can hear myself. Yes, sorry about that. Got a lot of technical conferences with technical problems. So we're here for... we accidentally created a serverless application. Hi, I'm JJ, developer advocate for IBM Cloud and IBM. I don't know what that means, so if you do, please tell me. But I've been doing this for about four years. I really do have the email address awesome@ibm.com, and if you do at IBM, that's really impressive. Also you can find me on Twitter at jjasghar.

So this whole talk is basically the story of me accidentally creating an application. Who here considers themselves a real developer? Who considers themselves an operations person, like runs the software or whatever? Who runs serverless applications? Okay, cool. Okay, we're going on one and two. So this is the story of me kind of going along that path. I want to get through the slides as quickly as possible because I want to actually show you the code that we wrote to actually get it working, which is really, really surprising. Who knows Python here? Everyone should know Python, yeah. You're going to laugh at my Python, but the funny thing is it actually works. That's the best part.

So you're probably asking yourself, how do you accidentally create an application? It doesn't make much sense. Unfortunately, I can't see my notes, so I forget some of my jokes here and there, but that's fine. Normally when you start an application or some project out there, you have a plan. You don't accidentally fall into an ecosystem where you have to build something. Normally you have JIRA tickets and things like that. But I'm an old school ops guy. I was a DevOps engineer, an SRE, whatever you want to call it, for a while. Eventually I became a developer advocate, but I still have that same mentality where I kind of just use duct tape and bubblegum to work on stuff.

So our problem stems from something that happened in the pre-COVID times. It kind of went into the COVID times during lockdown; it wasn't so successful. But as a developer advocate, what I do is I run workshops, right? I try to teach people how to do things, because this technology is hard. Talking about OpenShift or Kubernetes, if you have no frame of reference, it's really, really hard. And we discovered at IBM that opening up a workshop and just giving you a Kubernetes cluster or an OpenShift cluster was a great way to learn, because people need to get their hands dirty, right? It's really hard to understand containers if you've never built a container.

So the problem was we needed a way to spin up clusters, a lot of them. Before COVID, I would go out to usually DevOps Days, because I'm really deep into that environment, and we'd spin up 30, 40 clusters to teach people OpenShift or Kubernetes for free. The problem is, have you ever actually spun up an OpenShift cluster before? Yeah, you got one. How long did it take you? 50, 60 minutes, okay. So that's a relatively good run. Being that I work for IBM, we have to use something called the IBM Cloud. Who knew we had a cloud? Hey, that's actually more people than normal. That's supposed to be a joke. Anyway, on the IBM Cloud, our clusters are designed to be production ready. They're not supposed to be ephemeral. They spin up and you're supposed to iterate inside of them.
The challenge though is that all of those clusters, because they're production grade, actually take anywhere from three to eight hours to spin up, mainly due to stuff I will talk to you about over a beer, not on stream. Let's just put it that way. So imagine now, as you start to roll your head back here, if I have to spin up 30 clusters, or 40 clusters for that matter, for a workshop and it takes eight hours and there's issues around it, possibly some things don't happen, all that kind of jazz, you start seeing the problem. And I'm iterating these things on a weekly basis, right? Because if you install a new operator inside of OpenShift — you have kubeadmin access to that cluster — it just, you start seeing the problems start arising around it. And that's what this talk is about: how we actually got around this problem.

So of course, as developers, and at this time I had maybe three people helping me half time, because we also had to be on the road, obviously, doing the workshops and things like that. So we kind of had three people thinking about this, and we started with something that we all do as operators: we use Bash. The IBM Cloud, just like AWS or Azure, has a CLI; it's called ibmcloud, and you can just run a bunch of Bash commands against the IBM Cloud. And this is actually really close to the original script. And if you know anything about Bash, this is bad, very, very bad, but I literally ran it on my laptop the night before, or two days before, the workshop, and it was supposed to run overnight. And as you can see, it pulled out some VLAN information, it created a cluster in a specific data center with the public and private VLANs, because that's a big thing inside of IBM Cloud, and then it checked if it succeeded or not. Basically fire and forget, right? That seems reasonable, right? Cool, whatever. It's a simple Bash script.

Obviously it's not scalable, because only one person can run this, because we might have name collisions. For instance, my friends in Japan for the longest time had all their workshops called tech dojos, right? Well, if I decided to spin up tech dojo 001 and my friend in Japan decided to spin up tech dojo 001 at the same time, we have name collisions and issues around that too. So obviously this is not great. So we have to, yes, please. Oh, it is, okay, cool. Well, then I will. Perfect, I will do that then. While I'm doing this, I'm also a streamer, so I'm very used to pivoting my conversation very quickly. So I'll try to look straight ahead, but unfortunately I'm an animated speaker.

Anywho, so obviously over time we needed a better way to, so to speak, categorize our clusters. And as you read this, I was now able to run this all with a little bit more advanced Bash. I used functions, ooh, okay, it's scary, functions in Bash. Kind of fun stuff. Now, if you didn't notice, there's this little sleep command in there. This is a few years old now, so I'm allowed to talk about it. Hopefully nobody on stream is gonna record this. That's another joke. If you notice, there's a sleep for random seconds there. The first time I ran this specific script, I took down IBM. I actually have a printed-out incident report from the SREs that run IBM Cloud, because they were like, what the fuck just happened? Basically, I overran our API because I sent too many requests. I was like, I'm not gonna say the number because it's embarrassing.
But the point being is I just shot all these requests at the API asking for all these clusters, and it overran the etcd instance that was keeping track of everything. And then all of a sudden there was, I think the term they used was back pressure, on the rest of the clusters, which caused a cascading issue across the whole infrastructure. According to one of the product managers, they were like, JJ, you're the first person who hit this problem, we should have never seen this problem, and you have now officially load tested IBM Cloud, congratulations. So how I got around it was that little sleep, which, in essence, waits a random number of seconds. Unfortunately, now I couldn't just shotgun all of this out at one time; I had to wait even longer. Now, if you imagine every single request going out and everything happening, I would have to kick it off almost two days early to make that happen. As you notice, there's another sleep for 60 seconds in there, because tagging is a completely different API that we have to hit. And eventually we started leveraging our tagging API to categorize these. Anyway, let's keep going.

So obviously Bash is not sustainable, right? It's just not. So we needed to use something a little bit better. We came along and we found this Ansible IBM Cloud community plugin, and it was shit, not gonna lie. It didn't do anything we needed. It was very, very not good. I can't even go farther than that. It had some very low impact usage for the IBM Cloud, and it didn't even do clusters, which was one of the biggest problems we had. We had to spin up Kubernetes clusters, and it didn't have the resources to do that. We could have extended it at the time, but with only three or four people working half time on this, it wasn't worth it. So we tried, it didn't work, we continued moving on.

So of course we come to this thing called Terraform. Who knows what Terraform is? Who uses Terraform? I got a couple, cool. So of course, eventually, every single operations person or DevOps person or whatever you wanna call it ends up at Terraform at some point. And this was great at the time, because now we had actual state, and we were able to leverage the way we handled the state. And it turns out, for whatever reason, the powers that be at IBM put more money and investment into our Terraform provider than our Ansible provider. So we had support for everything; it was awesome. Unfortunately, we ran into some other issues that, again, I can't talk about here with our Terraform provider, but we had to move away from it.

So we went back to Ansible. And we ended up, we actually ended up writing our own Ansible provider to talk specifically to IBM Cloud to request clusters. So basically we took our Bash script, turned it into Python, fixed it, and turned that on. Because we needed a way to programmatically shotgun all these requests out in a way where we can have known state over possibly eight hours of time, to see if a cluster actually comes up correctly. Because by this time, we were actually starting to do some real post-provisioning with our stuff too, where if you wanna install OpenShift and you wanna have operators already installed, so people can learn how to use Tekton or whatever, there's an added bonus to having the operator already there compared to someone having to install it. Granted, it was back and forth, depending on what the teacher or the instructor needed, but we needed a way to do that.
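Just to make that concrete before moving on: here's roughly what that create-then-tag loop boils down to once you port the Bash to Python. This is a hedged sketch, not the real module; the cluster names, zone, worker count, and flags are illustrative, so check `ibmcloud ks cluster create classic --help` and `ibmcloud resource tag-attach --help` rather than trusting them.

```python
# Illustrative sketch only: shell out to the ibmcloud CLI the same way the
# original Bash did, jitter the requests so the API doesn't get flattened
# again, then tag each cluster through the (separate) tagging API.
import random
import subprocess
import time

CLUSTERS = [f"tech-dojo-{i:03d}" for i in range(1, 41)]  # hypothetical names

for name in CLUSTERS:
    # Fire-and-forget cluster create against classic infrastructure.
    subprocess.run(
        ["ibmcloud", "ks", "cluster", "create", "classic",
         "--name", name, "--zone", "tok02", "--workers", "2"],
        check=True,
    )
    # The post-incident fix: wait a random number of seconds between requests.
    time.sleep(random.randint(30, 120))

    # Tagging lives behind a completely different API, hence the extra wait.
    time.sleep(60)
    subprocess.run(
        ["ibmcloud", "resource", "tag-attach",
         "--tag-names", "workshop:tech-dojo", "--resource-name", name],
        check=True,
    )
```

The whole trick is the two sleeps; everything else is the same fire-and-forget the Bash version did.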
So when we knew what state the cluster was in, and then Ansible coming along down the line telling us what state it was in, then we could do our post-provisioning. But I digress. So what we ended up using was something called AWX. Does anyone know, does anyone know what AWX is? Do you know what Ansible Tower is? Ansible Tower is the easier version of AWX. That was supposed to be a joke. Nothing? Okay. So AWX is the upstream version of Tower. I worked for this small startup called IBM that you may have never heard of. And, funny thing, I can't get a license to Ansible Tower. I haven't figured out how to do that. So of course we needed a way to have some type of command and control, so we landed on AWX to do the work for us.

And then we had our own little app built by an amazing engineer; his name is Mofi Rahman. I desperately miss him. He has moved on to greener pastures, and one day I will get him back to work with me. He wrote a little Go app called Kube Admin that allowed us to have a nice overview of what all our clusters were doing. And the best part about Kube Admin, and this is funny because if you do any cloud stuff at all, you're going to laugh at this, is the delete command. On IBM Cloud specifically, I counted this, it's been a while since I've counted it recently, but I did count it: it takes 17 clicks to delete one cluster. 17 clicks. And that's not including also verifying by putting the cluster name in there. So I'm clicking around from the very beginning if I need to delete this one cluster; I counted through all the different things and eventually figured out it was 17 of them. Kube Admin turned it into a checkbox and a delete button. So that means I can nuke every single cluster inside of my account now with literally a checkbox and select all. It was phenomenal. It saved so much time, it was unreal. Okay, you're probably saying, JJ, why didn't you just write a Bash script to do that, because you can do what Kube Admin does with the ibmcloud CLI, right? Might I point you back to the problem? So, needless to say, Kube Admin was a godsend for us when we were cleaning up the stuff too, because these are clusters that cost thousands of dollars a month to run. So I can't have these running all month, right? I need to delete them as soon as the class is over. So saving that time was awesome. I will show you that here in a moment.

So the next thing we started leveraging — wait a minute. Now I've done all the work that I had to do myself, spinning up all the clusters and supporting all the clusters, but I needed a way for my instructors to request these things. So we ended up leveraging something called IBM Cloud Functions. That's basically Lambda, if you will, for IBM Cloud, but built off of OpenWhisk, which is pretty neat. We actually have another one called Code Engine, which I'll talk about in a second. But we ended up leveraging Cloud Functions, and I'm speeding up because I want to get to the code. We had this Cloud Function start scraping our internal GitHub, which is really neat. So anytime somebody put in a request for a new cluster, it would actually scrape through the issue, pull out the information from it, and then push all that into AWX with a REST call and shotgun that stuff out. So now I'm not even touching AWX directly anymore. Now I have this little serverless function that does the work for me, which was really, really neat. And I'll show that here in a second.
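As a sketch of what that function did, it's roughly the following. This is a hedged illustration, not the real code: the repo, label, job template ID, and field names are all made up, though the AWX job-template launch endpoint is the standard one.

```python
# Hedged sketch of the Cloud Functions piece: read open cluster-request
# issues from GitHub Enterprise, pull out the fields, and launch an AWX
# job template over REST. Repo, label, and variable names are invented.
import os
import requests

GHE_API = os.environ["GHE_API"]      # e.g. https://github.example.com/api/v3
GHE_TOKEN = os.environ["GHE_TOKEN"]
AWX_API = os.environ["AWX_API"]      # e.g. https://awx.example.com/api/v2
AWX_TOKEN = os.environ["AWX_TOKEN"]
JOB_TEMPLATE_ID = 42                 # hypothetical "create classic cluster" template


def open_requests():
    # Any open issue with the right label is treated as a cluster request.
    resp = requests.get(
        f"{GHE_API}/repos/workshops/cluster-requests/issues",
        params={"state": "open", "labels": "cluster-request"},
        headers={"Authorization": f"token {GHE_TOKEN}"},
    )
    resp.raise_for_status()
    return resp.json()


def parse_issue(body: str) -> dict:
    # Issues come from a template, so simple "key: value" lines are enough.
    fields = {}
    for line in body.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip().lower()] = value.strip()
    return fields


for issue in open_requests():
    extra_vars = parse_issue(issue["body"])
    # Hand the parsed request to AWX, which owns the long-running provisioning.
    requests.post(
        f"{AWX_API}/job_templates/{JOB_TEMPLATE_ID}/launch/",
        json={"extra_vars": extra_vars},
        headers={"Authorization": f"Bearer {AWX_TOKEN}"},
    ).raise_for_status()
```

I do want to take a quick aside.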
Who knows what Terraform Workspaces is? Or Terraform Enterprise? Yeah, you got one or two? One of the biggest pluses about Terraform Enterprise is the Workspaces ability. When you need to have state across your Terraform infrastructure, basically it sends your state file up to a shared environment where you can give people access to knowing what's going on. When we were doing our Terraform story, we actually used something called Schematics, and it's an IBM product. It's free, and it does exactly what that Workspaces feature does for Terraform Enterprise, which is pretty neat. It worked great for us for the longest time, until we ran into those problems that I'm not allowed to talk about. So if you ever are looking to do more Terraform work, I just want to make an aside saying that Schematics is worth it if you need to deal with more than one or two people needing to know your Terraform infrastructure. And again, I would have actually mentioned more about Schematics if I had my notes in front of me, because that was a whole Terraform story, but you know.

So I want to talk about Code Engine also, which is basically our next generation serverless platform. And this is where all the stuff comes into play, how we accidentally created our serverless application. So if you know Lambda, recently you can start shoving containers up to Lambda and it just sits there; of course you have the cold starts, of course, but you can just run a container. Well, IBM Cloud has something called Code Engine that does the exact same thing. You just give it a container, and it sits there and waits for it. It's actually built on top of Istio, which is really neat, and it all runs on Kubernetes. So if you know anything about the Kubernetes ecosystem and have experience with Istio, you can actually see the bits and pieces of Istio inside of it. So as an open source developer, you actually can understand what's happening behind it. It's not just a black box, which is nice. So Code Engine actually ended up being our power tool, because we were able to start building out those different features into different blocks.

So here's the actual API, or the whole workflow, that we had going from that Bash script to our serverless pipeline. As you can see here, we have an actor; we have that GitHub Enterprise where we put in a request; we have that GitHub Validator, which was that little OpenWhisk function that would parse through the GitHub issue; and then from the GitHub issue it would go to the grant cluster API — can you see my mouse? good — which was just another serverless app that allowed us to give out the clusters. Then we had this issue tracker that would actually parse through it, and that looked for OpenShift or Kubernetes. Then it would send the request either to Schematics, to spin up a Terraform instance on a VPC, or directly to AWX to spin up classic clusters. IBM Cloud has two different infrastructures: Classic, which is the old SoftLayer stuff — again, we can have a beer later and I can talk to you about SoftLayer — or our second generation with VPCs. Does this make sense? See how we progressively got to this point from a Bash script, where we started containerizing different pieces of our application and basically shoving it all inside Code Engine. So what did I create? I created a Rube Goldberg machine. If you don't know who Rube Goldberg is, you're missing out.
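Just to make one of those blocks concrete: the grant cluster API was, in spirit, a tiny Flask app that takes a JSON blob and shells back out to the ibmcloud CLI. This is a minimal, hedged sketch with hypothetical route, field, and flag names, not the production code.

```python
# Minimal sketch, assuming this shape: accept a JSON request, shell out
# to the ibmcloud CLI, report back. Route, fields, and flags are illustrative.
import subprocess
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/clusters", methods=["POST"])
def grant_cluster():
    body = request.get_json(force=True)
    name = body["name"]
    # Back to Bash, just wearing a Python costume.
    result = subprocess.run(
        ["ibmcloud", "ks", "cluster", "create", "classic",
         "--name", name,
         "--zone", body.get("zone", "dal10"),
         "--workers", str(body.get("workers", 2))],
        capture_output=True, text=True,
    )
    ok = result.returncode == 0
    return jsonify({
        "cluster": name,
        "requested": ok,
        "detail": result.stdout if ok else result.stderr,
    }), (202 if ok else 502)


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Build that into a container, hand the image to Code Engine, and the platform just runs it; the idea being you can shove an API key at it through the environment rather than keeping it in a script on somebody's laptop.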
But the short of it is that that's what all serverless applications are: just a bunch of different JSON blobs going back and forth between different features to make things happen. And it's fine, right? Like it does the work we need to get done. But when you boil it down to an application, like, what do we do? We put a bunch of text into an issue and we get clusters out the other side. So let's actually see some of the code now. So, officially speaking, this is private information because it's on our private GitHub, but it's just text, so whatever. As you can see, I've done 605 of these. And what happened is we ended up, for instance, here. If you can see this. Oops. So this would be one of the issues that we put in, right? So I would be like, hey, put in this issue. They fill out all this information: 40 OpenShift clusters, two worker nodes, the size they needed, inside of Tokyo or whatever. And it would get the name OpenShiftLab. Back to that problem I was talking about earlier about the name collisions; that was a real problem. And then we throw all that stuff up. We come along, we actually parse the information, as you can see here. If you have kids, you know who that little squirrel is, most likely: Dora the Explorer, Tico. No? Okay. He's the one who drives the car, Tico. He makes sure everything comes out properly. And then, you know, it sends off a request. And then, you know, does all that stuff. And what you see right here is Kube Admin. It actually gave us a direct link to what we needed to go to.

So the most interesting part of all of this was our GitHub, our issue tracker. And as you can see here, I promised you some really, really bad Python. As you can see here, what did we just do? Oops, it was this one. What did we just do? We just shelled out to IBM Cloud again. That's the joke of this whole talk: we just went back to our Bash script. We just wrapped it in Python with a Flask app. And I just realized that we only got five minutes left. I'm sorry. But the whole joke, this whole story, is that we started with Bash and we ended up at Bash again, just with a bunch of Python in front of it so we can just shove API keys at it. This is what it boils down to. And we just put it on Code Engine. And unfortunately, I wish I could talk more about it. And again, my slides were not great, but thank you, I guess. If you want to, right there, I'll post. Oh yeah, well, so was it worth it? We went on a massive journey to just come back to Bash. Right, and it's fine. Having that /approve flow actually did save hours upon hours of my time as an engineer, which is nice, because now I was able to do it. Also, I got to learn a lot of stuff that I didn't know beforehand, which was great. And that all worked out well too. But if you want to know more details or things I skipped over, take a picture of that QR code. There's a blog post I put out about it on IBM Developer where I go into more details. And thank you. Any questions?

I'm a Pisces? Yes. So the question was, how hard was it to maintain and understand how the whole rat's nest, to use your term, was laid out? The Rube Goldberg machine? It's the story of a good enterprise architect. Because I'm the one who designed it and built it, I knew where all the bodies were buried, right? But when I brought someone new on to help me for a short period of time, it was hard, because they would be like, why are we doing it this way? And why are we doing it that way? Luckily, this was only a platform for a feature to make other people's lives better. So it wasn't like production.
So I could work with them to move things around. But eventually, I just sent them that image saying, you're looking for this portion of it, this is the name of it, here's the GitHub repo; pull down the code and go fix it yourself, kind of thing. This one, this became my source of truth. Go ahead. Great question; to repeat it, how would I maintain this over a long period of time and productize it? Hopefully by having more than just me. And then back to the joke of: you break down the features into the different portions. For instance, the grant cluster API, all it did was spin up the application to give out clusters. The issue tracker, all it did was just parse that GitHub issue. I microserviced everything down to the smallest entity I could. So when something broke, I could figure out exactly what went wrong on what specific application, which is exactly what we're supposed to do with microservices, right? Or, back to the old Unix philosophy, do one thing and do it well, and that's what I tried to do here. Presumably, yeah, I just didn't design it that way because I like having histories, yeah. Yeah, yeah. Prototype. What's the joke? The most, no, the most permanent fix is a temporary one. There's something in the back. But okay, if you have any other questions, please don't hesitate to find me. Thank you.

This is on. I'll drop it in your pocket just as long as it's not right. It's your phone. Phone is that way, ooh, I'm already live. Were you planning on staying stationary? No. I mean, I'll probably not fall off the stage. You're gonna stay on the stage? Yes. Okay, we were just wondering where to point the camera. What we're gonna try to do is get the camera in. Right, I can stay still. I can do it if you need me to, yeah. Is that what you want me to do on this side? I'll just, yeah, we'll just do this. This'll be fine. Is this over? Yeah, ooh, that's, yeah, that looks good. All right, are we ready? Thank you, you're the best. Are we good? All right.

Thank you everybody for hanging with us through the slight technical difficulties. But we do have the presentation up on both sides now, so that's a win. This talk is called Helm and Back Again. And my name is Hilary Lipsig. I'm a principal site reliability engineer at Red Hat. A little bit about me. The slide is slightly old now, because as of a couple of days ago I've taken a new role at Red Hat, but up to now I have been the global team lead for a distributed site reliability engineering team. We support managed services running on top of OpenShift Dedicated. If you've heard of Red Hat's managed Kafka offering or OpenShift API Management, those are the managed services my team supports. As I go through this talk, one of the things you'll get to enjoy is I'm going to talk to you about real production incidents we experienced in the field. And the reason that you're gonna get to hear about those is that they drive a lot of the perspectives that I'm sharing here today. My friends said I should mention this: I'm a self-taught engineer, so this officially makes the most amount of time I've ever been at a university. At least you still think it's funny. So today we're gonna go over a couple of things. We're gonna talk about what Kubernetes operators are. What are the pros? What are the cons? What is Helm? Again, pros and cons. When I would recommend using one over the other, and then options for transitioning. So starting off, what is an operator, okay? So Kubernetes is a series of controllers, right?
And to understand what an operator is, you first need to understand CRDs, or custom resource definitions, and the custom controllers that go with them. These expand the Kubernetes API to allow it to control things that are not its primitive types. So I'm gonna go on to the next slide, which actually makes it a little bit more obvious. Like, if I wanted a custom resource definition for a database, and I want custom controllers to interact with this database, that is when those two things come into play. And this is just basically expanding the paradigm of Kubernetes up to the application layer. I should probably follow along and actually read off my amazing speaker notes that I prepared, but you know, we'll see how that actually goes for me. So these custom definitions, they slightly increase the complexity of your Kubernetes deployments, but they also give you more options. So again, it's taking the nature of Kubernetes and applying that paradigm to actual software applications. Racing through that.

So why is that important, right? For a Kubernetes operator, that becomes important because what you're allowed to do, or what you're enabling yourself to do by using a Kubernetes operator, is you're taking the operational knowledge involved in running your application successfully at scale and you're codifying it. You're turning that into code. So historically there might have been some sort of operational automation that would exist to fine tune or recover an application from some sort of state. That's now all baked into an operator. When that's baked into an operator, that means every instance of your application, when it is installed and managed by an operator, is guaranteed to be the same. It's going to be operated the same. It's going to behave the same. People consuming it are going to have the same experience. So, that said, one of the things I say the most about Kubernetes operators is: a good operator requires a good story, right? This is the who, what, when, where, why and how of your application. You are writing software that is impersonating a human. I'm currently a reliability engineer. Prior to that, I spent 11 years in quality engineering. I have dedicated my life to writing software that impersonates humans. So the important thing about doing that type of software development is that you can't just say, oh, it's automation, so I'm done. No, you still have to lifecycle your automation. The Kubernetes operator is a fairly complex and amazing piece of software, and you need to lifecycle it. So it's going to lifecycle your application for you; it must also be lifecycled. We can actually start getting into several layers of this, operators for operators. Some folks will be familiar with OLM, the Operator Lifecycle Manager, which is an operator that lifecycles operators. So it really can go all the way down, but at its core, a good operator requires a good story. It's replacing the human element.

There are five levels of maturity to a Kubernetes operator, and I actually pulled this right down off the operator website. For those of you who don't know, it's operatorhub.io. So the five levels represented here tell you about what types of behaviors you can expect from the operator. So if you're consuming an operator, which you can — many companies, Red Hat included, make operators available for free use for folks; it drives the adoption of the software — so if I'm looking at an operator and I'm thinking of installing it, I'm saying, oh, it's a level two.
Well, what does that mean, right? It means that it's doing installs and upgrades. Level three, we're getting into the full lifecycling. Level four, deep insights. This is important. When I mentioned earlier the types of workloads I'm a reliability engineer for, managed Kafka, RHOAM, these are level four and above operators. They come with these deep insights: metrics for SLOs, SLIs, alerting, usage. These are the types of things that tell us about how our software is behaving, and it feeds back into the software development life cycle, because it allows us to also improve our applications and the operators managing our applications. And the final maturity level is level five. This is autopilot. That means that the operator can also do things like autoscale. So if your operator is detecting that your software is starting to run up against some limits, it could actually increase the capacity of your software. This might be scaling some replica sets up, increasing a database size, those types of things. There are actually not a lot of level five operators out there, but they are really awesome. I will say CouchDB has a great one.

So, pros and cons of an operator. Let's just start off with the sunshine, the rainbows, the good stuff about operators, right? I work with these every single day. I mentioned this earlier: when you take your application and you package it in an operator, you've made it more accessible to people as a whole. Because again, you've codified the operational knowledge required to be successful with your application into the operator. So you're really reducing the barrier to entry for adoption. The automated and ongoing dependency management that comes with an operator is great. Operators are continuously running. They're kind of like the old school Linux daemons in that way; they're continuously running and performing actions on things called operands, or the Kubernetes kinds that they're watching. So the reconciliation loop helps with chicken-and-egg scenarios where, if something had just waited a little longer or retried again later, it would have worked perfectly. The reconciliation loop kind of handles that. And then, unlike some other ways of deploying applications in Kubernetes, you're not restricted to one namespace to manage when you're working with an operator. It can actually do quite a few different namespaces. This is very nice when we start looking at the topology of some of our managed services. They have all of their managed observability pieces in one area, they've got all of their other pieces in another area, so namespace separated. And then that also allows for multi-tenancy because of the namespace restriction. So that's all very cool. Service accounts with custom RBAC, this is really important, especially in SRE land. The most amount of actions, the least amount of access. And last but not least, no snowflakes. I talked about this before. Every instance of your application managed by an operator, kind of ignoring the auto-scaling potential, will look exactly the same. So for me, I have a fleet that can change in size every day. I will get different numbers of clusters to have to look at. And so I could get a whole brand new cluster, have an alert go off on that cluster, and I know exactly what to expect, because the operator makes it so that it's exactly the same as every other cluster I have to worry about. And that's very, very nice when you're operating a fleet at scale. This goes back to the whole cattle, not pets situation. Operators drive that.
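To ground the CRD and custom controller idea from a couple of slides back, here's a minimal sketch; the group and kind are invented for illustration, and in a real operator a controller you write (or generate with the operator SDK) would watch instances of this kind and reconcile them.

```yaml
# Illustrative only: teach the API server a new "Database" kind.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    kind: Database
    plural: databases
    singular: database
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                size:
                  type: string
                replicas:
                  type: integer
---
# An operand: the custom controller's job is to make the world match this.
apiVersion: example.com/v1alpha1
kind: Database
metadata:
  name: workshop-db
spec:
  size: 20Gi
  replicas: 3
```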
But there are some not so great parts about working with an operator. So, the reconciliation loop, right? We talked about that a second ago. The reconciliation loop is your operator going, is this in the state I'm expecting? If it's not, it will try to make it into the state that it's expecting. And if that fails for whatever reason, it will try again. Sometimes you would really want that to fail. There are some things that can happen where you have an operator running and, let's just say that it's a real story, the operator is trying to perform an upgrade, and it looks at the install plan for the application it wants to upgrade. And within the install plan on the new version, there's a change, cool. Except in the old version of the install plan, that changed field was immutable, meaning it should never change. That upgrade will fail. But your operator will keep trying and keep trying and keep trying. And that can cause it to do some not so great things. You might be seeing some serious spikes in CPU usage. Now, if your operator has something like extremely high priority, those spikes in CPU usage might be preventing other workloads from scheduling. So that can start causing some real cascading problems that you need to be aware of. This is a double-edged sword: a powerful tool you can hurt yourself with. So again, going back to level four, deep insights. The important thing to do here is, when you would want something to fail and you know it won't fail because we're dealing with an operator, you should be writing in some alerting, some way to catch those types of things that would otherwise result in failures in most traditional ways of packaging and deploying software.

An operator has more overhead to write than, for example — we'll be contrasting this with Helm — there's just a little bit more going into it. And then something or someone has to operate the operator. So be that a person such as myself, or another operator such as OLM mentioned earlier, something will operate the operator. They're only self-driving to a point. And no snowflakes, right? I just had that on my last slide. Wait a minute, that's great. Sure, except sometimes the workloads that I'm an SRE for are on customer-owned infrastructure. Now, customers are paying money for this infrastructure, so they are, of course, using it. Well, I'm like, that's the point, right? The problem can be, though, that installations and upgrades can use more resources than runtime. And so in situations where a customer is really pushing the envelope on their resources, because they're trying to get the most for their money, you might have to start doing some stuff in order to make an installation or upgrade finish. Now that could be simple; perhaps, okay, well, this part's already upgraded in this application, so I can actually scale those pods down to zero replicas and that'll free up the resources. That's one thing that you would do, except that when you have an operator, by the time you finish and apply that change, the operator's probably already put it back to what it's expecting. So you're not gonna get the benefit from those freed-up resources. You actually have to scale down the operator to zero replicas, then scale down your workloads. Remember when I said operators for operators? Yeah, if you have an operator controlling your operator, you have to start scaling down multiple layers of operators to start freeing up space. This introduces a really uncomfortable amount of room for human error.
I don't like an uncomfortable amount of room for human error. So, double-edged sword, once again: cuts beautifully, can cut yourself.

So what is Helm, right? It's essentially a package manager, right? It's a little bit fancier than that, but at the end of the day, it's a package manager and it's a templating engine. If you've used Brew or npm, you understand the basic core concepts here. Let me double check my speaker notes, because I'm not gonna lie to you: this talk was originally supposed to be done with a second person who has used Helm way more extensively than I have, and I just want to make sure that I don't say anything wrong. I don't see him here today, but one of my colleagues is a Helm maintainer, and if I said anything incorrectly, he would just never forgive me. But I know he's around too, so I'm waiting for him to pop out of the woodwork and be like, surprise. Yeah, exactly. So, right, so ideally the software you want to run, like with an operator, would come with a Helm chart supplied by the folks putting out the application, if that's not you yourself, right? Helm is nice because it's declarative, right? We're talking YAML, which is great because you're not doing some coding. And with the various types of operators, which were kind of shown previously, typically like Golang, you're gonna be doing some coding if you're doing a Golang operator, which is great. Golang's not my favorite tool, but it's great. So it's basically just a templating engine and package manager. And it has some really neat features, right? The Helm CLI is nice, it's clean; that just makes it a pleasure to work with. Helm charts, that's the thing you hear the most; this is basically the templates, this is how you're really defining everything. Or rather, the charts are the packages, and then the templates provide the dynamic capabilities. You have values, so configuration values that you can make changes to; those are injected into the templating. And then of course you can version and revision and configure everything, right?

So, some pros and cons, right? All things being equal, Helm actually has a much lower barrier to entry to get started with. The operator SDK is great; they've got some amazing tutorials about how to get started with the operator SDK, how to use the operator SDK. But there's still some overhead. And actually the original impetus for this talk, especially when I was doing it with my other colleague: he actually sent me a Slack message one time, and he said to me, hey, I'm trying to figure out if I should write a Helm chart or an operator. I was like, okay, well, what's the operator gonna give you in exchange for the overhead? And he's like, nothing, I just already know how to do operators. I was like, cool, so you're writing a Helm chart then. And he did, it's what he ended up doing. He wrote a Helm chart for his situation; that was the better move. So Helm also has some additional niceties. Because of the fact that it's a package manager and templating engine, you can actually do releases and upgrades involving business logic. Not to say that you can't do them in an operator, but it's not really recommended, because it's not quite as dynamic as a Helm chart and releases driven by Helm can be, right? So if I wanna do some A/B testing with my application, I wanna test green versus blue or whatever on imaging to see what results in higher engagement, releases with Helm charts as the backing are going to be much better for that, because they're a little bit faster to roll out.
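To make the charts-values-templates idea concrete, here's a minimal, hedged sketch of what a chart boils down to; the file layout is standard Helm, but the value names and image are made up.

```yaml
# values.yaml -- the knobs someone consuming the chart gets to turn
replicaCount: 2
image:
  repository: registry.example.com/workshop-app
  tag: "green"

# templates/deployment.yaml -- values get injected into the template
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-app
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ .Release.Name }}
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}
    spec:
      containers:
        - name: app
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
```

Shipping the blue variant is then just `helm upgrade demo ./chart --set image.tag=blue`, and `helm rollback demo 1` puts it back, which is the kind of fast, configuration-driven release being described here.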
You're not packaging up an entire piece of software to operate a piece of software when you need to change your Helm charts and Helm configurations. And really, operators just didn't set out to solve that problem, right? That's not really what an operator is setting out to solve. It is one of the types of problems that Helm did set out to solve. So that business logic attachment, especially anything involving pre-deployment hooks or post-deployment hooks and checks, those types of things with Helm can be really great. If you're an Ansible user, you can integrate Helm and Ansible together. That's not super dissimilar from writing a Kubernetes operator that is an Ansible operator, right? So there's the Golang operators, the Helm operators and the Ansible operators. If you go back a few slides, I had that picture up and you could see that the Ansible operator and the Golang operator have equal parity in terms of functionality. So at that point, when you're integrating Helm plus Ansible, you're basically at a stepping stone up to an Ansible operator. You may never decide to just fully package everything into an Ansible operator; having Helm deploy operators is a perfectly valid use case. There are Helm charts to do that. Having operators use Helm charts, perfectly valid use case. There are operators that do that. So these things are pretty intermarried in some ways as you start getting into how do I start doing things at enterprise scale.

And so, coming back to the reasons for doing this, right? Helm has rollbacks, and rollbacks are awesome. But a lot of places don't necessarily do rollbacks, so you might not prioritize that. The ability to do a rollback is great, because we all have that oh-shoot moment where something terrible happens and maybe we don't even exactly understand why. But a lot of people in the industry roll back by rolling forward. And both with Helm and with operators that strategy works fine. And so the last thing I'm gonna talk about, it's actually on my slide, it's in my speaker notes: third party software you need to repeatedly deploy. I'm not a big fan of writing operators for software that you don't own, because you can't guarantee consistency in operations when you're working with software that you don't actually, like, you're not the maintainer of it. They could decide to change some fundamental piece of how that software operates, and your operator is completely hosed. So I think Helm's a little bit better for when you're having to repeatedly deploy or upgrade software that you're not the maintainer for. And you can do some conversions from Helm. There's a Helm operator in the operator SDK, so you can do some conversions. And that's functionality that is always increasing. When I first gave this talk, that was barely at a level two. Now it's really very solidly at a level three in terms of operator maturity abilities. So it's continuously improving. We can thank the operator SDK group for that work there.

So of course, though, right, we're talking through a pros and cons list: there are some things that Helm doesn't do super great. So we talked about CRDs, the custom resource definitions, right? And Helm uses these too. The problem that Helm has is it doesn't have the built-in logic to clean up old custom resource definitions when it's time for them to go away. So you need to do some clever tooling, use some post-install hooks or whatever, in order to actually go back and do that additional cleanup.
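One way to bolt that cleanup on is a Helm hook. This is a hedged sketch, with a made-up CRD name and a generic kubectl image, of a pre-delete hook Job that removes a leftover CRD when the release goes away; it assumes a service account with RBAC that is allowed to delete CRDs.

```yaml
# templates/crd-cleanup-job.yaml -- illustrative only
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-crd-cleanup
  annotations:
    # Helm runs this Job before deleting the release, then removes the Job
    # itself once it has succeeded.
    "helm.sh/hook": pre-delete
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      serviceAccountName: crd-cleanup   # needs permission to delete CRDs
      containers:
        - name: cleanup
          image: bitnami/kubectl:latest
          command:
            - kubectl
            - delete
            - crd
            - databases.example.com
            - --ignore-not-found
```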
So that's not lovely, because it's not all in one tool. Again, if you want to be writing an operator with deep insights, level four and above, you're gonna be basically doing that from scratch, or looking at going from a Helm operator, or Helm plus Ansible or something like that, into an Ansible operator, or just a straight-up Golang operator from scratch. And you can fairly easily move between Helm and an Ansible operator. I've actually even found some little tutorials to help make life easier on that, but you're still talking about an effort of some copy-paste. In terms of upgrade strategies and so forth with Helm, if you're talking about an upgrade that involves something like a schema update to your database, yeah, Helm's not gonna be the best for that. There are ways around it, some additional tooling pieces that you can work into your upgrade strategy. Natively, though, Helm's not gonna handle that for you. So that's a little bit of a bummer. And if you like DRY code, or don't repeat yourself, if that's really valuable to you, the inheritance model with templates in Helm is maybe not going to be your favorite implementation of inheritance, because you end up having to define the same thing in multiple places, which is, again, a little bit less than ideal in some situations.

So who wins, right? Honestly, if you are looking at your software and you're putting out something new and you've not really done it before, I'm going to recommend starting out with something like Helm. That is a really strong starting point for just being able to help you move and iterate more quickly. You need to reliably move software? Helm's a great choice, especially in a Kubernetes environment, right? Well, I mean, it is Kubernetes, but Helm's a great choice. You have installation with hooks, you have templating, you've got upgrades with hooks, and really, when you're just getting going, that's really what you need. When you're originally starting out with some software, you're probably not going to necessarily deeply understand how your software actually needs to operate. There are some things that you gain from experience. And so starting off with Helm, instead of trying to write an operator and guessing, is just going to probably save you some cycles and a little bit of stress.

But eventually — whoop, I just promised them I wouldn't fall off the stage and I almost did it anyway — eventually, you might deeply understand your software and how it operates, and you might be looking to figure out how you can better drive adoption for the people that you want using it, right? And that's when you're going to start thinking about moving to an operator. So again, when you need more than just installs and upgrades, when you need deep insights, self-piloting, full lifecycle management, you want to make sure that all the CRDs get cleaned up, because you don't want other tooling interjected into the ecosystem to try to do it for you, especially in cases where an integration can break; integrations breaking are awful. It makes everybody miserable, it's terrible to debug, we just don't like that. So eventually you might start thinking about moving into an operator for those reasons. I really feel like starting off with something like Helm, especially Helm, and then moving into an operator just allows you to be more iterative. I mentioned, I think, earlier that I've spent 11 years in quality engineering, and I really, really, really, really like iteration, like, a lot. So I really recommend it as a great first step.
It can last you for a very, very long time, and if you're not a Golang developer, it's a really great place to be as well, because we're looking at YAML; everything's declarative still. And again, like I was saying, transitioning into something like an Ansible operator, which is also going to be YAML, is going to look very familiar. So that said, yeah, I actually got through the whole thing. That said, we're ready to go ahead and move on to Q&A. And thank you guys for coming along with me on this. Okay, David, go ahead and heckle me. I'm ready. That way, I think there's a mic there. Maybe turn it on.

Pretty good talk. So I come from a background where I know a good bit about Golang operators, but I don't know much about Helm. So I'm not sure if it's the best topic, because I know you said you had a colleague who was generally doing that, but we'll try anyway. So the fact that, sorry, if you're using Helm templates, you might not get the deep insights level of an operator that you would get with Golang. Can you give a suggestion for how you solve the observability problem if you choose Helm over a Go operator? That gets really personal, right? That starts looking really deeply at the anatomy of your software. You're familiar with the Aaron exception tracking project that my team's working on today, right? Sure. So that actually doesn't have an operator behind it, but we still had to add that same type of level of deep insights. So we ended up actually adding an observability API into the application, essentially. So you just kind of add an additional piece to your application, and that exports everything that we need to get from our application into Prometheus. So it's not super dissimilar in terms of a model. It's just not all part of one cohesive piece. Okay, makes sense. Thank you.

I have one other Helm question, but again, it's okay if you say, look, you don't have the knowledge on it. If you're writing a Helm template, if you have a CRD that somehow packages part of a Helm template, I'm not quite sure how that works, but what is actually watching for events on that new custom resource in Kubernetes? Like, is there some... It's a controller. It's kind of the same mechanisms, right? You're just talking about custom controllers. Yeah, but is there a single Helm controller that's watching for events based on whatever you've defined? Is there a single one of those running, no matter how many Helm templates you have? I'm not sure if it's a single one running. That's a great question. I wish Andrew were here, so he could answer all these really hard questions. So it is a package manager though, right? So if we go back, I'd put you to sleep. I regret that. It's my own ignorance on Helm, by the way. I just, I need to do a bit of research, but... Hang on, hang on. It's on a slide, I swear. It was on one of these slides. There we go. No, come back. Which one are you? You're slide 16. Okay, we're gonna go back to slide 16. Boom. Okay, there you go. Right. So, right, you've got your charts and templates. You've got your values with those in the configs, right? Those get pushed in, and the CLI actually does the work. And yeah, there's the thing running there. Something has to execute the CLI, so that could be GitOps, right? It integrates with GitOps. That could be a person. It's gonna happen from the CLI commands that are executing; it's going to do...
This is one of those things where I would wanna bring up the CLI API reference for you, so you can see all the actions on it. But it's going to basically do... The template's going to have sets of... The template's going to have the anatomy of the application. The chart's going to have your actions, essentially. And then the CLI basically does the execution loop on those things. But does it just create the resources on cluster, where there's then something else on cluster that's watching for events on that new custom resource? Yes. Okay, it's just... Anyway, I'll do a bit of research. Kubernetes is just watching for events on that custom resource, and your custom controllers. Okay, appreciate it. Thanks very much.

Any other questions? You guys gonna let me off easy? I'm like, I need to see how serious you are. Go! I'm not gonna be upset. So, I really... my ears perked up when you were talking about Helm as an entry point, instead of jumping right to an operator. And knowing that I have this particular problem of wanting to bring open source developers in, to bring their services into a cloud environment, and their job is not to be running a service in the cloud environment, but to be working on their code. So, how do you feel the ramp is, going from zero to Helm? We're trying to do it, and then being able to step up. But like, you just sort of skated across it in terms of the things, but what's the... Are there pieces missing in between? Do we need to make a better ramp up to Helm in the first place? What's the state of it, do you feel?

So, it's almost like there's a false dichotomy, right? That's how I feel about it. The whole kind of premise behind the talk, although it's kind of, I will say, named in a way that's a little bit misleading, is that my friend asking me, should I write a Helm chart or an operator, was a false dichotomy, right? He needed to use the thing that was right for what he was actually doing. And in that situation, because we're talking about just doing a quick demo app for a customer, Helm was the right thing for what he was doing. So, if we're looking to, and I know you and I have a bunch of shared context on this that the room is missing, but if we're looking to figure out how to lower the barrier to entry for people to play around in new environments, right? You can output some Helm templates and some example charts for your environments that people can then go and make changes to, so that their application can leverage it. So, you can basically give people a starting point, because it's a package manager and templating engine, and there are some things where you can basically say, I don't know anything about your application, but I know everything about my environment. So, fill your stuff in here, here are the variables in this template, and here's the information you need to know about actually using this chart to push code to the environment that I'm providing you. So, that would kind of be that. Now, in terms of the lift to go from a Helm chart to a Helm operator to a different type of operator, right? We actually provide tools to do a Helm operator. There's a little bit of automation. I don't remember if it was actually built into the Helm operator SDK or if it's just additional tooling, but there's essentially a little bit of a conversion that Red Hat supplies. It's not the most well-documented thing. I will caveat it with that.
It's not the most well-documented thing, but it generally does work, and you can find some demos and so forth on it to start doing some initial conversions. Either way, when you're looking at going from Helm charts to an operator, you're going to have to start writing some things from scratch, because the minute you move from Helm to operators, you immediately have the ability to start working in multiple namespaces, right? That's an immediate thing. You're not going to have that pre-built into what you were doing with Helm before. Now we're going to have to start actually writing that in. So it basically just helps you to not have to redo everything you've already done, but you're still going to have to do more. Development goes into it. It's a complex piece of software that does hard things, and you need to treat it as such. Once you are treating it as such, then you can really leverage how powerful operators are.

And again, for me, I don't like Golang dependency management at all. It really makes me nuts. So I would much prefer to go from something like a Helm chart or a Helm operator straight into an Ansible operator. The Ansible operator, well, first of all, the operator SDK is Ansible under the hood. No matter what you're doing, you're using Ansible if you're using an operator, right? So I'm just going to skip the middleman and just go straight with Ansible, right? It's declarative, it's YAML driven. You know how to do this. We basically all can lint YAML in our sleep now, and every background, whether it's a software development, application background, or a sysadmin background, at this point should be able to very cleanly and efficiently contribute to an Ansible operator. And I like that a lot, because if you're looking at an SRE team, which is typically comprised of people with both types of backgrounds, that really kind of creates a joint area for people to work very closely together, and you get that full spectrum of perspective. So I really like that for a lot of reasons, and I'm a huge proponent of it for a lot of reasons, and I just also like Ansible. Other things that are very comfortable for me about working with Ansible operators: we're using the YAML, we're also using Jinja2 templating. So as somebody with a Python background, Jinja2 templating is very familiar to me. It's also just very easy and super quick to learn. To date myself, I've been using Jinja2 since before it was part of the Python standard library. So when I say that it's easy, it's like riding a bike. I never forget how to use Jinja templating. So there's a lot of real power in that.

Again, Red Hat has put out some stuff that kind of helps people be able to automate the moves. There are some scripts that exist. I might have them in this slide deck. No, I don't, I used to. There are some scripts that exist to actually help speed that up for people. And eventually you're still gonna have to do a little bit of copying and pasting. But if you're going from Helm to something like an Ansible operator, that copying and pasting means that you don't have to change as much. Versus, really, you can understand all of your logic deeply, but if you're gonna write a Golang operator, you're looking at writing it from scratch. There are still pros to a Golang operator, though. I don't like Golang, but they really have a lot of pros. And I just sat down and I did an entire separate talk almost just to answer this question. They do have a lot of pros.
So don't let my dislike of Golang as a language influence you. They're great. Did I get you? No, I think you definitely did. You definitely did. And I think, I mean, especially if you go back around to the codifying knowledge, like, how can you write an operator to codify knowledge that you don't have yet? You haven't actually operated it, right? Exactly. Yeah, thank you. How did we do? We still got time, like five minutes. Anything else? You can also ask me just for weird stories from the trenches. I always love telling those, because those are really fun. No, I'm going to sit down then. Thanks, guys.

That doesn't count as fresh air. Yes. Please be PG-minded, otherwise we're going to have a bad time. It's fine. You can, you can, you are okay to heckle. Is there a YouTube live stream thing going alongside this? There is, let's go look at that, I don't think we'll find the one for not that small. There we go. Oh, look, I can see myself in the darkness. It is actually quite dark inside the YouTube stream. Yeah. The puffy-us, right? Let's take a look here. We're going to do a little bit of popping out into a new window. I'm going to hide the bookmarks because, yep. I'm slightly proud that I've broken your systems. At the same time, I'm finger-wagging you for not doing, freezing my repo files. I got to break Facebook. Not many people get to say that. I already have this over here. Are the screens supposed to do something? I mean, the computer detects the other display, so there's, yeah, I'm not touching those. I don't want to break it more. Not doing anything at all. This is excellent, I guess. Hey, the screen things aren't working and now it's not detecting the secondary display. So it's now officially worse. That's what unplugging and replugging is supposed to fix. It made it worse. I don't want to touch that. I don't know what cables do what. It was working a minute ago, showing the second screen, but not showing up on the things. There's a second display which starts showing two displays here, but there isn't one. Is the adapter thing actually working? Because it definitely wasn't working before. Yeah, it wasn't working until I leveled up. Am I supposed to use this cable instead? Oh, let's try it. What can happen is nothing will work anyway. Well, nothing's happening with this one either. Actually, it's not working. I don't know the... I'm almost to the point of, like, oh, if I reboot my laptop, will it just suddenly magically start working? That would be weird, but not out of the realm of reality. Yeah, is that stuff on? The spread one? I can try rebooting the computer. Yeah, fuck it, we'll try rebooting the computer. Of course, we're going to have one of those waits for a little bit. This is why we're fine for not waiting and speaking, because sometimes you're not getting your 15 minutes. Is it because you are plugged into a cable or something? I'll find out. Nothing on this, what about the other one? It's not doing anything. Nope, but it broke itself. That's for sure. Yeah, it's on me. How dare you? It's Moses. Well, the secondary screen shows as detected. Okay, no window. Now, you're ridiculous, all of you. I don't know if I do mirror. Okay, so mirror works. I wonder if mirror is going to work with the other thing. This is why I was going to be on a different display screen. There we go. Yes, it's probably bad to have the email, just to make all those not show up. Unfortunately, I don't have speaker notes, so I don't actually care about those. Let's see here, but I do, however.
All right, so, I assume this means that the AV side is not actually going to affect us because I'm plugged directly into it. Yeah. Yeah, okay, the farce continues. The farce continues, but hello, everyone. My name is Neal Gompa. I'm here to talk to you about golden images, for scaling up with the best of them — like the big companies do, maybe, sometimes. But a little bit about me first. I kind of consider myself a professional technologist. I've been a Linux user for — I should really update that, because it's over 15 years now. I've gotten old. I'm a contributor and developer in Fedora, CentOS, openSUSE, Mageia, OpenMandriva and other distributions. I've contributed to the RPM package management stack as well as Kiwi and systems management tools across the board. And for my day job, which is part of the reason why I'm here, I'm a senior DevOps engineer at Datto. So at Datto, we have something that we like to call the Datto Cloud. If you go look at our website, we talk a lot about this very marketing-y thing about how we do a purpose-built cloud to serve and save all of people's backups and be able to bring them back at a moment's notice when everything goes horribly wrong, like when a meteor hits your building. A meteor is probably not hitting your building, but it's a fun way to describe it. But in the beginning, we didn't do automation, because we were small and there were not that many computers. It was also like 10 people running the whole company. In the beginning, the physical servers were built by hand, by people. We ran the installer on there and did the things. And we had a sheet — a paper sheet — with the steps you were supposed to follow to install it mostly the same all the way through. This worked up until it didn't, which was very, very quickly. Then we decided, you know, we probably should take the first steps toward some kind of automation. And that's when we got to the land of: oh, you can automate the install steps. So we made kickstarts and preseeds and all those other ways to automate your installer. We did that to hopefully regularize the install process, which mostly worked, as well as it could, given that the computers were still not completely identical. And that meant that you had variance, all kinds of things. Virtual machines on these machines now got preseeds or kickstarts or whatever to install them hopefully the same — hopefully. However, we were still basically having a human press the buttons and type the URLs to load the things. Fun things can sometimes happen when you are doing it that way. Then we got to: we should probably not do this part manually, because we don't have to. And this is where the Foreman era begins. So we bring in Foreman to standardize and automate those parts. Now we're not typing in the URLs to access the kickstarts, preseeds, or whatever. We are now passing the URLs automatically through magical iPXE things or discovery CDs or whatever, made by Foreman. And then when Foreman comes in, it runs the installs and then hands off to connect it back into the system. With that, we added Puppet into the whole mix to then maybe try to keep everything regular and the same and sort of usable. This actually worked kind of okay. And then we got to the point where we were gonna run our own OpenStack.
And then we definitely are glad that we started doing this because as it turns out, once you have the ability to run hundreds of virtual machines and then go into thousands, you are in a very bad place if you don't know how to keep track of them all. And so this really helped. But as ages of automation tends to be, they get weird. They get complicated and because we don't really, most people don't really think of this stuff as code like you would anything else. Nobody refactors, nobody cleans up, nobody looks at it. And so it gets all thorny and slow and weird. But eventually it got so thorny and slow and weird that installs came multiple hours to do almost nothing. And so we redid it. We split up our puppet manifest into layers. We had what we call the common layer. And then we have the various app tier layers. And we started using Packer to build the common layer into our images while running the application specific layers when we're provisioning them. This works okay as long as we only had one or two products that we needed to do this. But M&A happens and then you suddenly have like half a dozen of these things and all different types of requirements. And people wanna do different things and they have different opinions about how everything is supposed to work. And then this starts really falling apart, especially when the self-service phrase goes into vogue. So into the era of self-service, this whole model that we just talked about for like the past couple of minutes, totally bad. Nobody likes it. It's horrible because everything had to be gated by the input people. The input people have opinions that are different from the software people. And then there's this whole thing of DevOps. This whole thing of like breaking down silos of making all the people work with all the different things. Well, if everyone's working on all the different things that that means that you have to segment how this stuff is being managed to those teams. And so you're breaking apart those functions and you're distributing them at the same time. Well, this whole architecture we've been doing is predicated on a model that this is highly centralized. Well, no, no more. And we as part of this change, we started shifting away from puppet. We started shifting towards software engineering things, having infrastructure engineers, you know, SRE types, DevOps types to own the full lifecycle of their products. And so we got new requirements for what we were gonna do. So integrating these new products and teams to do all these different things meant we needed to rethink how our cloud images were made for them. Because I mentioned briefly earlier, we run an open stack. They do stuff on it. We are trying to make it so that they do stuff on their own on it, because having all of us do it for them is very time consuming and very painful. So we wound up, you know, what everybody wants and built down to a few new requirements. In the past, we were mostly dealing with Ubuntu across our things, that's what the primary product started with, as we went with over time through these acquisitions and all these other things that were going on and the integrations, we became multi-distro. We have a split between CentOS and Ubuntu now. There's a smidgen of Ubuntu here and there, a sprinkle of Debian. So we kind of want to make sure that those can be handled in the toolkit. 
We're agnostic and independent of configuration management tools, because while we started with Puppet over here, some new teams over here are using Salt, some teams over here are using Chef, and some of the new teams and some of the old teams are now trying to transition towards Ansible. And so we're now all over the place. And because we're all over the place — and we're not saying that this is necessarily a bad thing; it could be a bad thing or a good thing, depending on your corporate philosophy of infrastructure management, and in our case, we're not considering it necessarily a bad thing — that means that we need to have the common baselines and the common stuff that people actually need, independent of their config management. And it needs to be built into the base layers, which is where the unified tooling and interfaces across distributions and the corporate standard tools that we need to have on all the production infrastructure need to be set up. So all this stuff, instead of being at the config management layer, is now at the image layer. And so we rethink the image build, maybe with some Kiwi. So we started by searching for a new image build tool. But as it turns out, if you look for an image build tool that does multiple different distribution families, your list gets very, very, very small. And even without that — let's say you throw away that requirement and say you're okay with having different tools for different distributions; some people are, they're crazy, but some people are — a lot of build tools are purpose-built and then get no maintenance ever again. They get made and then they get abandoned. And that is bad, because while the tool was made at one point in time and things are okay at that point in time, the distribution families and the distributions move on, they get new stuff, they change — sometimes there's an Upstart, maybe there's a systemd or whatever. Some big re-architecture of the world happens and breaks the fundamental assumptions of your tool. Your tool is not maintained to keep up with it. You're basically out of luck, because there's no real maintenance and there's obviously no community around it. And no community means there's no external pressure pushing inward. So after this searching, looking at all the things — you'll notice Packer is not listed on here. Yes, technically Packer is multi-distro, whatever. But that's because it's a dumb image tool, by virtue of the fact that it boots up a machine and then runs a shell script on top of it. We were already doing that, because that's what we were doing in the Puppet era. But we wanted our tool to be agnostic of the mechanisms that people wanted to use to maintain the systems. So Packer was thrown out immediately. Also, Packer is very slow, and there were certain types of assumptions that we wanted to bake in at the very base that you can't easily do when you're starting with a pre-existing image. So throwing out Packer was more of: we're not using an existing distribution-provided golden image, we wanted to start from scratch. So after a fair bit of searching, it came down to two tools, mkosi and Kiwi. I looked at both of them very carefully, and we chose Kiwi, primarily because of its maturity and stronger community. Kiwi has been around for about twice as long as mkosi has.
And some of the other things that it offers built into its feature set made it very attractive: things like being able to do audits of the builds, being able to track how things are evolving, and being able to do advanced mutations to the images and configurations and things of that nature. That's not to say that mkosi couldn't probably be made to do all these things. It was just not far enough along at the time we were looking at it. To be clear, for context, I looked at this in 2017, so this was several years ago. My understanding is that some of the mkosi stuff has gotten better, but if you're looking at multi-distro image build tools, those are your two choices — go have fun, pick one. To elaborate a little bit more about Kiwi: a lot of it is designed so that it's straightforward and idiomatic for people to understand how it works. It's got a declarative part in XML, YAML or JSON, and script hooks in simple shell where you can do whatever you want — you could even run Ansible or Puppet or whatever in there. It's flexible because it can build almost any type of image. We were looking primarily at cloud images, but then later we started expanding to live ISOs and self-install media, net-install stuff, the works. It automatically produces SBOM-ish artifact logs, so you can see what stuff is in there and how it changed and whatever. Importantly for me, in addition to being free and open source software — GPLv3 or later, whatever you want to call it — it is actively developed and maintained, with friendly developers, and it actually has a functioning community. There are people there who can answer questions and do stuff, and if you send a pull request, there is more than one person on the other end who might actually look at it and help you. So as we began our use of Kiwi, we found out that we couldn't use it as-is — not fully. We could for some things, the simple cases, but for some of the other stuff that we really needed to start caring about, we needed to fix a few issues that we discovered along the way. So we rolled up our sleeves and did it. One of the first ones was: it turns out it didn't really know where you're supposed to store app configuration inside the image when you're pre-making configuration. So we fixed that. We also made it so that excluded packages work with YUM and DNF — that's important: if you're trying to make sure a package doesn't actually get into your image, you probably want to make sure that works. This one was a biggie that happened because RHEL 8 is horrible. When you're trying to build images for Red Hat Enterprise Linux 8 or its clones, like CentOS 8 or Alma 8 or whatever, you have to deal with the wonderful fact that in most build environments you're probably not getting fully exposed to modular metadata, and so package installations just randomly fail unless you do something about it. The random failing was fixed by making it ignore all the modular metadata. So that's what we did. Although notably, it will only do this if you're actually running it inside of the build system. If you are not running it inside of the build system, it will happily bomb out if you have somehow screwed up your sources and you have no modular metadata and you're using modular content. As a sidebar: Kiwi is the only image build tool I'm aware of that lets you actually turn modules on and off as part of an image build. So you can actually configure them.
To my knowledge, nobody else has done this yet. So, point for them. This one was actually a happy accident: it turned out we were building an image on a computer, and the computer ran out of memory during the process, because it turned out Kiwi was writing part of the working directories into /tmp. And in Fedora and in CentOS, /tmp is tmpfs. And running out of memory while building an image is bad. So I moved it to /var/tmp, because that's where it actually was supposed to be and somebody had just moved it. So it was easy enough to fix. This one came from a colleague of mine, Matthew Coleman, who found out that APT is actually deprecating, and planning to completely retire, the well-known one-line sources.list format in favor of a new, yum-ish, INI-style deb822 sources format. And so he fixed it so Kiwi makes those files instead. This is supported going all the way back to Ubuntu 16.04 and Debian 8. And at some point in the next couple of years, the old format will supposedly be removed from APT. Maybe — I don't know, we had to work off of what they were telling us, but we made the changes and fixed it. I also suggested to the Ansible people that they should go fix this. I don't know if they have yet, but somebody should fix it. If you rely on Debian things, take note of that change. You're going to get bitten by this if you don't know about it, and it's going to be a very unpleasant change. So I want to show you a little bit of stuff here. So I have here — actually, I can also show it, I don't know. For this presentation, I also made some demo golden image descriptions available. You can check them out on GitHub. They have some documentation, with steps and stuff shown. These descriptions will build a CentOS 9 based image or an Ubuntu 20.04 image. Have fun. But I will not build another one again — unless, well, I could actually do one and then talk about the stuff while it's running. Let's do that. So this is it running an image build. The steps here: it's initializing the root environment. The way that it works is it creates a bootstrap root, which is the environment that it's then going to run the final image filesystem tree creation in. Bootstrap's done downloading. Now it's installing the real stuff, but I already pre-downloaded all the packages, so it's just going super fast right now. The image build's actually probably going to get done while I'm talking. But this shows it actually going through, running DNF, installing all the packages and whatever. And while that is going on, I'll go over here to show what this is. So here we have — this is what a Kiwi description looks like. It's an XML document. You could also write it in YAML or JSON. Don't hate yourself, write it in XML. That may sound worse, but you will really like having schema validation, because you really want it to catch errors before it runs the build. Because if you're an unfortunate soul running a Debian image build, and it is taking like 20-ish minutes for each build because Debian installations are slow, you probably want it to catch errors before it gets to that point. So you can set the repositories here. You can see we just have EPEL, EPEL Next, CentOS Kmods, and all the standard base repos. And we have these includes, which set up some defines and things like that, and a package set. And yeah, so then we go look here at some of these platforms. So I built a cloud OpenStack image.
So we look at the cloud generic base, which is where the profiles are defined. Kiwi has this concept of profiles, which you can use to define multiple artifacts that it can build from the same description. So you can make one description that can build a cloud disk image, a Vagrant box, a live ISO, a WSL image, all the different things. That makes maintenance easier, because then you can reuse all the common stuff, break things out accordingly, and maintain those pieces individually. And then we have XFS for our filesystem. So I should show the filesystem being set up — block device fanciness. This is what it does for XFS for OpenStack. And it also defines what type of disk format it is, so here we've got a QCOW2, that sort of thing. And then, yeah, at this point — oh yeah, we are actually almost done. So this image build actually took, I think, like three minutes. Now it's just doing the image conversion. So the way that Kiwi actually does the image builds: it creates a folder where it has the root file tree, runs DNF to install all the stuff into it, then creates a device-mapper thing to create the fake disk, then moves all the files over to the fake disk and seals it up. This has a couple of advantages. The first advantage is that the disk space used inside of the disk image is completely contiguous, which means it maximally compresses. The second is that it's faster, because you're not writing to a block device and you're not doing the whole double-IO indirect path thing. You're writing to a filesystem, which is very fast, and then you're copying all the stuff all at once to another one, which can be optimized by whatever IO operations are happening underneath. And so I've made this disk image. I will not try to boot it right now, because I don't actually have virtualization set up on this computer right now. But what we can do is look at the output. Look at the built image root — you can see this is actually the filesystem tree that it made. And this filesystem tree, I guess — you know what, let's try it. Yeah, well, it's going to complain about missing operands or whatever, I don't care. But it's a CentOS Stream tree. It's also a functioning root, because I just did a chroot and it worked and it wasn't broken. My host, as you could probably have guessed earlier, is Fedora 36. Well, one nice feature of the way that Kiwi works: you don't have to use the same distribution on your host as you do on the guest. Another nice thing: Kiwi is SELinux-aware, and in the right ways, which means that you do not need your SELinux policy on the host to match the SELinux policy that you're going to be putting on the image. However, your host needs to be SELinux-enabled for it to work at all, because it still has to go through the kernel, because SELinux is weird like that. So you need an SELinux-functioning system, but the policies don't have to be identical. You do have to do setenforce 0 before you run the image build, though. That's mostly because image builds are weird and they do terrible things, and SELinux doesn't know what's going on and will just start randomly blocking things. So you don't really want that. The little wrapper script I have here — actually, I'll just pull it up on GitHub, because then we'll have nice fancy syntax highlighting.
This little script is the thing I wrote that does all these little checks to make sure all the little prerequisites are in place so that the image build will actually work. So it'll check if SELinux is enforcing, record that, set it to permissive, run the build, then put it back after it's done. It's kind of like what Mock does, but it's simpler and more brute-forcey. But now to go back over here and show you a little bit about what's actually in here. So the .changes file is changelogs: this is an export of the changelogs of all packages inside the RPM database of the final image. It is also sorted in whatever the heck order RPM sorts it in, so if you do two builds sequentially, it will be diffable, for some definition of diffable. I've actually used this before to discover bugs, fix them, and contribute fixes upstream as a result of it. So very handy and very useful. Some people say this is also great for SBOM-ing things. Okay, but I find it more useful for being able to triage and debug images. Now, while we're on the topic of SBOM-ing things, this one is actually way more useful for the SBOM-type stuff. This is essentially an export of the primary properties of every package inside of the RPM database: name, epoch, version, release, architecture. I think the one that says 'none' is the disturl, which we don't use. And then the last one is the license field. Yeah — I could just say NEVRA, but I should explain the words. Someone in the audience points out this is also known as the NEVRA: the first few fields are the name, epoch, version, release, architecture. I just figured it'd be better to spell it out. But yeah, you can use this to say, hey, I'm worried something has, well, I don't know, maybe a SISSL license — that old OpenOffice license that nobody ever wants to see in their stuff. Well, we can search for it. If something shows up with it, we can ban it. We can also prove that it's not there by looking at this. Or maybe CDDL, or what is that one that the Mongo people use? I don't know. Anyway, weird licenses galore. You can also look for the word 'proprietary'. If the word proprietary shows up, you're probably in for a bad time anyway, but you'll at least know for sure whether you're in for a bad time. But this one is something I find personally quite useful: this is the verified one. This is an export of the output from RPM's verify over the whole tree, and it tells you what mutations have been done relative to what the RPM database thinks the filesystem should look like. This is useful if, between one image build and another, something looks funky and there's nothing obviously different — the packages are the same, the changelogs don't bring up anything meaningful — but then you go look at this and you see, oh yeah, we somehow deleted /usr/lib64/ld.so or whatever, and suddenly all my binaries just don't load. You can tell, because it'll show that file as missing. So this returns you the output of, I think it's rpm -Va, which does it for all the packages, and you can see for yourself. For Debian targets — well, you don't have anything in the verified file, and the packages list is less useful because more of the fields don't exist, because, well, there are fewer fancy metadata fields in the Debian stuff. But hey, you still get at least the NEVRA-ish ones, so you can at least see what's in there.
Yeah, and then the last one that I would like to look at here is this file, the Kiwi result JSON file is actually an input produced as an output. This thing can be used with the Kiwi bundle command if you're like maybe an OEM or whatever or vendor providing a software appliance to somebody else. You get, this can tell it how to wrap everything up so that you give this to a customer and they can say, oh, now I can see like what your compliance is like, what components are in there and my lawyers will be happy with you because I can prove that you aren't giving me something stupid in my machines. I don't know of very many vendors who would actually be so kind as to give you this information but it is super nice that there is a way to do it. This is actually a thing that I haven't seen any other image build tool do, providing these files and giving you an easy way to wrap it up for distribution. And this is what I like to consider the S-bomb type stuff because this is very useful for being able to distribute it or being able to verify and validate it and all that sort of things. Let's see here and then we're gonna go back here. I also have one other thing and I'm not gonna run it but I wanna talk a little bit about it. As part of our process of developing this at Datto, one of the things that people were very concerned about was being able to build their own images built on top of our configuration. Now, we support Packer for this purpose and that's pretty much what we recommend for a lot of teams that already have existing automation but some people are into the immutable craze for whatever definition of immutable you are thinking of. And so for those, I designed a little process in which you can build, you can make Kiwi config snippets and merge them with the base ones and build a whole new image completely from scratch with everything integrated. And an example of this is this container host description that I wrote. And so there's some steps here describing like how to do this. There's a script Kiwi image description merge that will actually merge the descriptions correctly so that you can then run an image build and in this container host description, it's just two real files. One here is the config CentOS 9 which literally adds this extension container host XML snippet. And then if we go back, go look here, you see we're adding just cockpit podman and adding a user and password so that cockpit actually works because if you recall, these are cloud images and the root user doesn't have a password. And so you can't really log in with cockpit without some user and password. And so that's what this does. And essentially the idea here is you always have a way to do this and you can do this. And if your team is one of those more mutable types and would prefer to do something along these lines where things are more declarative and whatever, file means layer on top, have a ball. And we also have, I have documentation files that I've written. These are actually like cut down versions of the documentation I wrote for our internal descriptions. So they kind of describe the features, the support matrix and like platforms and like what we consider configuration and usability interfaces. One of the design goals behind this is that no matter what distribution this is, all the interfaces are as close to the same as possible. With the CentOS 9 and 2004 image descriptions in here, the only thing that you should expect to be different is the package manager. We are using the same init system. 
We're using the same network configuration tool. We're using the same cloud-init configuration. We're using the same everything, all the things. Unifying two very different distributions onto a similar architecture makes it easier for teams to consider switching from one to the other. And that also pays dividends down the road when you want to say: maybe I want to use RHEL 9 or AlmaLinux 9, or maybe I want to use a newer Ubuntu LTS or whatever — you have that flexibility. Or maybe you just want to go nuts and build it for everything and then mix and match like a crazy person. That is also totally possible, because for all intents and purposes, the only thing your Ansible or whatever has to care about is the names of the packages and how to get them installed, because everything else will work the same. The configuration files are mostly in the same place, the config declarations are the same, and so on. All right, and let's go back here. So this I wanted to show is our production pipeline. This is how we build the images internally. It's a GitLab pipeline, with scheduled runs that happen monthly. We build the images — we actually have more definitions for Azure and AWS and so on, but this one shows just the OpenStack stuff. So we go from the OpenStack image build, to upload to our internal OpenStack deployment, which is called DCS, the Datto Cloud services, and we do a smoke test. We actually bring up the images, make sure they actually kind of work, and if they work, we flip the image from private to public and people can consume it. And then we publish pages — we have GitLab Pages deployed — to update all the docs and let people know what it is. So here are some references for all the things, the demo descriptions as well. I have uploaded the slides to the schedule as well, so feel free to download them and check it out for yourself. Now I'm done — questions? Yeah, someone give him a microphone or something, because I can't hear him. Is your solution Linux-only, or does it work on other free open source operating systems like BSD, Haiku, et cetera? Theoretically, you could maybe do something for BSD, but no, this is all Linux-centric. If somebody made a BSD that used RPM and DNF, then yes, because that's actually the only real mandatory interface here — yeah, that'll work too. I don't know what you would want to do with it, but yes, that will work too. Debian kFreeBSD, as someone points out, is a thing that has existed before and, if somebody cares, might exist again. My understanding is it's been moribund for a few years now. But yeah, the only real requirement is that it has to be a Linux-ish setup. That means POSIX, the Unix filesystem hierarchy, and it uses either RPM or dpkg — or pacman; it also supports Arch. I'm not making Arch images, don't ask, but it can do that too. Any other questions? Is this packaged for Fedora? It absolutely is. Kiwi is packaged in Fedora. You can do dnf install kiwi and it will show up. It's also in EPEL. So if you are running Red Hat Enterprise Linux 8 or CentOS Stream 8 or any of the clones thereof, it will be there, same for EL9. Going on what he said — not is it packaged for Fedora, but is it packaged for SUSE and openSUSE? So I just realized that I never said it, but Kiwi was actually created by the folks at SUSE, and created for SUSE Studio, which was a service that they used to have to give you point-and-click ability to create your own appliances with your own custom branding and your own custom configurations.
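As a quick aside before the SUSE Studio story continues: the production pipeline described above could be sketched roughly like this in GitLab CI. The job names, wrapper scripts, and OpenStack commands here are placeholders for illustration, not Datto's actual pipeline definition.

```yaml
# .gitlab-ci.yml -- illustrative sketch of a monthly image pipeline.
# run-kiwi-build.sh, boot-and-check.sh and build-docs.sh are hypothetical helpers.
stages: [build, upload, smoke-test, publish]

build-image:
  stage: build
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'   # monthly scheduled runs
  script:
    - ./run-kiwi-build.sh centos-9              # wraps kiwi-ng, drops a qcow2 in out/
  artifacts:
    paths: [out/]

upload-image:
  stage: upload
  script:
    # upload as a private image first
    - openstack image create --file out/base.qcow2 --disk-format qcow2 --private "base-$CI_PIPELINE_ID"

smoke-test:
  stage: smoke-test
  script:
    - ./boot-and-check.sh "base-$CI_PIPELINE_ID" # boot it once and poke at it

publish:
  stage: publish
  script:
    # only flipped to public if the smoke test passed
    - openstack image set --public "base-$CI_PIPELINE_ID"

pages:
  stage: publish
  script:
    - ./build-docs.sh
  artifacts:
    paths: [public]
```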
SUSE Studio itself is gone, but Kiwi lives on, and it's been integrated into the Open Build Service as well as Koji and other places. And yes, it is packaged in openSUSE. You install it by doing zypper — or dnf — install python3-kiwi. Why is it not working for you? I don't know. There it is. What about rpm-ostree? Contributions are super welcome. The reason it's not done is because I don't know how to work with it. There's actually a GitHub issue open for integrating rpm-ostree support into Kiwi, but none of us in the upstream know how to work with rpm-ostree, and whenever I've reached out, or any of us have reached out, to find out more about it, we haven't gotten useful help in figuring out how to plug it into the tooling. So to this day, we do not have rpm-ostree support, but we would really like it. So if somebody would like to come in and help, by all means please — we can do bloody near everything else. So, any other questions from anyone? All right then. Thank you. [Inaudible transition and audience chatter between talks.] Hey, everyone. So today I'll be talking about Kubernetes cluster monitoring. Is the mic working? Am I audible? Hey, everyone, thanks for attending this — cool. Thanks for attending this session on Kubernetes cluster monitoring. Today I'll be talking about why monitoring is important. We'll be talking about a few monitoring tools like Prometheus and Grafana. Next, we'll talk about Alertmanager config and how we send alerts to Slack. And lastly, in the demo, I'll talk about how I deploy an application on OpenShift and how we use all these tools to monitor it. So, I am Twinkle Sosodia, a software engineer at Red Hat, and I work with Red Hat partners to build their robust cloud native architectures. So, as we all know, we have smart surveillance CCTV cameras for the safety and security of people. Similar to that, we have Kubernetes monitoring tools like Prometheus and Grafana, which act as a CCTV camera for the system. Say your CPU usage or your memory usage crosses the critical limits, or your Kubernetes resources like deployments, pods, or nodes crash — in these cases, monitoring will help in minimizing the risk of the server going down or resource issues. And with that, it will also help in proactive management of clusters. So for monitoring, we know that we have multiple open source tools, and one of them is Prometheus. Prometheus follows a pull model: it scrapes metrics from target endpoints and stores them in a time series database. It was basically designed for heavy-duty container environments like Kubernetes, Docker Swarm, et cetera. Now let's take an infrastructure example. Let's say we have a server and on top of it multiple containers are running. Managing this complex system, and making sure that everything runs smoothly, becomes really very challenging. Now imagine having multiple such infrastructures and you have no idea what's going on inside them, either at the hardware level or at the application level. So these monitoring tools like Prometheus help here: they constantly watch over all the resources and alert us whenever something critical happens.
So all this automated alerting and automated monitoring is what Prometheus offers as part of a modern DevOps workflow. Now, for us to enable monitoring, we require a few Prometheus components, and to start with, we need a ServiceMonitor. The ServiceMonitor specifies which services the Prometheus instance should monitor. We can also use a PodMonitor instead; the difference is that it specifies which pods to monitor. The next is the PrometheusRule. Prometheus rules have two main components: alerting rules and recording rules. Recording rules allow us to pre-compute frequently used data, and alerting rules specify when we should get alerts, like setting up the thresholds. The third is Alertmanager. The Alertmanager config specifies the configuration of the alerts, and it also has the custom receivers like Slack, PagerDuty, et cetera. In the upcoming slides and the demo, I'll show you how I create all these resources and how we use them. So this is a short glance at what a ServiceMonitor looks like. It has fields like selector and namespaceSelector. The namespaceSelector contains all the namespaces the ServiceMonitor is going to watch, and the selector field takes the application name and matches it against a label. This is the PrometheusRule screenshot. The first is the recording rule; the other three are the alerting rules. The alerting rules specify the thresholds. So in this example, it says that if blue requests per minute is greater than 20, it will alert as low load. If it's greater than 25, it will alert as medium load. And if it's greater than 30, it will alert as high load. Moving on to the Alertmanager config secret: it takes the custom receivers, Slack in this case. It has the API URL for the Slack workspace, and it also has the channel to which all the alerts will be sent. So far we have seen what Prometheus is and what components we require. Now let's talk about what Grafana is and what components we need. Grafana is open source software which enables us to query, visualize, alert on, and explore metrics, logs, and traces. It provides tools to turn our time series data into insightful graphs and visualizations. So we require a few components, and we'll use the Grafana operator in our demo. We need a Grafana instance, similarly a data source, and then lastly we'll import a custom dashboard which I created. This is a short screenshot of the data source. It takes the data source information, like the type — Prometheus, since we'll be getting the data from Prometheus. The URL you are seeing is the URL for the Prometheus service, which is exposed on 9090. So now let's talk about how the demo will go. We'll be playing in an OpenShift Dedicated cluster. We'll have two namespaces, one for the blue application and another for the monitoring components. The monitor namespace will have an observability operator. Now what does this observability operator do? It installs Prometheus and creates standalone instances of Prometheus and Alertmanager. Moving further, we'll configure it using a ServiceMonitor, Prometheus rules, and the Alertmanager secret. Once everything is in place, Prometheus will start scraping metrics from the blue namespace.
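As a rough sketch of the three pieces just described — the ServiceMonitor, the PrometheusRule with the 20/25/30 thresholds, and the Alertmanager configuration that goes into the secret — something like the following would do it. The label names, the metric behind "blue requests per minute", and the Slack webhook are illustrative assumptions, not the exact demo files.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: blue-servicemonitor
  namespace: monitor
spec:
  namespaceSelector:
    matchNames: [blue]          # namespaces to watch
  selector:
    matchLabels:
      app: blue                 # matches the blue app's service label
  endpoints:
    - port: web                 # named port on the service (assumed)
      interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: blue-rules
  namespace: monitor
spec:
  groups:
    - name: blue.rules
      rules:
        # recording rule: pre-compute requests per minute
        # (http_requests_total is a stand-in for the app's real metric)
        - record: blue:requests_per_minute
          expr: sum(rate(http_requests_total{namespace="blue"}[1m])) * 60
        - alert: LowLoad
          expr: blue:requests_per_minute > 20
          labels: {severity: info}
        - alert: MediumLoad
          expr: blue:requests_per_minute > 25
          labels: {severity: warning}
        - alert: HighLoad
          expr: blue:requests_per_minute > 30
          labels: {severity: critical}
---
# alertmanager.yaml, the file stored inside the secret Alertmanager reads
route:
  receiver: slack
receivers:
  - name: slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME
        channel: '#alerts'
```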
And if something critical happens in this blue namespace, Prometheus will start sending alerts to alert manager. Now alert manager will handle these alerts and send it to Slack as a notification to the engineer. Moving forward, we'll deploy a Grafana operator and use Prometheus as the data source. So now let's move to the demo. I'll quickly show you how the OpenShift dedicated console looks like. So this is the dashboard of the OpenShift dedicated cluster. I already have observability operator installed and already have alert manager and Prometheus instances up and running. So we have the alert manager and Prometheus instances up. I'll show you the blue namespace. It has the blue app running. Blue is a simple go application. So moving forward, I'll quickly create, I'll quickly configure this Prometheus and alert manager instances. So I'll start by creating service monitors. Once this is created, I'll create Prometheus rule and alert manager. So once all these pieces are in place, we would require the last piece that is RBAC policies. So since a monitor namespace is going to scrape metrics from blue namespace, it will require few permissions. So we'll create a cluster role and bind the monitor namespace to that cluster role. So that cluster role will provide us with the information such as pods, services, endpoints and the status. I'll create the role binding which binds our namespace and service account. So once these pieces are all in place, we can forward the Prometheus and see how the dashboard looks like. So this is a Prometheus dashboard. I'll move towards the alerts. So we see that we are getting high load, medium load and low load. Looking to the rules, we can see we are getting alerting rules and the recording rules. Moving to the targets, we can see that blue name, blue application is up and running now. So now let's trigger this blue application and see how we get the alerts on Slack. So I'll curl it quickly for at least 30 times so that it triggers something. Now I'll move to Slack to see the alerts. This shouldn't take time, should be completed in like 30 seconds. So you can see that the alerts are getting fired. You can see medium load, low load and high load. When expanding one of these, we can get the alert details such as what was the alert name, which container it was, what namespace the alert was coming from, what was the pod name, et cetera. So this is how an engineer can quickly act upon if something critical happens and thus it will minimize the acknowledgement time. So so far we have seen that how we configured Prometheus, how we configured alert manager and how we get the alerts to Slack. Now let's see how we configure Grafana and get the Grafana dashboards up and running. So I'll move to OpenShift dedicated and quickly navigate to operator hub and search for Grafana operator. Install it in the monitor namespace. I'll move to slides to show you what we'll be creating. So we'll be creating Grafana instance, Grafana dashboard, and lastly we'll be importing the dashboards. Okay, so Grafana operator is up and running. Now I'll create the Grafana instance. So Grafana instance has the admin username and password, just admin, admin, which we'll use and by logging in. Quickly create the data source. Once these are all in place, I'll put forward it on 3000 and log in with the same username and password we provided in the instance. So now before moving, I'll just quickly confirm if our data source is working or not. So the data source is working fine. 
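Before the dashboard import, here is approximately what the RBAC pieces and the Grafana data source from this part of the demo could look like. The Prometheus service account name, the `prometheus-operated` service URL, and the Grafana operator apiVersion all depend on how the operators were installed, so treat these as assumptions rather than the demo's exact manifests.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: blue-metrics-reader
rules:
  - apiGroups: [""]
    resources: [pods, services, endpoints]
    verbs: [get, list, watch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: blue-metrics-reader
  namespace: blue                # grants access only in the scraped namespace
subjects:
  - kind: ServiceAccount
    name: prometheus             # assumed Prometheus service account name
    namespace: monitor
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: blue-metrics-reader
---
# Grafana operator data source (v4-era API shown; newer operator releases
# use grafana.integreatly.org/v1beta1 with a slightly different spec)
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDataSource
metadata:
  name: prometheus-ds
  namespace: monitor
spec:
  name: prometheus-ds.yaml
  datasources:
    - name: Prometheus
      type: prometheus
      access: proxy
      url: http://prometheus-operated.monitor.svc:9090
      isDefault: true
```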
Now I'll import a custom dashboard. So I created custom dashboard. You can also create your own custom dashboard or import predefined dashboard, which is available on Grafana's website. So I'll copy this template quickly and paste it here, import it. So you can see that we have turned our time series data into a visualized graphs. So in the alerts panel, you can see that at what time the alerts were fired, what was the alert name, what was the alert state, the container and other metadata as follows. Following that, you can see blue request per minute, like how many requests were there per minute for the blue application. Similarly, response status, process CPU. And lastly, you see that there are three, there are containers running, one for alert manager, one for blue application, one for Prometheus instance. So this is how you can use Grafana to visualize the critical things happening inside your cluster or your system. And this is how it will help engineer to just like, see what's going on inside it and acknowledge the critical issues and make sure that everything runs smoothly. So just to summarize what we have discussed so far, we have talked about importance of monitoring. We discussed about Prometheus Grafana and its components. In the demo, we deployed an app, we deployed observability operator. We played with Prometheus Grafana operator components. We imported the custom dashboard to see inside for graphs. So this is how one can empower Kubernetes cluster and by adding smart monitoring tools like Prometheus and Grafana. So that concludes my presentation. I would like to thank you all for your time. And if anyone has any questions, please feel free to ask. Great, so you said that both Prometheus and Grafana have rules for alerting, but what would be the best practice? Would you just set the rules only in one of them? Or do you kind of use different rules for Grafana and different rules for Prometheus? So we have alerting rules in the Prometheus and we used those time series database and we use those rules to visualize on Grafana. So Grafana is a dashboard which we see and it's in a human readable format. So that's how they integrate together and like it acts like a pure package together. Okay, so Grafana by itself is not alerting. Only visualizing, right? Yeah, visualizing. Any more questions? Thank you so much. Hey, one minute, we'll wait for one minute. I'll try not to fall off like you nearly did. I have to do it from here because of the mic. I like the idea of just standing over it. Okay, we can kick off. Everyone hear me okay? Yeah, cool. Are you being serious? Okay, I'll go like this Dave. Okay, so this is the story about an SRE maturity journey. This is the Kafka as a service SRE maturity journey. So I know there's some SREs in the audience. Please go easy on the heckling. This is my experience of trying to make an engineering team think about SRE much earlier in the development process and how we eventually on board with an SRE team and ultimately got the benefit of that at a later point. So yeah, this is our story. So my name is David Martin. I'm a software engineer at Red Hat. I've worked on various managed services over the years. The most recent one is the managed Kafka service. So Kafka as a service means you make some API call, wait some time, you get access to a Kafka instance. So this is going back a couple of years. So we have a bit of a journey timeline that I'll go through in the slides. 
So we're going back to August, 2020 where we have a remit to let's build a Kafka service. And we have some hard deadlines right from the beginning. 30 day goal, 60 day goal and a 90 day goal. And then further on a little bit. Okay, we start to think about SRE onboarding. Now there's a reason why we can't bring SRE in right from the beginning. I'll mention why that is in a sec. And then finally, the ongoing maturity, the ongoing SRE maturity or how we learned to really live with SRE and learn from them. So yeah, the first step, before we do, a little bit of an SRE refresher. So a bit of a disclaimer, I'm not an SRE if that wasn't clear. I may say things that might sound odd to SREs but I mean well, so please take it easy. So if you want to learn more about SRE or site reliability engineering, there's these very good books. The first one on the right is the original SRE book from Google. That kind of lays out the term SRE and some of the fundamentals and concepts behind it. The one in the middle is the workbook and that kind of puts into practice with some examples of how they do SRE. The one on the left, it really goes crazy on examples of how people have implemented SRE and actually built systems. So I'd recommend them kind of in that order if you haven't already read some of these. Not just for SREs, but for engineers, really good content in here. So some of the SRE concepts that are important here that we tried to bring into this. So breaking down barriers between dev and ops. So SRE is kind of a dev ops model, but SRE is very keen on breaking down those barriers. You do not want to throw something over a fence. In fact, there is no fence, that's the idea. You're applying engineering to operation tasks. These are all core to SRE. So I'm probably preaching to the choir a little here. Focusing on automation, eliminating toil. We're keeping, it's about keeping error budgets and service level objectives. It uses some of the terms here. I'm not going to go into detail, but it's all in the SRE book and then embracing risk. So we know that systems are going to fail. We know there's a risk to that. Well, okay, can we use data about that risk to kind of put in place some sort of failure mitigation or even allow for that in a service level objective and an error budget. So there's some of the main SRE concepts at the heart of this journey. At Red Hat, we have a specific implementation of SRE. So it's more of a central SRE team model. Now there's reasons for that. My understanding is having a central SRE team where all the SREs kind of converge. It's easier for knowledge sharing and more manageable for people scaling. I'm sure there's lots of other reasons, like there's, and in general, there are fewer SREs than developers in Red Hat. So it's just not feasible to have SRE representation on every single engineering team that's working on a service. So in SRE, that SRE group really wants to champion the SRE mindset instead of embedding in teams. So that team, they're building cross teams, service tooling, rather than becoming experts in each service. So the example here with Kafka, can you actually get SREs that no Kafka inside out? That's gonna be really difficult. It's gonna be really rare to find some of those skills. So maybe there's a middle ground there where we can get Kafka experts working with SREs and kind of find, okay, where's the crossover there? That makes sense. And in Red Hat, we have this idea of SRE onboarding and there's a process around that. 
It's a streamlined, staged process. You have a new service, you want to get it into production for customers to use. Okay, what is that onboarding process? What are the particular things we need to have in place along the way? So that's how Red Hat does SRE at any sort of reasonable scale. So let's go back to August 2020. Pretty much two years ago, the message from on high was: let's build a Kafka service. So what does that actually mean? Well, we didn't know at the time. We needed to start somewhere, but how do we actually start? We need to bring the right people together. So this is where our journey begins. We have a bunch of Kafka experts, we have a bunch of Kubernetes and OpenShift experts, and we have services experts. Now, there's going to be overlap between some of them to a degree. What I mean by service experts is people who have built, run and managed services in a production environment. They may know some stuff about Kubernetes and OpenShift. I think the least overlap was with the Kafka side, so the Kafka experts were really needed. But these service experts — that's kind of where I would see myself fitting in, with a little bit of Kubernetes and OpenShift knowledge. And that's kind of the basis of where I'm going with this journey, that group of people. But we don't have SRE at this stage, just because it's not feasible given the constraints — the people constraints in the SRE group. And this is a brand new service, so priority-wise, it's kind of like: at some point later, it makes sense for SRE to come in. So first of all, a 30-day goal. Let's start somewhere close by and see what we can do in 30 days. So bringing all these people together, what can we actually do in 30 days? Well, first of all, we can try to prototype an API to deploy a Kafka cluster, expose some bootstrap URL, and provide credentials. For people who aren't familiar with Kafka: the idea is that you have a bootstrap URL for one or more brokers that you would give your Kafka client. It uses that to connect and then figure out where all the brokers are, but you also need credentials to connect to that URL. So we have a starting point, building on the Strimzi project. The Strimzi project is an operator that works on Kubernetes to bring up a Kafka cluster, pretty much. So we use that as a starting point: let's put an API on top of it that can bring up a Kafka cluster of any size and expose those things to the users. What else can we do in 30 days? We could theoretically define an architecture that could scale to thousands of Kafka clusters. I'm not saying we could build it in 30 days, no way. But we could define it in some way that we can manage centrally. So it's all theoretical, and the theoretical concept here is a control plane and a data plane. Let me explain what I mean by that. The control plane is where you would define all the aspects of your service, and the data plane is where that actually gets implemented, where things are running. So in this case, the control plane is where we define a fleet manager that customers would talk to, to say, give me a Kafka cluster, and manage how many Kafka clusters they have, that kind of thing. And the data plane is separate — that's where they actually run. What else can we do in 30 days? We can think about how it can run in production. That's a little bit fuzzy, but we had a plan for how we could start to think about running it in production. The way we do that is: let's create a team that thinks about it from the SRE viewpoint.
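A short aside before getting back to that team: for readers who haven't seen Strimzi, the building block underneath all of this is roughly a resource like the one below — the fleet manager's job is essentially to stamp out and manage many of these. The sizes, listener type, and storage here are made-up illustration values, not the managed service's actual configuration.

```yaml
# A minimal Strimzi-managed Kafka cluster definition (illustrative only).
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-kafka
spec:
  kafka:
    replicas: 3
    listeners:
      - name: external
        port: 9094
        type: route          # exposes a bootstrap URL outside the cluster (OpenShift)
        tls: true
    storage:
      type: ephemeral
  zookeeper:
    replicas: 3
    storage:
      type: ephemeral
  entityOperator:
    topicOperator: {}
    userOperator: {}
```

The Strimzi operator watches for resources like this and brings up the brokers, listeners, and credentials; the bootstrap URL and user secrets it produces are what an API layer on top would hand back to the customer.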
So, that team — this isn't my naming convention, but it was called the SRE view team. Let's see it from the SRE viewpoint, and we'll staff it with people that have created services before. So I was on this team with three or four other people, and we had some idea of how SRE works. At least we thought we did; we knew a few things. So we tried to capture the SRE mindset, and we started to build a backlog of things that we knew SRE would ask about and need at the later onboarding point. But we needed to do something practical. We can't just think about these things. So in the first 30 days: okay, we're familiar with observability from other projects. Let's see if we can get some sort of observability solution running. And then what after 30 days? So yeah, the 30-day outcome. We had created lots of flaky scripts. I'm sure this will sound familiar to some people. We had loads of simple make targets with hard-coded config. Everything ran locally, but we managed to deploy Prometheus, Grafana and Alertmanager with some simple alerts and dashboards from upstream. And we had some hand-carved aggregated logging proof of concept with EFK — so it's Elasticsearch, pretty much. Yeah, we got something working. We also narrowed down what metrics could be used as a service level indicator. What I mean there is: the service level indicator is some sort of signal that tells us that the service is doing what it should do. Kafka has a ton of metrics. I think we had a spreadsheet with over a thousand rows. We narrowed that down to maybe 10 or 20 metrics — still too many, but ones that we thought, okay, these could be meaningful to what a customer actually sees. And we also, this is kind of from left field, built a proof of concept for a Kafka canary tool: just something that regularly produces and consumes from each Kafka instance. Just to prove a concept at this stage, but that proved to be very useful later on, so I'll mention it again. So that's the 30-day outcome. Lots of flaky stuff. So next, the 60-day goal. You show this stuff and then all of a sudden project managers and potential sales people are like, okay, let's keep going. Yeah, but what else can we do? All we can do is keep going, but what do we do? More automation. Yeah, keep on automating more of those scripts, more dashboards, more alerts. That seems like a good thing. Oh, yeah, and not just local — we want to let others try this out. That's the thing that really, I think, gives engineers a kick. When we want people to try it out, you have to think about it from a different point of view. We're gonna run this thing. People are gonna have issues with it. So we need to really think about that. And some testing would be nice. Yeah. What do we do? We respond with action. So I should mention that our team wasn't the only team. There were three or four, possibly more, other teams working on other aspects of this service. This is just our team's journey. So we are working on observability stuff, and the name of the team was changed to the observability team. We defined a simple SLO, a simple service level objective, and added alerts for it. So we thought we were doing great here. We had an objective that was based off an indicator that tells us the number of connection errors seen by Kafka brokers. In retrospect, it wasn't actually a great indicator, but I'll mention more on that after. But hey, it was a starting point. So we did move from scripts to operators at this stage. We made that jump.
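To give a feel for what that first simple SLO alert might have looked like, here is a hedged sketch as a PrometheusRule. The metric name and threshold are hypothetical — the real broker connection-error indicator was different — so this only illustrates the shape of the rule, not the actual service's SLO.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-slo
spec:
  groups:
    - name: kafka-slo.rules
      rules:
        # hypothetical metric name standing in for the broker connection-error metric
        - record: sli:kafka_connection_error_rate:5m
          expr: sum(rate(kafka_server_connection_errors_total[5m]))
        - alert: KafkaConnectionErrorBudgetBurn
          expr: sli:kafka_connection_error_rate:5m > 0.1   # threshold is illustrative
          for: 10m
          labels:
            severity: warning
```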
That operator work was, I believe, using the Java operator SDK for some parts, and I think there was a Golang operator SDK piece for one part — drawing back to one of the earlier talks. So, the canary testing tool: we moved that from a concept to an integral part of our service deployment, because we saw the benefit of it. We were thinking, okay, we can get some metrics from this canary tool. It will tell us when it's failing to produce and consume. So it's a round trip that we can get some information on, and later on, we can start to use that as an objective. We also started looking at logging aggregation options. We saw that Elasticsearch was not a good fit — it was a resource hog. And we liked the idea of Loki and Promtail with Grafana. It seemed nice to have the metrics and the logs coming into the same visualization tool. We also liked the idea of Observatorium, which brings Thanos and Loki together for centralizing those metrics and logs, so we started looking at that. To allow other people to use this, we started to maintain a long-running staging environment — yeah, "temporarily", for eight months. Because yeah, nothing's temporary. So, moving on, the 90-day goal. We're getting further along here. What do we want to do for 90 days? That's when we start preparing for SRE. So someone smart decided: oh, since you're running the service anyway, let's use it as proof to SRE that the service is maturing — or so we think. So we keep running the service internally, but we've reduced the SLOs. We say 99.9? That's not even possible. We just have an engineering team that's barely spanning two time zones, and contractually these people can't be on call. So what can we do? We can reduce the SLOs. And since we're starting to run the service, the team is renamed to the "running the service" team. And that is actually still the name of this team two years on. So yeah. We weren't quite SRE and we weren't quite software engineers at this point, so we coined the term pre-SRE. I don't think it stuck for much longer than that period. But we were acting as the initial on-call function. We had PagerDuty set up. During the work hours of the team, we had a rota where people would get paged — they'd get it on their phone — but it was not outside of hours. So it was establishing the groundwork for an actual on-call team. We were also running incidents as well. We had some internal customers, and we had some incidents that happened. When that happened, we tried our best to follow the sort of incident response document and format that we knew the SRE group was using. And we produced RCAs, root cause analysis docs, as best we could. This all proved to be very useful when it came to SRE onboarding, because they could look back at that and see: oh, right, so you have had some issues; here's what you did, okay. And we actually took actions out of it. So it was good to see that we had run the service in some sort of anger so far. We also ramped up the load, stress and failure testing to validate our assumptions. We were starting to build out alerts for the Kafka service, but we wanted to make sure that those alerts were actually valid. So we started to build testing around that and ensure these alerts were firing at the right point. From all of this testing, we started to get a lot of feedback, and we started to pump that into the other teams.
All of that feedback meant we were actually building a backlog for the other teams, which was one of the original goals of the team from our project managers: start to understand what it means to run this in production, start to gather the requirements we know we'll need, and feed them back into those other teams. I missed one point there: we also started the SRE onboarding conversations, so at this fairly early stage we were starting to line up some sort of SRE onboarding. So that's the 90 day goal.

So eventually comes the SRE onboarding. I should clarify that the person on the right here is actually SRE, because they're already on that mountain and they're helping us up, just to be clear. We're not pulling you up, you're pulling us up. So yeah, SRE onboarding. First of all, a transition plan. Let's start to involve SRE in our team planning, our incidents, our pull requests, day to day, just try to pull them into things as best we can. We were doing some incident response and RCAs, so let's bring them into that as well when it happens, and get them familiar with the service like we were. We had a parallel PagerDuty rota for a time: whenever an alert fired that caused an incident, someone from SRE and someone from my team would join in, ultimately run that incident, and find out what was wrong. We also identified some SRE transition opportunities for people on my team. What actually happened in the end is that a few people transitioned to the SRE org, and with that they brought the knowledge of the Kafka service, and they also had an initial goal of helping to upskill other people. That was kind of a secondary goal; it was mentioned at times during the early stages that this could be a model we follow, get these people up to speed and transition them over, so it was good to see that actually work. And eventually we phased the software engineers out of the PagerDuty rota and, more recently, out of production access. You do not want software engineers to have production access, that is not a good recipe, so it was good to see that eventually got through.

So, ongoing SRE maturity: working together at the top of the mountain. A few things here on how we keep a good working relationship. We're identifying and addressing toil together; any sort of toil-addressing opportunity, any automation, we can work on together in a backlog. We put a lot of effort into documenting procedures where automation wasn't feasible at the time. And we created a lot of alerts from the beginning, and that's one thing I would say to try to avoid. In retrospect it wasn't ideal: we created too many alerts, and each alert had to have some sort of procedure for when it fired, which meant we were bogged down documenting procedures and trying to find ways to automate things at this late stage. So I would say start with far fewer alerts, focused on the most important things. We're also jointly working on an operator that uses these alert events to trigger procedures for known states. Kafka is a very complex application, it can get itself into all sorts of weird states, and if we can have alerts around that which fire off something that gets it back into a good state, that's a win on our part. It means someone doesn't have to manually figure things out and follow some scripts.
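The real work there is an operator, but the core idea is small enough to sketch: receive alert events and, for known alert names, run the documented procedure instead of paging a human. Below is a minimal, hypothetical Python sketch of that dispatch, shaped around the Alertmanager webhook payload; the alert names and remediation bodies are invented for illustration and are not the actual operator.

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def restart_stuck_broker(alert):
    # placeholder for the documented restart procedure
    print("would restart broker:", alert["labels"].get("kafka_broker", "unknown"))

def trigger_partition_rebalance(alert):
    # placeholder for kicking off a rebalance
    print("would start a rebalance for:", alert["labels"].get("kafka_cluster", "unknown"))

# Map of alert name -> remediation for known, recoverable states (hypothetical names).
REMEDIATIONS = {
    "KafkaBrokerStuck": restart_stuck_broker,
    "KafkaPartitionsSkewed": trigger_partition_rebalance,
}

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", "0"))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for alert in payload.get("alerts", []):
            if alert.get("status") != "firing":
                continue
            action = REMEDIATIONS.get(alert.get("labels", {}).get("alertname"))
            if action:
                action(alert)        # known state: run the procedure automatically
            else:
                print("no automation yet, page a human:", alert.get("labels", {}).get("alertname"))
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), AlertHandler).serve_forever()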
And yeah, let's not forget about those SLOs, let's come back to these. Our initial SLOs were okay, but not great, because client errors could count towards the error budget. What I mean is that it was technically possible for a client, so for a customer, to configure their client badly, or even use a bad client, in a way that could cause our SLO alerts to fire. That's really bad. Also, if there were issues with ingress into the cluster, meaning no traffic was getting in, no errors were reported, because we were only looking at it with a canary tool from the inside. So everything seemed fine, but in actuality the customer raises an issue and says, hey, we can't get access. When the customer tells you there's an issue, you usually know there's a gap in your alerting.

So we moved to a combination of indicators for the SLOs. We capture the errors internally where we can and where it's definitely a service issue. To equate it to an HTTP service, it's alerting on 500s rather than 400s, but in the Kafka world there are a lot of other possible errors, so can we filter those out? We also capture round-trip data from outside the cluster: can that canary tool come in from the outside, through the external ingress, and then we use that data and combine the two into a single SLO. We're also thinking about zero downtime during maintenance. Kafka clusters sometimes need this thing called rebalancing, which is complex and can be very heavyweight. At the moment, I believe the SRE team is working on introducing this thing called Cruise Control. It has some manual bits first, and then once we gain enough confidence we can automate that and get the rebalancing to trigger automatically. So there are a few things there that we're doing with the SRE team on an ongoing basis.
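To illustrate that combination of indicators, here is a small sketch of one way the internal view (service-side errors only, client errors filtered out) and the external view (canary round trips through the public ingress) could be merged into a single availability number. The counters and the min() combination are hypothetical, shown only to make the idea concrete; the real SLO was expressed over the actual metrics, not this code.

from dataclasses import dataclass

@dataclass
class WindowCounts:
    internal_requests: int       # requests observed at the brokers
    internal_server_errors: int  # errors attributable to the service, client errors excluded
    canary_round_trips: int      # external canary attempts via the public ingress
    canary_failures: int         # canary attempts that failed or timed out

def combined_availability(c: WindowCounts) -> float:
    internal_ok = 1.0 - (c.internal_server_errors / max(c.internal_requests, 1))
    external_ok = 1.0 - (c.canary_failures / max(c.canary_round_trips, 1))
    # Take the worse of the two views: an ingress outage then shows up even when
    # the brokers look healthy from the inside, which was exactly the earlier gap.
    return min(internal_ok, external_ok)

window = WindowCounts(internal_requests=1_000_000, internal_server_errors=200,
                      canary_round_trips=2_880, canary_failures=3)
print(f"availability this window: {combined_availability(window):.4%}")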
So yeah, that's the journey so far. I've left out a lot of detail; if you want to know more about any particular area, just give me a shout after. But maybe there are some conclusions I can draw from this journey, and hopefully they'll make sense to people. Every journey is different: this is our journey, this is my journey, and every organization will do SRE differently. At Red Hat we do it one way; your company might do it differently. You might hire SREs directly into the team. You might not even have SRE, but you might want to do it. If you have the right mindset and find a way to apply the SRE principles, your team can benefit from day one; that's something I definitely believe in. So if you can't involve SRE from the start, establish an SRE function within your team. It doesn't have to be a whole team of people. Luckily, in my case it was a team of three or four people at the start and we had that benefit, but it could be just one person, and that one person champions the SRE mindset and those concepts. The ultimate goal here is to have a stable, secure and scalable service, and working together is essential to make that happen. So if you do have SRE people in your organization, talk to them, get their opinion and their knowledge, because it is very valuable.

So that's it, that's my journey. Thank you very much for listening. If you are interested in SRE, please do hang around; I believe there's a birds of a feather session happening shortly in this room, and there might be some drinks or snacks. Yeah, that's it. Thank you. Any questions? You're happy, Hilary? Champion of SRE? Every SRE. Good. Yes. Cool. Okay. Yep. Do you want to grab the mic?

Do you have any stories around resistance failures, persistence failures? Do you have any specific fun stories around persistence failures, Kafka storage?

I recall there was an actual issue around a persistence failure. I might need Hilary to help me out, because it's escaping me; it might have been around the point I was joining a different team and transitioning. If you want to give the mic to Hilary, yeah.

Oh, goodness. Okay. So we had an issue with a persistence failure. For some additional context, the Kafka data is stored in a persistent volume, broken into three persistent volume claims, and there's some mathematics involved to ensure that those are never 100% full. Except there was a mathematical error and they became over 100% full. So we could not get quorum from our brokers, we could not get message ingress or egress, we could not get at the data, and we could not get the persistent volume claims or the persistent volume to increase in size. The only way to quote-unquote restore the service, and we'll use "restore" very loosely because the point of Kafka is the historical data, so when you lose that you haven't really restored the service, but the only way to restore message ingress and egress was in fact to delete the persistent volume and all of its contents, at which point the operator recreated them. Where we got very lucky is that this was an internal Kafka; we caught it with our own stuff, we caught it dogfooding. No external customers were harmed in the deletion of this persistent volume. But that was a very fun RCA, and an interesting set of arguments I had to have with our BU and others about which SLOs and which data metrics set the priority such that we would choose to restore messaging over persisting data. So that comes back to choosing a good SLO; making sure your SLOs accurately represent what's important to your business was an additional lesson learned from that particular incident.

Thanks, Ellard. Okay. Thanks again, everyone.

I have one question. You mentioned you have 99.9% as one of the SLOs. What happens when you run out of that budget of 99.9%? What are the consequences, so to speak?

So at this time, I think the wording in the error budget policy may say to get the engineering team to focus completely on that, or some subset of it. Again, I might need to defer to you, Hilary, on whether that is in practice or if there are still some things to work out there. Thanks again, Hilary.

You're welcome. So to clarify, now that we're running with full 24 by 7 SRE support, the Kafka SLA is in fact 99.95, so we're at three and a half nines. In the situation in which the error budget is completely burnt, that means we've exceeded our ability to be down and continue to meet SLAs. So not only do we now have ways of paging and raising engineering to get them up in the middle of the night if necessary, there's also a complete refocusing of the engineering team, and we have some additional processes in place to try to make sure we're back up. The nature of Kafka being super multi-tenant means it can make for a very variable user-to-user experience, right? We're running a multi-tenant service, so everybody feels whatever is happening to one customer because it's multi-tenant. But the error budget is not for Kafka as a whole, it's Kafka per customer. Yeah. Thanks. Is the answer okay? Yeah. Excellent. Thank you.

Thanks again, everyone. Have a good evening.