I've got one. Oh, wow. It's on. Hello, everyone. Welcome to the Systems and Infrastructure Track here at SCALE. Today we have Jess Males speaking on OpenTofu internals. Thank you very much, pleasure to be here. I will call out that some of my fonts get a little bit small, so if you're in the very back of the room, unless you have binoculars, you might want to sit forward. Before we get started, just out of curiosity: how many people are SREs who use Terraform or OpenTofu on a regular basis? About half the room. How many are developers interested in contributing to the project? Okay. Anyone not used Terraform previously, or not aware of where it's at? Anyone looking for a talk on tofu, as in tofu manufacturing? Excellent — that'll be covered in part two, to be given sometime next year. My name is Jess Males. I've been a sysadmin for 20 years, and I very much strongly consider myself a sysadmin; my preferred pronouns are ops/ops. As such, give me a terminal, preferably bash. Give me an editor, preferably vim. Give me a command-line game like Doom. All this squarely puts me in the camp of greybeard, which I don't mind; I accept it happens. Greybeards are known for knowing their tools well, but there's also a little bit of a get-off-my-lawn, stuck-in-the-past attitude. That being said, there are some very formidable greybeards out there, so again, it's a camp I'll happily embrace. But to not be stuck in the past, and to make sure that I'm comfortable with modern tooling — not that I wasn't, but to really dig in, to really embrace it — I wanted to look at the OpenTofu code base. What I'm hoping you get out of this talk is that if you want to break into this thing, it's very easy, it's very accessible. If I can understand it, you can too. My whole goal for today is for you to walk away from here without any fears about looking at this code base or working with it, even if you've never contributed to or read code at this level before. So what is OpenTofu? You've identified that you're broadly familiar with what the tool does: it's infrastructure as code. On one side of the screen here — right, I probably got that backwards; anyway — on one side of the screen you'll see a sample bit of Terraform code, and on the other, the plan output. The plan tells you what it's going to implement; you apply it, and it implements that. By codifying all this, all of a sudden we've got a time machine around our infrastructure. We can move forward and backward, we can make changes, and we've got that history, plus the identification of what needs to be built and how it needs to be built. OpenTofu, of course, is a fork of Terraform: open source, community driven, and managed by the Linux Foundation. Why was it forked? Mid last year, HashiCorp, after some nine years of open development under an MPL v2 license, decided to switch to a BSL license. There was little to no warning given to the community, and so there was quite a bit of reaction. I'm not going to dive into the politics of what they did; I think there are other benefits coming out of this. But just as a gauge of reaction, there was quite a bit of blow-up. I watched Hacker News on a regular basis; I'm sure there are other gauges of the community. This is not statistical or highly validated at all, but again: quite a bit of community interest, quite a bit of blow-up.
And with that, OpenTofu — it started as OpenTF, but OpenTofu as a project was formed and got going. Some of the timeline: HashiCorp announced their change on August 10th; by August 15th the OpenTF response was out, and on August 25th OpenTF was announced. And we've had releases since then: public repos were started on September 5th, then an alpha release in October and a beta release in November. And actually, considering the current state of the projector — no slides; whose talk is this? — we'll quickly move on. My view is that competition is healthy. OpenTofu, as an upstart, is very much interested in expanding features and capabilities, because they're trying to convince you to use this new tool. As much as there might be ideology involved, it's very hard to run your business on ideology, and it's very hard to convince management to make changes over ideology. But separately, as the incumbent, Terraform should want to innovate, because now all of a sudden they can't just rest on their laurels, they can't just sit still. As a quick note, most of the providers did not change their licenses; they're still released under MPL v2, and that should be a strong unifying factor, making sure there's not too much drift between the products. So with the idea that there's going to be competition: have we already seen that competition start to manifest? Tentatively, yes. I did a quick survey. This is by no means scientific because, as we all know, semantic versioning has no definitive increments; judging by release numbers, some projects have been in beta forever, and some decide when they want to move on. But counting roughly the days between releases: 1.5.5 was current when they announced they would switch over starting with 1.6, and 1.6 was released on October 4th of last year by HashiCorp. Then at the beginning of this year they released 1.7, and they're already on track with 1.8 — the beta just came out two weeks ago. As we can see, that release cadence is picking up. To understand those changes, we use tools like version control; with version control, we can quickly see the differences and changes over time. Unfortunately, as a side effect of copying the repo, they dropped all previous tags. So if you go to OpenTofu today, you'll find the 1.6 tags and that's about it, which is a little unfortunate, because if you want to compare against past iterations such as 0.12, 0.15, even 1.4, you'll not be able to reference them by tag. Again, it's Git — the commits are still there, you can still read all the notes — but that handy refspec isn't immediately available. We can restore it, though: clone your OpenTofu repo, add a remote ref to Terraform, and then fetch away. Now, you'll want to take a couple of extra steps to make sure your tags are maintained. You've got a one-time line disabling tag pruning during the fetch, which means that as Git goes out to update its refs, it doesn't drop any existing tags that may not be present on that remote. Separately, the second command here clears out all tags, because in the next step we go in and rename all the tags, prefixing them with either opentofu or terraform. And so now, if you want to, you can do that diff from the previous slide — Terraform v1.4 against v1.6 — and see those changes with an easy reference.
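Sketched out, that tag-restore workflow looks roughly like this. This is a reconstruction from the description, not the exact slide commands; the remote names and tag prefixes are illustrative choices:

```bash
# Restore Terraform's historic tags into an OpenTofu clone (sketch).
git clone https://github.com/opentofu/opentofu.git
cd opentofu
git remote add terraform https://github.com/hashicorp/terraform.git

# One-time: make sure tags missing on a remote don't get pruned locally.
git config fetch.pruneTags false

# Start clean, then re-fetch tags from each remote under a prefix
# so the two histories can't be confused with one another.
git tag -l | xargs -r git tag -d

git fetch origin --tags
for t in $(git tag -l); do
  git tag "opentofu/$t" "$t" && git tag -d "$t"
done

git fetch terraform --tags
for t in $(git tag -l | grep -v '^opentofu/'); do
  git tag "terraform/$t" "$t" && git tag -d "$t"
done

# Safer variant: fetch only specific historic tags, e.g.
#   git fetch --no-tags terraform tag v1.4.7 tag v1.5.5

# Now the historic diff has an easy reference again:
git diff terraform/v1.4.0 terraform/v1.6.0 --stat
```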
Now, doing it this way will grab all the changes, and so we've got that very important line at the bottom. Because of the re-license in the code, we don't want too much bleed there — in fact, we don't want any bleed. Do not copy code from Terraform and try to commit it to OpenTofu. So if you want to be a little safer, you can specifically fetch only the historic tags. That being said, it is a bit of a bear, because you're then going in by individual tag names to filter out what not to collect. And as time moves on, if we ever get to a 2.x release, a 3.x release, that's going to keep continuing — unless somebody much smarter with globbing wants to suggest a nicer way of filtering. So what's our model of what OpenTofu does as infrastructure as code? We already saw some code. We saw that the plan identifies what needs to be made, and the apply actually goes out and implements it. So we've got a series of resources that are dependent on each other. Maybe we've got RDS instances, EC2 instances; maybe we've got an IAM policy that needs to be instantiated. All of these things start building up and associating with each other, and it all comes together as a directed acyclic graph. Nobody ever calls it that — it's a DAG. The acyclic part means the pointers only go in one direction, so you'll not get loops, which is very important: if loops happen, you're going to get cycles in your resolution, it's going to get stuck, it's going to run forever, and that's not something you want. So then, boom: we've got code on one side. When tofu runs, it builds its in-memory picture of what that graph looks like and starts matching up the resources you've defined in your code. Then it goes out, looks at the infrastructure, and matches that up too. As part of the plan operation, the processing comes in and tells you: okay, this is what I'm going to do. If you then apply, it goes and does it. So with that mental model in place — it constructs the graph, it walks the graph, it decides what it needs to do in each individual case — let's go look at that from the outside. Tofu has some very nice logging features: it's just TF_LOG=trace in your environment, and all of a sudden it starts spitting out detail. There are several options for tracing: TF_LOG gets you both halves, and if you only want half, TF_LOG_CORE is the CLI part and TF_LOG_PROVIDER is for the providers — AWS, local, null, GCP, what have you. So you can collect one set of logs, which is handy because, as we'll see later, they get quite voluminous. And if you want, you can also set TF_LOG_PATH, and instead of writing everything to standard error across your screen, it starts appending those logs to a file. So let's look at that real quick. What have I got here? terraform_data — it's basically a null resource; all it does is create a revision — plus a local-exec that touches a file. This is for demo purposes only; this code pretty much does nothing, which makes it a great example. To initiate this, I ran the command previously described with TF_LOG set, and I've captured the trace logs already; they're in these trace files.
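For reference, the logging setup just described boils down to a couple of environment variables — these are the standard Terraform-compatible ones, and the file name here is an assumption for illustration:

```bash
# Trace-level logging, as described above.
export TF_LOG=trace              # everything: core and providers
# export TF_LOG_CORE=trace       # or: just the CLI/core half
# export TF_LOG_PROVIDER=trace   # or: just the provider half

# Append logs to a file instead of spraying standard error.
export TF_LOG_PATH=./trace.log   # file name is illustrative

tofu plan && tofu apply
```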
The output is just the normal standard out that you would see; the standard error was captured in the trace files. So real quick, from this null-provider example, how many lines does that even produce? 780 lines. Quite a bit of output, quite a bit of detail about what it's doing and how. The reason I started doing this count is that when you start looking at other providers — AWS specifically — here, all I'm doing is a get-caller-identity. It's a basic call into the token service to say: who am I, what am I doing? It's just a data call; it's not even trying to create resources. When this runs, how many lines does it produce? 3,500 — so roughly five times the amount, just making that simple call. If you actually want to create infrastructure — here's a Lambda. I did not save that count, so let's do it real quick. Yeah, excellent, that's even better. So again, for a much larger resource, how many lines is it going to produce? I expect this number: 25,000. So: lots of output as soon as you start using the cloud providers. It's useful information, but it's very easy to quickly drown in a sea of detail. Swapping back: what's in one of these trace files? With the cut, I'm just taking out the timestamps that would be the first column, because that quickly becomes distracting. Let's just walk through it. It starts up, it tells us the version, it gives us some CLI output. It says: hey, attempting to open config file .terraformrc. Did you even know you had a generic system-wide config file? Here we actually use it in development — we use this file to override our providers. But again, watching the output, we can follow along with what it's doing. Remember, it's going to build that graph, look at that graph against the known state, and then compare it to the actual state in production. So how does it do that? Coming along, we see it starts up, the CLI arguments go by, it tells us about the terminal, great. Meta-backend: the backend starts to become a thing, because it's now saying, well, we've got state somewhere, I need to load that up, and the backend is where that happens. It notes: hey, this is the override that I had in my .terraformrc — that's where it's getting that. The backend it called out to is local, and it happens to note that the workspace is default. It's looking at the file system for that backend and noting that no previously stored state snapshot exists, so it's going to create one, and create a lock file associated with it. It creates a context object — presumably that means something; we'll find out that it does in a minute. It starts looking at plugins, then builds and walks the graph, as we said — we'll see more details on that as it works — and then it starts seeing config transformer, node validate resource, and moves on. The point of this whole exercise is that tofu itself will tell you voluminous information about what it's doing and how, and even though it's a ton of information, it's pretty easy to read. We'll pin down some of those details about the backend and the context in just a minute. But in looking at all this, we started noticing transformers, and so, out of curiosity: what are those? What are the different types of transformers?
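Both of those passes — stripping the timestamp column, and the transformer tally described next — are plain shell plumbing. A sketch, with the trace file name assumed:

```bash
# Strip the leading timestamp column for readability (file name assumed).
cut -d' ' -f2- trace.log | less

# Count the distinct transformers that appear in a trace:
# grep -o prints only the matching words; awk keeps the first
# occurrence of each and skips all the rest.
grep -o '[A-Za-z]*[Tt]ransformer' trace.log | awk '!seen[$0]++' | wc -l
```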
Presumably they're doing something significant for us. So I take the output, and I look for anything ending in "transformer" — single words; the -o flag gives me just the matching words — and then against those, the awk command says: print the first time you see each one and skip all the rest. I'm not looking for a count of how many times each transformer appears; I just want the number of distinct transformers it comes across. And there's the list: 47 different transform operations. We're not going to look at them all, but they're there, they're doing significant work for us, and they're all to be found in the code. So: having taken a look from the outside via the traces, let's flip our view and take a look from the inside out. If you've got the repo handy, you can start looking at it — I'm going to reference everything there — but now it's as if we're reading straight through the code to understand what's going on. There's a very nice architecture.md right there in the docs of the repo, and it has this very helpful graph of process execution. As we noted reading through the trace log: the CLI starts up, loads the backend, starts talking to the state manager, then we've got this graph builder and graph walking, and at the end the transforms going on, which is all that vertex evaluation. So let's dig into this a little. The CLI is the internal command package right there in the repo: opentofu/internal/command. It's mostly scaffolding for us, so it's a very easy read. It parses the arguments for us, so that we know which flags we've got and whether we've got an output specified — it does all that for us. It can encrypt, which is something I'll get into in a little bit; it's a new feature in tofu, not present in HashiCorp Terraform. It starts a backend for us and creates a context, and from that context things kick off — context being the main in-memory process that holds the DAG, gets walked, and moves on from there. So the CLI is basic structure for us. It does keep a lock on the state, so that too many things aren't changed at one time and we don't overwrite anything. Importantly, if you don't know Go, it's got a very nice defer feature, and the CLI uses it to make sure we don't permanently hold that lock. The configuration loader, which was off to the side, basically takes care of the tofu init processing — meaning it goes out, grabs the providers, and makes local copies available in the .terraform directory underneath the code repo where you happen to be running. It installs all the child modules into that environment. Then, while running, as it reads those from disk, if you've got multiple instantiations of a module, those all get instantiated in memory as multiple copies, fully interpolated so that namespaces don't mix. So if you've got two versions of an application, each gets its own copy of that module, so variables aren't globally accessible across them and don't get confused. With a slight caveat: at this point we're just loading the config out of the modules — basically the source code of the modules — and so it's going to interpolate as much as it can, but it's still going to leave quite a few big chunks as HCL body or HCL expression.
That's because it doesn't yet know what the counts are going to be; it hasn't actually pulled up those variables, and it'll get all of those later. The backend represents our workspace: even if you don't specify one, you're working on default, and if you specified something else, it takes care of managing that, particularly the tfstate file. The backend interface, as a Go language construct, has two versions: basic and enhanced. Most backends do not actually implement enhanced; they run through local, and local takes care of layering that enhanced behavior on top. The enhanced interface is basically what adds the plan and apply operations, and so when execution happens, it happens through that Go interface. The backend calls the state manager — for us, it's the local file system — which picks the state up off disk and brings it in. And now, all of a sudden, it loads up that major context, which is represented in the context.go file. It points to the root of the graph and then kicks off graph walking. At this point you can imagine the state has been picked up, all the code has been picked up, they're all present in memory, but variables haven't been interpolated. We've done the refresh so we know what the current state is, but — boom — it's still got some work to do; it's chunky. Also as part of the context, it creates stop hooks, which I think of as a tin-can telephone: I'll talk on this end, you talk on the other. So if you ever hit Ctrl-C in the middle of an operation, you've got that channel — not exactly a Go channel, it's implemented through hooks — that way of communicating to the sub-processes: I bailed, save your state; you have a little bit of a grace period on that. Our config files are loaded, and each of them basically assigns a resource to a vertex. Be careful reading the code: you will see vertex and node used interchangeably. Vertex is preferred, but I didn't actually count how often each term is used. Then the context kicks off the graph walker, which starts navigating that graph. The context is the entire execution environment for our code, but as it comes across each module, it enters that path and starts off a tofu eval context — and so now you've got those per-module environments, separately namespaced. Again, a sub-environment, so modules can't step on each other. It's pretty much doing the same thing; we've subdivided the problem. And as it goes across each vertex, it's going to try to do these concurrently where possible, but as the graph is built up by its transformers, there are going to be dependencies, and it will respect those. The transformers, as they happen — here's a sample of four, but you can read your own; there are plenty in there. The config transformer populates config from code, turning it into resources. What does that mean? It means it's actually loaded up the module library — whatever HCL you've written in your local files — and matched those to resource vertexes. The state transformer basically does the same thing, except it's the side pulling in from state. Then, finally, reference transformers come along and do the actual matching: determining, okay, these are the things associated, and pairing them up. So if you've got cross-dependencies, it's doing that matching during the graph walk.
And then the provider transformer comes along and says: okay, you expect to be an AWS EC2 instance type — great, we're going to map you back to the provider to make sure that code gets loaded before we even go on. So how would you write your own provider? It's actually pretty straightforward. There's a very nice series of developer examples available, but there are two versions, so I will caution you: make sure you get the latest one, that being terraform-provider-scaffolding-framework — which, helpfully, has a name very similar to the older terraform-provider-scaffolding — so make sure you get the framework one. The newer framework is more strongly typed; previously, in the older style, you had to do a lot of casting and other management yourself, and everything fell back to a basic Go interface, so getting the types right could be very difficult and sometimes trip you up. The provider struct basically has, for itself, a metadata method, a schema method, a configure method, and data source methods, and these are basically registering its capabilities with tofu itself. Separately, you will have a series of CRUD functions — create, read, update, delete — which face more toward your code. In fact, I'm going to skip ahead: with your provider in the center, it should be a very thin mapping. On one side you've got the configure side, which faces OpenTofu — configure, metadata, data sources, resources, all there so that OpenTofu can come along and read you — and then the CRUD functions for create, read, update, and delete talk more toward your API to go out and do those operations. That's what I'm referring to on this slide: they're basically very thin registration methods, and writing them, you're going to write five lines of code just saying, hey, here are the functions to look up for resources, and you just start mapping those forward. As you do development, I will caution you: in the scaffolding there are some assumptions written down along the lines of, hey, if you're writing a HashiCorp Terraform provider, you should be referencing Terraform — which was true up until August of last year — so in developing for OpenTofu, you do want to override those. But if you are considering writing your own provider, keep it as thin as possible; it should be a simple mapping, and we see that here. Don't try to reinvent the wheel; follow the leader — there are some very good examples out there. Grafana I like because it's smaller and more tractable; and of course terraform-provider-aws, the 800-pound gorilla in the room, is solid, it's large, it does a lot of great stuff, so it's useful to look at, but you might get lost in it too. By means of comparison, the repo size for Grafana is three megabytes, 180 Go files, maybe 30,000 lines of code; if you start looking into AWS, you've got 200 megabytes to deal with, 7,800 files, two million lines of code. I find Grafana much easier to read from that standpoint. And then also, when building, the documentation helpfully provides this line, and I'm going to say: for the love of the Flying Spaghetti Monster, don't do this — rename your binary. Because the thing you don't want is to be troubleshooting, making changes to what the binary does, and have those changes end up in production; with identical binary names, there's a chance you mix those up.
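Something like this hypothetical sketch captures the convenience wrapper described next; the exact alias from the slide wasn't captured, so the details are assumptions:

```bash
# Hypothetical sketch of the `ot` convenience wrapper described below.
ot() {
  # Run `tofu init` automatically if this directory isn't initialized yet.
  if [ ! -d .terraform ]; then
    tofu init || return $?
  fi
  # plan, apply, state, ... everything else just falls through.
  tofu "$@"
}
```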
And then, for the truly lazy: if you don't like typing tofu, tofu, tofu, give yourself a convenient alias, ot. I like this one because I often forget when I need to init and don't want to pay attention to it, so by default it checks whether I need to run an init and does so if I do, and then moves on. And then that else branch: if I type plan, everything falls through; if I type apply, everything falls through; if I need to do state operations or anything like that, it just falls through. If you're not familiar with Golang and want to get involved, I recommend this book highly — particularly chapter three, which is a great overview of Golang. How many of you have done significant projects in Golang? Okay — chapter three for the rest of us. I can't remember the page count, but coming from Python and being very comfortable with Python, it's just a great adaptation: it runs through the major features, and it's a great way to get up to speed with what Golang does, how it works, and what its larger ideas are. If you are looking to get involved with the OpenTofu community, Slack is a great place to chat with others. There are also the issue boards, which are very nice for seeing what features people are missing or what's not working well. If you want to stay involved with the larger project, they actually publish weekly updates in the repo itself, which are very nice to read, and the steering committee meets on an as-needed basis. The last time they met was in February, but their notes are published as part of that. The steering committee makes decisions about longer-term project efforts — hopefully giving a yes or no: we want to go in that direction, or we don't — just to help people understand whether this is where you want to spend some development time, so that people don't spin their wheels. And then finally, the release notes, of course, are good. The project has been running since last August; what work is there to show for it? State encryption has been added, so if you want to further protect your secrets and other production values — API keys, what have you — you can do that now. You can choose to protect the whole state file or just part of it, and you can also choose to protect plans. So if you want to protect secrets while they're in your CI/CD environment, that can happen as well, which is pretty nice. If you are going to contribute, following up and raising issues, there are a couple of templates on their GitHub page for doing that: you can file simple bugs, you can file feature requests, and if you've got the idea for a larger feature and really know all the details of what you want to do, you can write it up as an RFC — they've got some structure for that as well. If it's a minor bug fix, go ahead and write a PR. But if you think there needs to be a larger change — we appreciate the enthusiasm, but do open an issue first and get some community interest, because if it's not going to align with the larger goals of the project, they don't want you to spin your wheels and spend too much time developing something that isn't likely to get used or will have to be refactored. So save yourself some time: get the alignment first before you start hacking out too much code. That was my quick overview — any questions at this point? Well, thank you for allowing me to present, and if you do have questions, I'll be up here after the panel. Slides will be available — who do I give those to, by the way?
Okay, we are at time. Welcome to the SCALE Systems and Infrastructure Track. Today we have Heather Osborne, who will tell us about why organic isn't always good for you. Thank you, Alan. Welcome to my talk. I'll let you know right away: we are not going to be extolling the virtues of GMOs over organic. This is not that kind of talk; I am not that kind of person. Okay, you're welcome to go, that's fine. The title is the net result of me being a snarky nerd who sometimes thinks she's clever and/or funny. So what are we going to talk about here? Well, first we're going to introduce me. I am Heather Osborne. I am a 25-plus-year systems engineer, DevOps manager, architect kind of person. I spent about 24 years at Ticketmaster, where most of my career was; I worked on legacy and on-prem systems. A couple of years ago here I gave a talk to that effect — Modernization as a Forcing Function — about how to encourage people to move on from legacy paradigms to more modern solutions. This talk is not on purpose an extension of that talk, but it is one just by the nature of my career development over the years. More recently I've been at a couple of startups on public cloud — everything I've done since 2020 has been entirely public cloud — and I've served as both a senior director of DevOps and a systems architect. A little more about me: I'm a bona fide crazy cat lady. I've got four of the beasties. I had to air-gap my clothing and my laptop bags so that I could come here and look moderately presentable, not covered in cat hair. They are particularly adept at walking across my desk when I'm in the middle of production console work. I'm also a distance runner, so I get a lot of time to think about these things while I'm out running. And I'm an immersive camping enthusiast — if you want to know what that means you can ask me later, but basically I go out and be a jackass in the desert at really weird events. So what's the story I'm telling here? That organic isn't good for you in every case. The analogy fits, particularly in the case of infrastructure: just because something isn't artificial, isn't fake, doesn't mean it's good for you. I want to remind you that just because it's organic doesn't necessarily mean it's good for you, and this particularly applies to infrastructure. You have the rapid organic growth, you have all of these things that you can no longer decipher — you can't figure out where the vine goes on the fence, none of these things. So how many of you work in a well-architected environment? Raise a hand or two? Oh, I see the architect over here raising his hand. Were your systems painstakingly designed and laid out with a thought toward all the -ilities? The security, the stability, the maintainability, the sustainability, the general promoting-the-go-to-bed-on-time-ability? No, nope, nope, okay. Oh, Alan over here is going "well..." Yeah, I wasn't expecting to see a lot of hands here. So this is where we're going with this.
So what is the story we're trying to tell? And I can't read my screen — this is great. Okay, I got it, never mind. The first segment we're going to go through is how you end up creating an organic mess. Next, we're going to tell the story of the background spaghetti. Then we're going to ask: do you really know what's wrong with your systems? Probably not. And then you realize, whoa, this is actually a bigger problem than we were expecting — you might as well just eat the whole thing. You're done with this; there's no business in keeping it around, it's more effort than it's worth. And finally, we'll talk about what we did: solving it with tools, for fun and profit. As I said, I spent my 24 years or so at Ticketmaster dealing with legacy systems on mostly on-premise infrastructure; I'm going to give you the story of my more recent journeys at startups. My unfortunately recently ex-company — it's not a good time for tech startups, so be it — but the name of the company is irrelevant. The story still needs to be told, because this is something I've hit every time I've worked anywhere: every system I've encountered has been this organic mess of a beast that grows over time. I will say that one benefit of this being at a startup is that it's not as deeply entrenched in the infrastructure, and there's less far to back out of it to get yourself onto a good path. So, the model of a startup: you hire some people with some brilliant ideas. Super smart people. They have this idea that's going to make you a million dollars — let's go run with it; a million dollars is undercutting it a little, but it is what it is. These people are not necessarily infrastructure people. They don't know what they're setting up. They have no idea what a VPC is, no idea what an internet gateway is, no idea what database to use. But in this day and age of cloud services, you see these ads come up — set up an application in a day. They see this on the AWS site and they go, ooh, I can do that. And you get a whole team of people working together: really smart developers, but they still have no idea how you should back things with a database, how you should set up security groups, none of these things. We'll use Elastic Beanstalk as an example — "quickly launch web applications" is the line about Elastic Beanstalk. Bam, the site's up. You've got this great site, it was a great idea, you're making all sorts of money on it, and you've got people asking for features, because this is something you want to expand and run with. Well, what happens? It's booming — let's hire some more engineers, let's get these features built. This expands into organic growth: you start adding things to the things you already created, which you didn't necessarily know how they worked in the first place; you just bolted some things onto the side. Then you go: hey, we should probably hire a DevOps engineer. And I will say "DevOps engineer" is a misnomer — it's not an occupation, it's a philosophy — but for the sake of argument and for the sake of ease, I'm using the terms DevOps engineer and/or SRE interchangeably. So you get your first DevOps engineer. They come in and go: wait, what? What is this business that I'm looking at? What is this nonsense? What is this organic beast that you've just now unleashed me on? And we have to be compliant.
How do we get our security laid out in a way that makes sense, when I don't even know how this application works? How do you do your releases? What is it that you do? How do you promote things through your environments? Where's the code in the release process, in the software development cycle? You got your software in my infrastructure as code; I got my infrastructure code in your software. Things are tightly coupled; you can't unravel it without going very much in depth. So how do we start untangling this, and how do we start thinking about it? As I mentioned, this is the tale of my DevOps journey at startups. We've got all very knowledgeable and talented people who understand industry best practices — this is me referring to my DevOps team in particular — and they know how to work with our stack. We'll say AWS, Kubernetes, GitHub, using Terraform for infrastructure as code, backed by a Postgres database. For the most part it's primarily Ruby on Rails, and we use Istio as a service mesh. This sounds absolutely straightforward; you go, okay, yeah, standard web application built on public cloud. But go back to the concept I described — a startup, built by folks who didn't necessarily have an idea of what they were doing when they built it. You end up with things put together by trial and error. You've got the organic growth, and you've bolted things onto the side. You have copy-pasted code, because it worked before doing this one thing, so let's just change a couple of values in here and make it do something entirely different that it wasn't designed for in the first place. You end up with an epic spiderweb, a big pot of spaghetti, a ball of yarn. So, before the current team got involved, there was a great concept for enabling developers to move quickly: we used ephemeral environments we referred to as ODEs, or on-demand environments. This worked great for a monolith. But things don't always stay a monolith — you, again, bolt things onto the side, you get that organic growth, the spaghetti leaks out of the pot. So this on-demand environment is a great idea for the monolith, as I mentioned: it spins up a temporary Kubernetes pod, basically puts your code there, runs all the tests, spins it down, and then it's ready to roll to production. That's great. But what happens when you branch that application? What happens when you throw some microservices into the mix? You add an authorization service; you extract the front end from the monolith. Now you have these short-lived environments that spin up, do a thing, and test the monolith — but they are not QA'ing your code at all. All you have is this very small piece of the picture being tested. So, fine: you need a more robust test environment. But you're not allowed to touch dev, because, as I mentioned, things grew organically — the dev environment was originally supposed to be a proof-of-concept cluster, not actually an environment. In prod you have Beanstalk; in dev you have Kubernetes. That means prod and dev are nothing alike: the code's the same, but the mechanisms behind it are entirely different. Well, people need to test their stuff. So what do you do? You build a release environment.
The release environment is a little more like prod, kind of like dev: it's got the microservices, it's got the auth service, it's part of the dev cluster, and it does things that moderately resemble prod but are not actually what you want to be doing. Another problem: only one person can use this magical test environment at a time. And you have no visibility into what this test environment is running — what version of code, whether it's currently being used for UAT, whether it's currently being used for microservice testing, et cetera. So: we can't mess with dev, because that would impair developer velocity. We don't have the resources or the time to build out the on-demand environment to be more robust and include the microservices and the other services. So what do you do? You know what — you might as well test in prod. That's the closest thing to prod you're going to get. So the team sat down one day to make a simple change. We had an alert saying we weren't getting all of our audit logs. That's a critical alert, because we were required to have audit logs. One of my engineers sat down and started debugging it: okay, it's a problem with Fluentd, we're just going to upgrade that, take a second. Nope. We don't know why that's where it is. Okay, well, we don't know why, if I put this block in here, it doesn't do anything. We don't know this. So we pull in our lead dev. We pull in another dev. We pull in another dev. Pretty soon there are five of us, including myself, sitting there debugging a very inconsequential problem. So this simple change — theoretically just upgrading Fluentd and figuring out why it's throwing an error — ended with five well-paid engineers sitting on a call for the better part of the afternoon. That is a huge waste of resources, and it was by no means an isolated incident. You have all of these problems where, because your code makes no sense, because your infrastructure code makes no sense, because it wasn't developed by someone who gave it half a thought before they put it in place, you end up playing this game of: unravel a little bit of the sweater — oops, that's not it, let's tie it back up; let's unravel the other sleeve; let's go figure things out. You spend days and days on end debugging simple issues. And don't even get me started — I was the one responsible for cost optimization. Like: oh look, there's a Microsoft SQL database sitting out there. What is that? Nothing's tagged. Another addition to that pile of problems where we can't figure out the source. So nothing's tagged, we have a Microsoft SQL database out there, nothing in our shop runs Microsoft — why is this thing here? It's costing us three grand a month. Well, let's destroy it. Was it built by code? Was it built by ClickOps? How do you get rid of it? If I destroy it via ClickOps and it's in code, is it going to come back? What do you do? All of these problems just resulted in us going: wait, what is this nonsense? It just got worse and worse and worse. So: we need to figure out what's wrong. What in the name of organic nonsense is this infrastructure we're working with? We can't provide the software engineers with what they need.
That's a testable environment. We can't unleash them on their own to operate in the true sense of the term DevOps, because the Terraform was so convoluted that DevOps would spend half a day figuring out what was going on; they couldn't reasonably dissect it. I fondly referred to it as Terrorform. And we also couldn't be happy DevOps and security engineers, because there's zero consistency, zero documentation, no failover, no disaster recovery, and no best practices. When we read a white paper and go, ooh, that's great, let's incorporate that — there's no path to that when you have this overly complex environment. So, what do we do? We need to determine where it hurts. One thing I've learned over the years: developers will say, "it sucks." You know what? "It sucks" is not an answer for what's wrong. It is a blanket term used for every technology issue. Well, what's wrong with it? "It just doesn't work for me." Also not helpful — doesn't tell you a thing, doesn't tell you where the problem might lie, doesn't tell you what we can do incrementally to get you out of this blockage you're in. So, to start unraveling this and move to a model where people can speak about things intelligently — as a factor of tech maturity, more eloquently than "it sucks" — you need to actually talk to the people. You can't throw out a SurveyMonkey asking, okay, what's wrong with X, Y, and Z? You'll get answers as vague as the question, and it won't be data you can actually use and turn into pain points. So what we did: I conducted an interview with every team lead and a member of their team to figure out what was going on. My intent when I kicked this off was to find out what the low-hanging fruit were — organic, of course. What can we do to unblock the developers? So, I'm going to show you a big nasty diagram here, and I don't know whether you can read any of it, but this was the net outcome of my interviews. And apologies for the typos: because I no longer work for the company, I transcribed this from a screenshot, so it's my horrible typing. I broke things up into groups: whether things were taking too long, whether it was a consistency problem, whether it was a documentation or policy issue, whether it was a test environment issue, a simple complexity issue — well, there's irony, or an oxymoron, there — whether it was a release process or a testing process issue. And I broke things down by team: what impacted the security team, what impacted everybody, what specifically impacted the data team. I found some patterns around these things, found what was really at the root of people's pain points. And you'll see here, there are a lot of things. Can you actually read this from somewhere? Okay, that's great — you are one up on me. So, I'm going to move over here to the side so I can see this. We've got some of the release process and testing process ones: no decommissioning or off-boarding consideration — this goes back to my original talk about getting things to move to end of life. We have out-of-date documentation. If you look at the data column, we have an entirely separate data ecosystem from the app ecosystem, which means they are doing things very differently from how the application teams are doing them. We have things like the on-demand environments called out there — that we don't have the microservices attached to them.
We have things like the product and product design teams trying to do UAT against the release environment — and there's only one of those release environments, so that's not sustainable. The complexity: we couldn't set up a new firewall; we couldn't test out new software, such as adding Confluent to the mix, because of the way the network was set up. So it was, okay, we have to go rip off the band-aid — we can't test these things in advance, because we don't have an infrastructure- or security-specific environment to test in. And, let's see what other ones: no standards, no load testing. All of these things are very common. As I said, none of this is specific to the company I was working for; this has really been a plague at all of the companies I've worked for. But this was particularly interesting to me because I had the luxury of being able to actually collect some data and find some patterns around what was causing the legitimate issues. (Sorry, if you were following — yes, the colors: the legend is down the side. And like I said, never mind my misspellings; I know "consistency" is spelled wrong, among many other things. I started to try to load a new copy of this with the spelling corrections, and I couldn't do it from just my laptop because my screen was not big enough to take a screenshot. You get to deal with my typos. Do we have budget? No — budget comes in the next part of the discussion.) So, because this was so huge — if we were doing an agile process, we would have done an inception, figured out what we were doing, figured out the plan — but there was way too much going on here. So we had to take a two-pronged approach, and I'll say we also had the luxury of having a developer productivity team. They were more developer-facing; they were a lot more in tune with the very small bits they could correct to help the developers in their day-to-day life. So they were identifying the really, really low-hanging fruit — like there's one up here that's "move this to a script," and there's one that's "we don't know in Slack when something's been released." Those were the sorts of things the developer productivity team was able to take care of. They worked on a number of these things on the side, and this was great, because it gave the engineers the sense that we were working really hard on things for them: we were trying to unblock them, trying to make it suck less for them, you name it. So, a couple of the things that team did — and these are their story, not the direction the DevOps team was going, but just so you get a little background. We added some observability to the release environment so we knew what versions were running there. We added GitHub merge queue, so you no longer had to sit there and wait until a Slack channel updated with "this person is done" — now Bob can take over because George is done. We set up a golden DB, because we always had stale data in these on-demand environments, and the release environment's data was nothing like prod's data was nothing like on-demand's data; you want a set of data that is consistent across the board. We improved the pipelines so that more of the testing could be done locally instead of chewing up our GitHub Actions minutes — there's some cost savings for you. We incorporated API contract testing, which hadn't existed previously — or it had, and nobody had bothered.
The DevProd team — the developer productivity team — was a fantastic partner in this whole endeavor, but as I mentioned, the items I just listed were their story, not the DevOps side of the story; they were happening in parallel to knock out some of the more immediate pain points. So, what was the DevOps side going to do? Well, we had a huge infrastructure problem, obviously — I see Mark over there rolling his eyes. We've got a couple of choices. Choice one: Greenfield. Yes, that's a bad word in a lot of cases. You know why? You build this really wonderful system that you've worked on really hard, and it's perfect and everything's great, and then you release it to the wild, and 60% of the people move over and the other 40% go, I'm not going to bother. And then you end up supporting both the brand-new shiny system and the old system you wanted to get rid of in the first place. So we were understandably hesitant to go down this route. Choice two: as mentioned, prod is different from dev. Dev is on Kubernetes, prod is on Elastic Beanstalk. Technically dev is the newer version, but prod is the more consistent version. What do you do? Well, in theory, you clean up prod — as in, let's get it upgraded to Kubernetes, let's clean up some of the logic issues around prod. I don't know if you saw on the earlier slide: dev had overlapping subnets, which was why we couldn't just upgrade dev and move forward with it. The concept being: you clean up prod, you make a copy of it to dev, you use that as the new dev, you use the new dev for the more complicated releases you need to do, and then you go back and forth, iterating through what you need to do. The estimated time on that was at least 12 months, because you'd need to be slow and cautious in getting these things cleaned up — you're still messing with production, you're still doing things untested, you're still doing the things that were the whole reason you wanted a solid dev environment in the first place. And I didn't mention at the beginning: this was an urgent issue, something we needed to resolve now. We didn't have the luxury of 12 months to work on it. And then choice three — this is the one that makes me cry — improve in place. This is a continuation of the effort where we just kept spending all afternoon working on a very simple item that should have taken us 15 minutes. The problem with this is that we didn't have the time, nor the personnel, to lock people in a room for half-days figuring out each step. It doesn't have a quantifiable time to market — we couldn't even put that 12-month figure on it. And it also doesn't have a definition of done: we wouldn't know when we had finished what we were trying to do. So we took this and presented it to the people with the money — back to the budget question. We laid out a very careful matrix of what the risk assessments were, what the time to market was, what the associated costs were. For example, the clean-up-prod-copy-to-dev option was going to result in a duplicate environment for a significant period of time; Greenfield was going to have an extra environment, but you could move a lot faster in that extra environment. You name it — we did all of that analysis. I don't have that sheet to share with you, but let's just say we went through all of these things and ranked them by scores.
We tried to figure out what we could do most quickly, what incremental improvements we could offer, what low-hanging fruit we could give to the developers to show that yes, this is something that's helping them. So, our path to buy-in: we chose the Greenfield environment, because the other two options really were not viable. It pained me to say that Greenfield was the best solution, because I've been through it before, but there really wasn't a better option. So we presented that matrix, and everybody went: yeah, you know what, I think you're right, I think we're going with the Greenfield version. Nobody wanted to continue with the convoluted, untestable, maybe-secure infrastructure we had. And we wanted to have a path to getting people to upgrade by making it so appealing, making it so fun, making the toolbox so easy to work with, that they wanted to move there. It wasn't a let's-go-play-in-your-Terraform-for-six-hours-to-get-you-unblocked; it was an oh, here's the thing, and this is how things work together. So we got the go-ahead. I segmented my team a little bit. As a management person, I don't like giving one section of the team something new and shiny to play with while making the other folks play with the old stuff — it's just bad practice; it lends itself to a lot of unhappiness. But what we were trying to do was cut down the existing support of our current infrastructure. It was a keep-doing, start-doing, stop-doing type of exercise, where we could say: this is not helping us, this is not something we want to sustain into the next year, so let's just stop doing this thing — and we could then put more resources toward building the new instead of just focusing on fixing the old. Obviously break-fix and things like that were still going to be an issue, but the bulk of the effort was put toward building the new environment. So, what do we want out of a new environment? Well, we never want to do anything that's not in code. We had a mix of done-in-code and not-done-in-code, with no rhyme or reason to why things were the way they were. We want to always include code validation — include everything validation. We want to make sure things are ready to go, and to know from the ground up that they're ready to go. We want to keep documentation close to code. How many of you know what a Confluence graveyard is — where all of your documentation goes to die? Unless you have a full-time dedicated tech writer, which I don't think most of us have the luxury of, you really can't find anything useful; you do a bunch of searching and hope you find a keyword that might match. You put your documentation in the same repos as your software — that's how you find your documentation. We simplify the product engineering job: they don't need to know the ins and outs of Terraform, they don't need to know the ins and outs of Helm charts. They just need to know what they're asking for, what it is they're trying to build, and why they're trying to build it. And mind you, we're providing a paved path. They don't have to go down the paved path; they can go off the path and support it themselves, but that's a choice — that's not the way we're making them go. They can diverge if they have the know-how to support diverging from that paved path.
If they don't, they can go down the normal path and everything works the way it's supposed to. Testability — most of that grid I showed you was that we had no way to test things outside prod. Consistent environments — that's another one: dev and prod, nothing alike; security — what environment? We don't have one of those. We wanted a low-touch release process with visibility: we wanted to know what version of code we were on and where it was at any given point in time. We wanted some pretty sturdy guardrails so that we didn't end up with Terraform spaghetti again. We wanted to make things more straightforward — to be able to see the bigger picture without having to spend hours dissecting the solution. No spaghetti; spaghetti is bad, especially organic spaghetti, especially for the gluten-intolerant among us. So — it's so clean. What are we doing here? We're doing some declarative GitOps: we promote the desired state to the intended environment. We're creating a platform cluster. Oh my goodness, DevOps has a playground! DevOps has a place where they can do things. We have an environment where we can now test changes without impacting production — previously, we were entirely test-on-prod. (The folks working on our product delivery are going: why are you skipping over the section about testing? Once we merge code, it's in there. Sorry.) Security was going to be thrilled to have a testbed as well: I want to test this new firewall, I want to test this WAF, I want to do these things — they wanted an environment for that. So we're going to create a platform alpha, a platform beta, and a platform stable. Code will never be deployed directly to platform stable, because things should be promoted through the lower environments for all of the validations I mentioned we were aspiring to. Obviously there will be a break-glass, because reality — but that's not the way things are supposed to go. And then when we get to the app side, everything will have had to pass through the alpha, the beta, and the stable again — the equivalent of dev, staging, and prod. So, I'm showing you a diagram here. What's the problem we ran into? We want to use the same tools to deploy the base infrastructure as will be used to deploy the applications. Who would have thought we'd actually want a consistent set of tooling across the environments? This actually caused us a little slowdown: we realized that our old templating process, the Cookiecutter, wasn't going to meet our needs, because it was as spaghetti as everything else. It didn't follow the normal paradigms of a Cookiecutter application, so it was unnecessarily complex — and my team sat there for days going: wait, what is this thing doing? Why is it doing that? So the decision was made to fix this before we moved forward. We moved the whole Cookiecutter concept to a GitHub application that uses GitHub topics to subscribe to template repos and get updates. This saved us from our previous complication, which was: we make a change to the Cookiecutter, we push out all of these PRs that everybody needs to merge, everybody looks in GitHub and goes, oh, there's a PR there that needs to be merged — it's not mine, I'm going to leave it there — and we end up three months later with nobody having updated their templates. So this was a fairly quick win.
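As a toy illustration of that topic mechanism: a service repo might opt into a template stream just by carrying a topic that the (custom, in-house) GitHub App watches for. The org, repo, and topic names here are made up:

```bash
# Hypothetical: subscribe a service repo to a template by tagging it
# with a topic the in-house GitHub App polls for.
gh repo edit my-org/my-service --add-topic template-rails-service
```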
As I mentioned, we wanted to give the developers a quick win so they could see the benefits of what we in DevOps were trying to do. It took a lot of the heavy lifting off the engineers' hands. They no longer had to go merge all those PRs that meant absolutely nothing to them. They now had version control; they could see what was being changed with every iteration that went out. So, once we got this aspect of deployments straightened out, we were able to run with setting up our EKS clusters. We worked with a mix of the Amazon EKS Blueprints and some of the Gruntwork reference architecture to come up with general best practices for setting up EKS clusters. And I'll call out here: this diagram was drawn in Excalidraw. If you don't know Excalidraw, it is fantastic for making simple diagrams, and it's open source. We found it when a salesperson was using it during a demo for us; it was super easy and we went, ooh, that's neat, I need that. I see everybody looking it up on their phone. I don't work for Excalidraw. Okay, so I mentioned that one of the things we need to do is validate everything. We want to validate locally, so we use things like pre-commit hooks; we want to detect problems before they move to version control and keep them out of the more expensive side of things. Don't put it in GitHub if it doesn't need to be in GitHub yet. (I'll show a sketch of what that config can look like in a second.) We want to validate in CI, so we've got Conftest and things like that, testing against the structured configuration data so we know exactly where we're going. We want deployment validation, so we've got Gatekeeper/OPA, an admission controller for Kubernetes that enforces policies in real time during resource creation. We explored Kyverno a little, but opted to go the Gatekeeper route. And then validation in the cluster at runtime: we used AWS GuardDuty for that, for threat detection and things like that. This is somewhat independent of the application itself, but we want to get everything validated before we even spin up the infrastructure, and then continue on into the application. So, tools: what did we use to build this all out? We've got Kargo and Argo for continuous delivery and lifecycle orchestration. As I mentioned, we never deploy directly to stable; we promote things through the environments, using the terminology Kargo calls out: they refer to things as freight that you ship through the process. We do all the validation in the lower environments, including the container images, the Kubernetes manifests, the Helm charts, the repositories, et cetera. We use Harbor to store the artifacts and have them pre-scanned. We use GitHub Actions and GitHub Apps. Istio service mesh: we did have Istio before, but set up by folks who didn't fully know what they were doing. Istio is a fairly complicated thing if you don't completely rein it in, and we had ended up with this very big mess of Istio that wasn't doing at all what it was intended to do, and the thought of upgrading it struck terror in our hearts. But setting up Istio from the ground up, so that you knew exactly what it was doing and it was specifically set up for your environment, was not a bad call. And then Terraform, hopefully this time with less terror.
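As promised, here's roughly what that local-validation layer can look like: a generic sketch of a .pre-commit-config.yaml using the community pre-commit-terraform hooks. The hook repos and pinned versions are illustrative, not necessarily what our team ran.

```yaml
# .pre-commit-config.yaml (sketch): catch problems before they ever
# reach version control, per the "validate in local" layer above.
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.88.0            # illustrative pin
    hooks:
      - id: terraform_fmt       # canonical formatting
      - id: terraform_validate  # syntax and internal consistency
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0             # illustrative pin
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
```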
Had I still been employed, I would have been showing you a demo right about this point. But I thought the story of how we got to thinking about building up environments that use things the way they should be used was still a valid story to tell. So with that, any questions? Yes, Bishop, right. Not for nothing, back in the beginning, could you just shut down your Microsoft SQL database and see who screamed? So Mark, you probably dealt with me doing that at Ticketmaster a couple of times. We did, that's ultimately what we did, but it didn't answer the question of whether it was in code or not. How did the team respond to the change, and what did you have to do to get people to buy in? Or were people pretty bought into wanting to go to the golden path in a new, shiny way? I couldn't quite hear you. Okay. So, what did you have to do to get the team to buy in? How was it received? Did you have to do an internal evangelizing effort, or did everybody realize that you had to change and that this was a good thing? So this is what I was saying about needing to show incremental improvement. We didn't want to go develop a new infrastructure in a vacuum. We needed to show little bits and pieces that would be appealing to people as we went along. We had to keep the regular demos going: this is a thing, this is the way it works now, this is what we're going to move towards. The developer productivity team I mentioned was very in tune with it. They were very gung-ho about it because, like I said, they were the ones picking the little bits off the corner of the glass to make it a little more transparent. They were evangelizing it more than us. We were working on getting it set up in such a way that we'd be able to move to it sooner rather than later. But when I mentioned the replacement for the Cookiecutter setup, people went, oh, that's great, I don't have to merge all these crappy PRs that I don't know why they're even here, I'm not gonna touch them, and I get the most recent version of things more easily than in the past. Yeah, a similar question. I have written several project plans as a developer, because I want to be able to convince my bosses, or just lay out all the options. You talked about cost and risks; do you have any recommendations for general best practices? Because most of what I do is lay out a feature matrix, and I don't really consider the business impact and that sort of thing, and I think that's the next level for me. Yeah, that was a very useful exercise for us. We basically did a ranking of one to five: how risky is this, how much is this gonna cost? What is the cost of having an environment that's not getting any traffic stood up for X amount of time, versus having an exact copy of the huge environment we have, which probably has a bunch of garbage like that Microsoft database that doesn't even need to be there, a full clone of that, and things like that? So it was a super useful exercise to think about it from the business perspective. That was where I got pulled in: trying to convince the people with the money why we were doing these things. The tech people are usually relatively easy to sell.
If you say, I'm giving you something cool to work with, they're gonna go, sure. It's the business people signing the checks who are going to be a little more hesitant, and if you can convince them by way of data, you usually have an easier time of it. How long did this take? The part we got up to, starting from "it's time to start thinking about what's wrong with things": all of the assessment and the interviews started a little under a year ago. We had buy-off on the assessment by summer. And by summer we'd set up a POC. Yes, you want everything to be done in code, but sometimes you need to take a break and do some click-ops for a little while, just to get your ideas down before you start committing them to code. So we did a POC that was due in September. From there we were gonna do an MVP, a minimum viable platform, where we could start rolling through the process of setting up the infrastructure using the same things we were ultimately going to use to deploy code. That was due about a month ago. So it's the better part of a year, but it turned out to be a significantly heavier project than we thought it was going to be when we kicked it off. Like, oh, we'll just fix it for the devs, no problem. And that original project, when I first started, got tossed between team after team after team: oh, I can solve this; oh, no, I can't; oh, I can solve this; no, I can't. You just needed to rip off the band-aid and go build it fresh. And I know you said you're no longer with the company, so I don't know if you can answer this, but I'm curious about adoption. Was it successful? I'm still in touch with the team, obviously. The team is still working on it. The infrastructure is stood up now, and they're working on getting the initial applications ported over. One of the nice things about the Cookiecutter-type setup was that you could use the same one on the legacy code as on the new code, and that was going to make the step to the new environment a lot easier to take. But we didn't want to start by moving the monolith over; that wasn't going to be the easy one at all, because it was gonna need to get migrated to Kubernetes and so on. It was more: let's pick off some of these microservices and get them moved over, at least get them using the new templater (that's the better word for it), and then when the time comes, we can just stand them up on the new infrastructure. So this was not customer facing; there was no impact to end users. But it was the folks on the inside saying, hey, we need to upgrade to this version of Postgres because it's going end-of-life on AWS if we don't do this now. Well, how are we gonna do that without causing customer impact? And that's where we had to start thinking about how we do these upgrades, how we test these things. We had one database upgrade where the easiest route was: let's just cut over to the new one, see if it works, and if not, cut back. That was not an ideal state to be in, and given the nature of the business, we didn't want to impact customers at all. It's the healthcare industry, and we really don't want to leave somebody hanging on that front.
And it really grew from the developers going, this really sucks. We had to figure out why it sucked for them. Yes, DevOps knew exactly why it sucked, but to serve our customers, and the developers were our customers, what did we need to do to make it better for them? Ultimately, time to market for products was serving the customers outside the company as well as our internal DevOps customers, the product engineers. Question: was this originally a startup that had grown into something with a whole bunch of spaghetti? I imagine that because you were living in this, you had time to figure out what you needed, you had metrics, and you decided what you were going to create from that. But if you were instead creating an entirely new organization, what would you do differently as a result of what you learned here? So I think the question was: I had the benefit of a little more flexibility because I was in a startup organization, but if I ended up in a different, bigger organization with less flexibility, how would I...? Not quite. If you were to do this all over again, right? I don't know if you were there when the business originally started, but because you were in this, you had time to figure out what you needed; you would have had metrics, for example, for what you needed in your greenfield. If you didn't have that, and you were just setting up a new business, would you have new lessons learned to apply? Okay, I see what your question is: if I had the information I have now and I was setting something up from scratch, would I be more successful at it? And I would say, yes, hell yes; the more you know, the better you are. I'm thankful I had this experience, because I now know what sorts of things can go absolutely wrong. Obviously that's gonna change, because what you knew yesterday is different tomorrow. But being able to form the concepts around dissecting what needs to happen is something we'll carry into the next round. You've gotta keep up with current technology; that's why coming to something like this is great, because you get to hear what the next generation of things is and work from there. If you know what the problem is in your environment, you can start something new using the new and shiny. That's outdated a week later, but you know. Yeah. Would you have any particular things in mind if you were making something new? I'm sorry, I didn't catch that. So if you were to do that, would you have specific new things in mind, or would you just say, now that I have the experience, we can do it better? Is this an experience thing, or are there takeaway lessons? You're asking whether I'd go with the knowledge of the new stuff coming out, versus the experience of which way I would go. I would say at this point I like what we came up with, and I would run with that for a little while, because I am still a fan of Argo. I actually used it at my previous job as well. It gives some visibility and a little more flexibility, and you can see what's happening in your environment without having to manually break it apart.
And I do enjoy that. I was a bigger fan of GitLab, and I know GitHub is a sponsor here, but I'm now a fan of GitHub as well. So I appreciate those areas we got to roll through. One idea that was actually good, that we were trying to keep going, was ephemeral environments: being able to spin up an environment and test off to the side without having to do it manually every time. Over years of experience you pick up a lot of concepts where you go, oh, that'd be great, I'm gonna reuse that idea. And then the next time you go, ooh, no, never touching that thing again. So it's really just experience, trial and error, knowing what's going on for a particular company. They're all gonna be different. I'm not sure the microphone is working. How did I convince the business people to go along with it? Yes, that's the TL;DR. You caught that we could convince the developers to go along with it, but how do you convince the business people? I'll say that unhappy developers are very loud developers. So you'll end up with them telling the business people that it's the infrastructure causing their problems, that that's why their features are not making it to market, or why DevOps is a roadblock, and things like that. So if we could get the data out there, the assessment of the environments, the assessment of what it was going to take, then it was a lot easier to convince the business folks, because we were impacting their developers, impacting their new features to market; effectively, you're just taking away excuses. Once they got the concept that yes, this was a problem and it was impacting the business, then it was an easy go. As long as you can say what the money looks like and who it's gonna help, it's fairly easy. I won't say that's always the case; I've been in companies where you've gotta convince somebody that you need one extra server, please, just that one extra server, because we have that much more traffic. But in a reasonable company, you should be able to convince them with data and with squeaky wheels. This might need to be the last question. Hello? Yeah, I wanted to ask: what was the timeframe, in weeks or months, if you break it into three parts? The planning, where you were figuring out what to do and came up with the three options, how long was that? Then the second part, from when you presented it until the people with the money said go ahead and do the greenfield, how long was the convincing? And then the third: how long from that point to actually get it running in prod, everything up and running, the apps in the new greenfield product? Like, was it one month for the first, two months for the second, and twelve months for the third, something like that? Okay, I did answer a bit of that previously. The assessment, the matrix of the cost-benefit analysis, and which of the three plans we should choose: I mentioned this was urgent, so that was a pretty tight timeline. We needed to get all of that data together; I spent some late nights coming up with numbers and figuring out how much things cost so I could present it to the people who were gonna pay the bills. The second question, how long did it take for that decision to get made after the data was presented: I'll say a week. We had to make sure the correct people were around.
We had to make sure they understood what it was we were trying to sell them, and that went pretty quickly. In terms of building out the infrastructure, that one is still a work in progress. I mentioned the POC, which was click-ops, and the minimum viable platform, which was a work in progress and soon to be adopting production workloads. It's still not done. No, I sadly had to leave before it got to see its day in the sun. Can you just say, like, weeks or months? Is that a month, six months, twelve months before it will be up? What's your estimate from when they gave you the money until it will actually be working 100% with prod for everything? With prod for everything, I'm gonna say that's probably a year and a half. A year and a half? Yeah, because every application is different. Not even slightly similar. Yep, a long time. You still have some spaghetti to unravel, but at least you're unraveling it into a clean pot. And will the slides be available? Yep, thank you. Yeah, and figure we only had one and a half to two people working on it, because we still had to support the existing environment. That's always the challenge.

Can you hear me? Oh, yep, it's working. All right, cool. Check, check. Testing, one, two, three. Y'all hear me okay? Oh my God. Well, I'm gonna have to keep my voice down, because that's too loud. All right, is that okay? All right, cool. Either way is fine. Hello, this is Thomas, and I believe he's actually going to have a presentation that really answers what AWX is. It's like a software stack that you write in YAML that actually produces various stuff. And I actually don't know much about AWX myself; I hope you'll bring us up to speed, because it'll be enlightening for us all. Thank you very much. Yeah, yeah, it's basically a server for running your Ansible Playbooks, although it's a little bit more complex than that. And I believe, I could be wrong, but I think we've got about five more minutes, right? About four more minutes. Three more minutes. All right, so I generally try to allow for any stragglers, so we'll just hang out for a little bit. How's that? Anybody got any good jokes? Thank you. All right, cool. So let's give it just a couple of minutes; I wanna make sure we allow for any stragglers. And I know it's five o'clock and I'm one of two sessions between you and happy hour, so I recognize that, and I promise I'll try to get through this as quickly but completely as possible. Just out of curiosity, is there anybody from either the AWX or the AAP team here today? Bummer, because I've got a bone to pick. Oh, y'all are gonna hear all my pops; hang on a second, let me see if I can do this any less objectionably. There, there we go; now the plosives aren't destroying your ears. Yes, actually, with upstream, because upstream is driven by Red Hat, and they're moving in a direction that I understand but that frustrates the hell out of me. We'll talk more about it. And I made a whole bunch of animation changes in the last session while I was watching, so I guarantee you something is gonna go sideways, and I'm probably gonna get something flying in from the wrong direction or showing up that shouldn't show up. We'll just roll with it. Also, there's a gigabit switch in the lectern here that's blowing hot air on my feet and making an incredible amount of noise.
Since it's blowing towards me, y'all probably don't hear it, but it's pretty impressive. Okay, good. All right, we'll make the best of it. That's right, it's like a little foot massage. All right, I see it's five o'clock, so we'll go ahead and get started. First off, thank you very much for coming, folks, I appreciate it. My name is Thomas Cameron. I am the Red Hat Practice Lead at Sparksoft Corporation. We're a company in the DC area that serves the federal government, specifically around healthcare. If you're familiar with healthcare.gov, we are one of the contractors on healthcare.gov. We do all kinds of stuff with the Centers for Medicare and Medicaid Services, the National Institute on Aging, the National Institutes of Health, and so on. So even though I work with the federal government, we do good stuff around healthcare, so don't hold it against me. We're gonna be talking today about enterprise automation with AWX, and we'll talk about what that is in a little while. AWX is the upstream project from which Red Hat builds the product Ansible Automation Platform. I'm gonna talk about how to get familiar with AWX, because with AAP, there are trials available, but it's kind of a pain and they expire after 60 days. So I wanna show you how to get it up and running in your environment so that you can become familiar with Ansible Automation Platform, because it's a hugely valuable skill set to have. Lots of folks are moving towards Ansible for automation. In fact, to get your RHCE now, you have to... it irritates me, because they teach you nothing about the underlying technologies, like the NFS server, dhcpd, or any of that stuff, but boy, they will show you how to write Ansible Playbooks for automating those things, which I've always thought was a little weird. So, let's talk about what we're gonna be talking about. The agenda: we're gonna do a quick introduction; I'm gonna talk about what AWX is, the architecture, the installation, logging in, building your organizations, adding your users, setting up credentials, and setting up a simple project, just to test connectivity and make sure your machines are responding and your SSH credentials are set up correctly and all that kind of good stuff. So we'll set up the project, we'll ping, and we'll set up your inventory. Just like with Ansible, you know how you have to have an inventory file? Similar concept, but we're doing it server-based instead of per-playbook. We'll talk about how to set up SSH and your credentials so that you can log into the machines you manage. Then I'll talk about what a template is, and then I'll set up another project for the Apache web server, and another project for MariaDB, with the goal at the end of showing you what it looks like in the real world when you use AAP or AWX to run configs on your environment. So, time permitting, I'll do a live demo at the end. I think everything should be fine, because I spun up all my VMs, and it does take a little while to spin up your AWX VM: it's gotta spin up Kubernetes, it's gotta do all the networking and stuff like that, so it takes a bit of time. Now, just a little information about me, so that you understand why I'm talking about what I'm talking about. As I said, I'm Thomas, he/him pronouns. I have been doing this since 1993. How many folks are under 31 years old? Oh, God.
So, I've been doing this literally longer than some of y'all have been alive. I started out, and I'm dating myself, as a Novell certified network engineer, back in the days when you could run a server on a 386 or a 486. Then I went to work for Microsoft, don't hold it against me; during the Windows 95 launch, I became a Microsoft Certified Systems Engineer. Then I went to work for Red Hat; it was about 14 years there. I was a Red Hat Certified Architect, plus a whole bunch of specialty certifications around security and performance tuning and all that kind of good stuff. I got recruited by AWS, went there for about four years, and got all the certs from there as well: SA Associate, Security Specialty, SA Pro, and so on and so forth. And generally, by the time I get done telling people all the stuff I've done, the response is: you don't get out much, do you? Because I spend a lot of time doing nerd stuff, and that's cool, because it works for me with my ADHD. You hear people talking about the superpower of ADHD: you can hyper-focus on stuff, and there have been times when my wife has come in going, I'm getting up for the day, have you been to bed yet? So yeah, it's a good thing to be a nerd with ADHD. My contact information is up there; don't hesitate to reach out to me. If it's community stuff like what I'm doing here, which I absolutely love doing, my email address and my website are on there. If you wanna talk to me professionally, you've also got my work email address. Whatever makes sense, I'm available. So, as my daughter likes to say: dad, you woke up and chose violence. I'm gonna ask y'all some questions, get the blood flowing, get the juices going. How do you pronounce this? Raise your hand if you say "system cuddle." Okay, "system C-T-L." Okay, cool. How about this one? Same? Is it "cube cuddle"? No? Okay, so "cube C-T-L." All right, cool. What do you call this? Is it K-8-S, or is it "Kates"? You call it "Kates." Really? Okay, cool. And then finally, is it K-3-S, or is it "keys"? Keys, just keys. Okay, cool. I was just curious, right? Because I've seen some people say "cube cuddle" and a lot of people say "cube C-T-L," and I was curious if it was just me. Now finally, oh no, never mind, I don't even wanna bring that up. All right, cool. So let's move into the meat of the subject: what AWX is. AWX, as I said, is the upstream for Ansible Automation Platform, and this is literally from Red Hat's website, or the AWX website. It basically says, hey, this is where we put stuff out into the community. And this is one of the things that I love about Red Hat, and why I went to work for them: they really try to take an upstream-first perspective. They're not always successful at it, but I genuinely believe that management's heart is in the right place. So, just like Fedora is the upstream for Red Hat Enterprise Linux, AWX is the upstream for AAP. The UI you see with AWX is the same; I mean, there are a couple of subtle differences, but one of the things I love about using AWX is, like I said, you get the experience, you can play around with it, you can learn the foibles and stuff like that, and you're not having to pay for an expensive subscription or do the silly "oh, I gotta come up with a new email address for another 60-day evaluation." Not that we would ever do that, right? Right? Okay, all right.
So, the very high-level, 10,000-foot overview of the architecture: you could spend two weeks talking about this. In fact, if you take the 394 exam from Red Hat, it's something like 48 hours of training if you do the Red Hat online training. So there's a lot to it, but the 10,000-foot overview is: it's Kubernetes-based. There are a number of options. You can use K3s, or, what did you call it? "Keys"? Okay. So you can use K3s, you can use Minikube (but don't), you can even use OpenShift. There are a number of ways you can do it. Basically, there are containers for job management, containers for the AWX engine itself, containers for the UI, and even containers for Postgres. If you so desire, you can tell AWX to use an external database; that's totally cool as well, whatever makes sense (there's a sketch of that option below). But for what I wanna talk about here, which is basically just getting up and running and used to it, you can let it build the Postgres containers. And again, this is an advanced architectural diagram: you can have the automation hub, the EDA controller, the automation controller itself, and you can connect to external databases. It can get really big and really complex. But AWX is really designed more for getting familiar with the technology, so we're gonna take a really simple view in this session: we're just gonna do it on a single-node machine and play around with it. Now, there is a how-to in the AWX GitHub repo. It's terrible. Don't do it. Seriously, it's painful, and I'll talk about what I ran into. For single-node, I tried Fedora 39, I tried CentOS Stream 9. I did one successful installation using Minikube, and one successful installation using the Docker Compose developer build about three weeks ago, and I was like, this is gonna be great, man, because it worked all the way through. And then they revved the operator, and then they revved AWX, and it was awful. Everything broke, and I struggled to get a working setup. I literally did not get my final slides done until last night, because it was just failing and failing and failing, and I was seriously at the point where I thought I might have to cancel this talk because I couldn't get the demo to work. I finally settled on CentOS Stream 8, which seems to be the most stable build, and the K3s installation just worked. So that's what I did: I used kurokobo's K3s method. I had tried using the AWX Operator as described in the how-to, and Minikube would come up and then hang; I was looking at logs and couldn't figure out what was causing it. I fought with that, tried it on CentOS Stream 9, same thing. Then I tried the Docker Compose developer build, and it wouldn't build the containers; it threw errors about Postgres filesystems. I was starting to get a little panicky, and then, on the Matrix channel for AWX, somebody was like, dude, just use kurokobo's K3s build, it works. And I tried it on Fedora and it crashed, and I tried it on CentOS Stream 9 and it crashed, and I finally was like, oh my gosh, I'm cursed, this is never going to work. But it said in his documentation, "I have tested this on CentOS Stream 8," and so I was like, all right, I have tried all my best thinking and it didn't work.
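On the external-database option he mentions: with the AWX Operator, that's expressed on the AWX custom resource. A sketch, assuming the operator's documented spec fields; the secret name below is a stand-in.

```yaml
# Sketch of an AWX custom resource pointed at an external Postgres.
# The referenced Secret (stand-in name) would hold host, port,
# database, username, password, and sslmode.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  service_type: nodeport
  postgres_configuration_secret: awx-external-postgres
```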
So what I did was a super minimal installation of CentOS Stream 8. To give you an idea: if you look at my dnf group list, everything is labeled as available. It was a super minimal installation, nothing extra installed. The only things I installed were @base and @core. That was it, so only about 576 RPMs. Now, if you go look at the URL, and y'all can have this slide deck, I'll post it as a PDF on my website, but if you go look at the GitHub repo, you can dig down into the README, and it talks about using the /data directory, setting your passwords for both AWX and Postgres, et cetera, et cetera. I was like, okay, cool. And then I read down, and it said, "I tested it on CentOS Stream 8," and I was like, okay, message received, and I did it on CentOS Stream 8. Now, interestingly, it does use a slightly older version of K3s, just a few revs back, and it does use an older operator. Operator 2.13.0 dropped last week, and as soon as you shut down your cluster, it deleted the storage for the Postgres server; that was something I was fighting with. They did come out with 2.13.1, but even with that one, somebody else on the Matrix channel was saying they couldn't get their containers to build. So again, I came back to this documentation. He's like, 2.12 is the way to go, and so that's what I did. And it uses an old version of Postgres, but it works. I also looked at the requirements: two CPUs with four gigs of memory, 10 gigs for /var/lib/rancher, 10 gigs for /data. So I made sure those were both there. And again, on CentOS Stream, I made sure I had 24 gigs of memory, because I was like, no, I wanna make sure this thing's going to run and compile and all that kind of good stuff. It says you've gotta have four cores; I might have gone a little overboard: 16 cores for this VM. And I made sure /var/lib/rancher and /data both had, I think, 20 gigs of space, so I was good to go. Now, the deployment instructions tell you to do something that hurts my heart a lot: they tell you to disable firewalld. Anyone ever seen my "SELinux for Mere Mortals" presentation? Yeah, so, security guy: you want me to do what? But I did it. It also has you turn off some stuff, and then install git and curl. Again, I hate this advice; in the real world, I would make sure I set my firewall rules up, but for this demo, I went ahead and did systemctl disable firewalld, made sure those services were turned off as you would on cloud instances, and made sure I had git and curl installed. And the instructions said to reboot, and I was like, you know what, I am cursed, so I'm gonna do what the instructions say. I rebooted the machine, and it was just fine. Now, I totally understand why it installs this way, but I'm just curious: does anyone else hate mixing packaged software with stuff like Flatpaks and Snaps? Because yes, I really want three different distribution mechanisms to manage my 10,000 servers, right? I hate that. I'm looking at you, Snap and Flatpak and all those guys. But then it talks about installing K3s, and the way you do that is curl piped to bash. I hate that too, but again, we're doing this for a demo. And so, I ran the commands.
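For reference, that K3s step is roughly the following (a sketch based on the guide he's describing; flags can differ between versions):

```bash
# Install K3s via the official installer script -- the curl-pipe-to-bash
# he just mentioned. The flag keeps the kubeconfig readable by a
# non-root user (more on that in a moment).
curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode 644

# Confirm the single-node cluster is up before moving on.
kubectl get nodes
```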
One of the things you wanna remember is to tell the installer what the mode for the config file should be, because by default it's 600 and owned by root, so as a regular user you can't read the file and you can't talk to your cluster. So make sure you do that. It will then create a bunch of symlinks, including a killall script, and it generates the service file and does a systemctl enable. That's kinda nice: it means that when you reboot the machine, all your services come up and it actually works as expected. That was really cool. Once I ran the installer, it installed pretty quickly, and I was like, oh cool, it's up and running. I did a kubectl get nodes and, okay, yeah, it's up. But then I remembered: oh yeah, it's gotta spin up all the pods. You could see that for the first several minutes, probably 10 minutes on my machine, it was doing container creation. I used watch with a 10-second delay and just ran it until everything was either up or had completed as expected. So, okay, cool, it's up and running, we're in good shape. The next step is the AWX Operator, and the way you do that is a git clone, go to the directory, check out the correct version, and run some commands (the whole sequence is condensed below). So I created a git directory, went to that directory, did the git clone, then did the git checkout to 2.12.2. That worked just fine. Then the instructions say: now we just use kubectl apply -k operator. So I ran that command. One of the things I love about this guy's documentation is that it gives you the commands to run to check that everything's working, so you get to follow along and make sure things behave the way you expect. So I did kubectl apply -k operator, and you see all of these things that got created. The nodes were already up, but I did a get pods and you could see it doing the container builds, so you've gotta wait for a little while; you can watch the CPU utilization on your machine. After a few minutes, using watch again, all of my processes were done and my machine was up and running. And then it even shows you how to look at everything: you do kubectl -n awx get all, and it shows you your pods, your services, the deployment apps, all of this stuff, so you can watch and see that it's okay. Then you need to make some customizations for your site. And I love this about this repo: he's like, let's go ahead and use SSL on port 443. We're just gonna use a self-signed certificate, so you get the warning, but you don't have to go "oh, I'm gonna connect to port 30870" or something like that; you just go to the web server and it's good to go. So it has you set an environment variable, run your OpenSSL command, and then make some changes in the AWX YAML file. And I did that. First and foremost, though, I wanted to make sure DNS resolution was working the way I expected, and I found that I had done something silly. When you're using dnsmasq, when you define your IP addresses in your /etc/hosts file, whatever the first field after the IP address is, that's what it assigns as the hostname. And I had forgotten that. And so, I had a little shell script.
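Condensed, the operator sequence from a moment ago looks something like this, assuming the kurokobo awx-on-k3s repo layout (the tag matches the 2.12.2 checkout he describes):

```bash
# Clone the guide repo and pin the operator version that worked for him.
mkdir -p ~/git && cd ~/git
git clone https://github.com/kurokobo/awx-on-k3s.git
cd awx-on-k3s
git checkout 2.12.2

# Deploy the AWX Operator, then watch the awx namespace until the
# pods are Running or Completed.
kubectl apply -k operator
watch -n 10 kubectl -n awx get pods
```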
It was basically "for i in 1 through 254, do echo the IP address, then host$i, then host$i.virtual.lan," which is what I use for that network. And it screwed up my DNS resolution in my virtual network, because every machine was like, oh, my hostname is host123, instead of host123.virtual.lan. It did screw up something I was trying to demo, so just be aware of that. So I made sure DNS was working: forward, reverse, everything resolvable. Then I used the command from the docs and set AWX_HOST to my hostname. Once I made sure that was set, I ran the OpenSSL command to generate the self-signed certificate; it wrote out the cert and key files I could use for my web server. And yeah, I know I'm putting a key up on the slide; this is the key for my virtual network in KVM, so steal it, I don't care; unless you get physical access to my machine, whatever. Then it has you customize your awx.yaml file, and it's a pretty simple fix: you can see that the default is awx.example.com. I knew I was gonna be using host209.virtual.lan, so I just edited the file and made that change. Then it talks about changing the password, if you want to, in your kustomization.yaml file. And so I did; I went and edited it. That's the file under base/. You can see it had the default password, ansible123, in both spots; I just edited it and came up with a silly one, something like 20Sparksoft24!. And I did that in both places: one is the password for the Postgres database, the other is for the login. It talks you through making directories; we've all done that, right? I did want to show you, though, in the pv.yaml file, the persistent volume YAML file, you can see that it sets up the claim and allocates eight gigs for Postgres 13, and the other one is the persistent volume for /data/projects, which uses two gigs of that space. So I created those directories and set the ownership per the instructions. One thing I'll point out is that it says to do chown 1000. If your UID is not 1000, don't do that; use your own UID so that you own that directory. But I did that, I checked the permissions, everything was good. And then finally it says go ahead and do kubectl apply -k base, and I did. Then it lists some commands you can run to see the progress of your build, and I did that, and I'll show you what it looks like. So I ran kubectl apply -k base, and it started all of those containers, and I watched the log files, and, oh man, it took like 15 minutes. And I've got a fairly beefy laptop; it's fairly new, 12th-gen Intel I think, a pretty snappy machine, and it still took a while. So watch the log file, because nothing's more exciting than watching a log file scroll by. It also suggests some things you can do: I would watch kubectl get pods -A and see what was coming up and what was in the awx namespace, watch that for a while, come back to the log files because they were still scrolling by, open another shell, do a get pods and see there were a bunch that hadn't started up yet, or just run a watch on kubectl in the awx namespace and watch it. And after a few minutes, it finally converged, and you saw everything was up and running.
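Put together, the site customization he just walked through is roughly this (hostname, paths, and the log-follow command are per the kurokobo layout; treat the details as illustrative):

```bash
# Self-signed cert for the AWX web UI (hostname is his lab's).
AWX_HOST="host209.virtual.lan"
openssl req -x509 -nodes -days 3650 -newkey rsa:2048 \
  -out ./base/tls.crt -keyout ./base/tls.key \
  -subj "/CN=${AWX_HOST}/O=${AWX_HOST}" \
  -addext "subjectAltName = DNS:${AWX_HOST}"

# Persistent storage for Postgres and projects, owned by YOUR uid
# (his caveat: don't blindly chown to 1000 if that's not your uid).
sudo mkdir -p /data/postgres-13 /data/projects
sudo chown -R "$(id -u):$(id -g)" /data

# After editing base/awx.yaml (hostname) and base/kustomization.yaml
# (passwords), deploy AWX itself and follow the operator logs.
kubectl apply -k base
kubectl -n awx logs -f deployment/awx-operator-controller-manager
```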
I was able to do kubectl get pods -A, and I could see that initially everything was initializing; after about 10 minutes, it came over to where everything was running or had completed and exited. And once I was sure it was done, I could use kubectl -n awx get all and look at everything: ingress, secrets, all that kind of good stuff. And again, this just shows you that yes, everything came up. It'll tell you what port numbers are exposed, whether your deployment apps have completed. It was pretty cool; the instructions, like I said, were really good for this K3s deployment. Finally, I went back to the command where I was watching the log file, and this is the thing you care about: when you see the recap at the end, you just wanna make sure that "failed" is set to zero. It will skip a bunch of stuff based on what your architecture looks like; that's cool. As long as you don't have anything failed, then boom, the system is up and running. Now, whoo! Spent all this time figuring all this stuff out, and then we were able to log in and start using things, and that's kind of where the meat of it is. So I went to connect. Same as you've always seen with a self-signed certificate, it says, hey, this is not to be trusted. Okay, whatever; thank you, Google. Log in using the username and the password you defined; the login is admin. And then you get the dashboard, and you can start poking around in there. There's a lot of neat stuff, but we're gonna talk about the minimum necessary things you need to do to start playing around with the UI. The first thing I did was, I don't know if you can see it up here, I went to Help, About, and it says that it is 23.9.0. I went to the AWX repo, and sure enough, the last stable version was 23.9.0. So cool, I've got the latest version, good to go. Now, once you're logged in, what you need to do is start to set up what you want your environment to look like. You can set up your organizations first, and the way you do that is pretty straightforward: you just go over here under Access, go to Organizations, and there's a default organization. I do recommend that you not just use the default organization. It's kind of one of those best-practice things. You may be in an environment where you say, I know for sure we're never, ever gonna have another organization. Sure. What? Somebody squeaked. We're not all staring at you or anything, I promise. So, go ahead and define the organization; don't just use the default. I go in, it's real simple: click on Add, put in the name of the organization. I work for Sparksoft, so I made an organization called Sparksoft, gave it a nice, pretty description, and that's it. Boom, you've defined the organization. Now everything you do will be associated with that organization: all your user accounts and stuff like that. If at some point down the road you acquire a company, or you get acquired or something like that, then because you've already set up your organization, you're not as likely to run into the "oh no" moment of "I gotta burn it down and start over from scratch." It's not a huge deal; generally, you'll have to do a little bit of work, basically creating new inventories and new users and stuff like that in the new organization.
But beyond that, if you have a really super complex build, you're probably gonna have a bad day, because there's so much stuff that gets poked into the database, like your users and your credentials, that you're probably gonna have a tough time with it. So, if you can, just build your organization from the outset. Ooh, I don't think I've ever tried that, because I always just ignore the default and create a new one. I don't know; I'll play around with it. I'm gonna do a demo in a little while, we'll take a look. All right, now that you've defined the organizations, you've gotta create users. Pretty straightforward: go into the UI, go under Access, go to Users. There's the default admin user; just click on Add. You'll get a form that looks like this: you put in first name, last name, email address, the username, and you can define the password. And here is where you have the opportunity to set access levels, right? Do you wanna make a normal user, who has to be explicitly granted permissions over an org or a work group or something like that? Do you wanna make a system administrator, which means god-like privileges? Or you even have the ability to create a system auditor type of account, so someone can go in and take a look at what you've done, but can't make any changes. Set all that up, make sure you assign it to the correct organization, and life is good. Lather, rinse, and repeat for all the people in your organization, or in your IT team. Pretty straightforward. When you get done, you'll see the users in there. Life is good. Now, this one took a little while for me to wrap my head around. And this is, I don't know if I would say it's a UI bug, but it's kinda weird: when you go to Credentials, you have the ability to define credentials for outbound communications, like to GitHub or any of the registries you wanna attach to, but you also use Credentials for inbound communications to the systems that you manage. The first time, I was like, shouldn't those be two different things? But we're gonna go with it. So, in this case, I have a GitHub repo. If y'all look at my GitHub repo, understand this: I'm not a developer. My YAML code is ugly. I know it, you know it, everybody knows it. Please don't say, oh my god, dude, what were you thinking? What I'm gonna do, though, is set things up so that I can pull stuff from GitHub. Now, my repo is public, but a lot of people have private repos, and you can't just do an HTTPS connection to them; you have to use SSH. So, let me show you what that looks like. I go into my Credentials, click on Add, say what it's called, and put in a description; in this case, this is Thomas's GitHub repo. It could be my corporate repo, whatever. Assign it to the organization, choose the type; there's a huge dropdown with all kinds of different types, whether it's upstream repositories for containers or whatever, but in this case, I'm gonna say it's a source control credential. I put in the username I log in with, and I have to put my private key into AWX or AAP. Now, I recommend that you create a separate private key, not like anything you use anywhere else, just from a security standpoint, right? And if you have an SSH password, or a passphrase I should say, just put it in there.
Then, down at the bottom... no, that's it. So that's set up. We'll test it in a minute, but this is basically saying: if I wanna SSH into a private repo on GitHub or wherever, this is the private key it's gonna use. Then you upload the public key to GitHub; it's the public key, you can do anything you want with it. So, I've got that credential set up. We're gonna come back to credentials in a little while. Now, just to make sure that all of these moving parts aren't grinding any gears, I'm gonna set up a really simple Ansible ping. So I have to define a project; the project is what goes and fetches that content. I'm gonna find the SSH URL for my Ansible ping repository, then go create a new project: give it a name, give it a description, assign it to the organization. And I'm sorry this is small; I recorded at 1900 by 1280 or whatever it is, and I should have made my fonts larger, I apologize for that. Assign it to the organization, tell it what type of source control it is, in this case, Git. Here's the source control URL, which is actually just an SSH login, and I chose the source control credential of type GitHub. There are some checkboxes you can tick down here, like: do I wanna clean out the old version before I pull a new one, or do I wanna just delete the old version? There are a bunch of things you can do, but for the most part, unless you really know what you're doing, you can leave them blank. Just read the context-sensitive help; you can click the little link next to each one and it'll explain them. And when you complete that, it's really important: you'll see right here that it's running the sync, going and fetching that stuff from GitHub. You've gotta wait until that turns green and says it completed; otherwise, you have a problem with your authentication or something like that. After a couple of seconds, it turns from running to successful. When you see the green, you're good; you've pulled your content from GitHub. And there it shows up: yep, there's Thomas's Ansible ping project. The project is where you go get your content, your playbooks. Now, just like in Ansible, where you have an inventory file with your hostnames, and maybe you group them together into web servers or database servers, or dev, QA, and prod, we can do the same thing. But we don't have to keep an inventory file; we define the inventory inside of AAP, or in this case, AWX. So, which machines do I wanna run this ping test on? I'm gonna go to Inventories, over here under Resources, and basically fill out the inventory: the systems you want associated with whatever it is you're doing. I click on Add, choose Add Inventory, and give it a name. In this case, I said these are all of my WordPress servers: the backend database servers, the frontend web servers, whatever. I give it that, associate it with my organization, and that's really all you need to do. Then, once you've defined the inventory group, you come over here to Hosts and start adding hosts to that inventory, just like if you had edited your inventory file.
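For comparison, what he's building in the UI is equivalent to a plain Ansible inventory file. A sketch in YAML inventory format, using the two lab hosts he adds next (the group names are made up):

```yaml
# inventory.yaml (sketch): the flat-file equivalent of the UI inventory.
all:
  children:
    webservers:
      hosts:
        host183.virtual.lan:    # the web tier
    dbservers:
      hosts:
        host120.virtual.lan:    # the backend database
```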
So, I'm gonna say I've got host183, and then I'm gonna add host120, because 120 was my database server and 183 was my web server. I put both of those in there, and now I've got an inventory for my ping test, right? Once I've got the inventory defined, there's a break I wanna make. Y'all probably know this, but I wanna make sure it's really clear: you have to set up SSH. Remember that Ansible is gonna log in to your machines using SSH, so my recommendation is to use ssh-copy-id, or, what I do is write the authorized_keys file as part of the post-install process when I kickstart my machines. Normally, like I said, I build with kickstarts: I add that remote user, sysadmin or ansible or ansible-svc or whatever, and I generally create it with no password, so that no one can ever log in using a password, and I just make sure I've created the authorized_keys file in that service account's directory. The beautiful thing about that is the machine could be physically on the internet with no firewall rules, and no one's ever gonna be able to log in as that user, because you haven't assigned a password; you can only log in using the SSH key. But in this example, I just wanna go over what it looks like (it's condensed below). We do ssh-keygen, and I chose ECDSA. Look at the permissions on the files it creates: notice that the private key is always read/write for the owner only; the public key, we don't care, anybody can have the public key, that's not a big deal. Then I do ssh-copy-id with that ECDSA key over to whatever the machines are, and I did it on host120 and on host183, so I made sure those keys were there. And then we go back to Credentials. Remember I said earlier that you could do one set of credentials for logging into GitHub; now we need to tell AAP or AWX how it can log into your systems. So I go into Credentials again and add a credential. There's my private key; again, this is on a virtual machine that I'm gonna destroy after this demo is done. I take the private key, create a new credential, and I called it "sysadmin's SSH key," gave it a nice description, and associated it with my organization. The credential type in this case is a machine credential; in other words, I'm logging into an OS-based machine. In that dropdown, there are things like network devices, there are services like Infoblox, so there's a ton of options available there, but again, we only have an hour, so I just wanted to cover the nuts and bolts. To define the machine credential, I can also put in the sysadmin username and password; in this case, again, I don't use passwords, I only use SSH keys, but you can do that. I put my private key in there, and down at the bottom, you remember how in your ansible.cfg you have to set things like what the remote user is, whether you're gonna become, and if you're gonna become, how you're gonna become? Same type of deal here: we are gonna become, we're gonna use sudo, and we're gonna sudo to root. So, just like you're used to in your ansible.cfg, this is just putting it in a GUI, and you don't have to do a new inventory for every one of your projects. So that's done. Now I have my SSH key defined inside of AAP or AWX, and there it is.
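The SSH prep from a moment ago, condensed into commands (the key filename and the sysadmin user are stand-ins):

```bash
# Generate a dedicated ECDSA key pair for automation -- per his advice,
# not a key you use anywhere else. No passphrase here, for brevity.
ssh-keygen -t ecdsa -f ~/.ssh/awx_ecdsa -N ""

# Private key stays owner read/write only; the public half is public.
ls -l ~/.ssh/awx_ecdsa ~/.ssh/awx_ecdsa.pub

# Push the public key to each managed host.
ssh-copy-id -i ~/.ssh/awx_ecdsa.pub sysadmin@host120.virtual.lan
ssh-copy-id -i ~/.ssh/awx_ecdsa.pub sysadmin@host183.virtual.lan
```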
Now I've gotta set up my template. Templates are not playbooks, except they kind of are: really, the playbook that you've checked into GitHub or wherever gets downloaded and put into your library on the AWX or AAP machine. Just remember, when you create your playbooks, don't do what I did: I uploaded some playbooks that I was only running against my database servers, so for the hosts line I had db, and I tried to run it here, and it was like, I don't know what that means, and it failed. So always set them up with hosts: all (there's a sketch of a minimal ping playbook below). And you can get super crazy with your playbooks, really sophisticated; you can use Ansible roles, there's a bunch of cool stuff. But again, because we only have an hour, I'm gonna keep this really, really simple. I go into my Templates and say I'm gonna add a job template: give it a name, give it a description. The job type in this case is just a run. I can associate it with any of my inventory groups; in this case, I chose all of my WordPress servers, all two of them, and associated it with the project I synchronized from GitHub. And I don't actually have to define an execution environment; the default execution environment it sets up is usually fine. If you go out of your way to define other execution environments, which unfortunately is out of scope for our one-hour session, you'd define them there, but you can leave it blank. Now, something that's really cool about setting up a template: you get a dropdown for the YAML file you wanna use. I'm gonna come back to that and show you something I think is kind of cool, but right now the default is the ping.yaml file I checked into GitHub. I can associate that sysadmin SSH key under Credentials. Then, down towards the bottom, even though we're just doing a ping, because I just wanna make sure Ansible is working on the target machines, you can do things like: do I wanna do privilege escalation for this? In other words, do I want to become root? I checked the box and said yes, because I wanna test not just that it's pinging, but that privilege escalation works, all that kind of good stuff. Do I wanna enable concurrent jobs? Abso-freaking-lutely yes, especially if I'm running this across a thousand machines. So I save that, and now it comes back and says, okay, it's using this playbook, ping.yaml, and it gives me the option to either edit the template or run it. And hey, it's time to test. I click on Launch, and it runs just like if you ran ansible-playbook or ansible-navigator from the command line; this oughta look really familiar. Basically, it runs the SSH session to the target machines, you can see it gather facts, and it says the event processing is complete. So at this point, we know our SSH key is good, we know we've pulled the YAML file; life is good. And this is just showing that when you get done with one of these jobs, it shows up under Jobs, so you can go back and see it. This is a great tool for when you did something wrong or fat-fingered something: it'll show you what the Ansible run output was. So there's that. Now let's get a little bit more complicated.
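A minimal version of the ping playbook he describes might look like this (a sketch; his actual ping.yaml lives in his repo). Note hosts: all, per his warning, so the job template's inventory does the targeting:

```yaml
# ping.yaml (sketch): verify SSH connectivity and privilege escalation.
---
- name: Connectivity test
  hosts: all          # the AWX inventory decides what this runs against
  become: true        # exercise sudo-to-root, as in his template
  tasks:
    - name: Ansible ping (module round-trip, not ICMP)
      ansible.builtin.ping:
```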
Okay, so yes, it is actually running in a container in your Kubernetes cluster, and the reason I know that is it confused me, because I'm not the sharpest tool in the shed, and I'm the first one to admit that. Let me go back; it's actually on a much earlier slide. The reason I know it's running in a container is that when you define the project, and the project is what goes out and actually fetches the playbooks, it says this is being stored under /var/lib/awx. I was like, ooh, I wanna see what that looks like. So I logged into my AWX machine and looked, and there was no directory called /var/lib/awx, because that path lives inside one of the containers running in the Kubernetes cluster. What you'd have to do is use kubectl to get into the container. So yes, it is definitely running in a container. Does that answer your question? Okay. What's that? Okay, let me finish up, and then we'll do questions at the end. All right, cool. So we did that, we did that, we did that. This is just the same thing, except now my playbook is a little more complex. Let me show you what that looks like, and y'all, don't bust my chops for my crappy YAML. I went to my repository, github.com/tdcam, and there is my apache.yaml playbook. I went to create a new project, because remember, the project is what fetches the content. It's Thomas's Apache installation, my organization, the source control type is Git, and there's the URL, which is actually an SSH login. I click on that, it synchronizes, and now I know I've got my playbooks. Then I go over to inventories and create a new inventory; I say this is for the web servers, and when I fill that out, I go over to hosts and add my host, in this case host183.virtual.lan, a VM on my Linux laptop. Once that's done, life is good. I go over to the template, and again, the template's not really a playbook, but it kinda is, because it uses the playbook that got fetched. I define it, and I'm not gonna go through all of this because you've already seen it, but it's a run job. It's on the web servers, and it's using the playbook that was pulled down. I make sure I choose privilege escalation and concurrent jobs, and sometimes I'll also enable fact storage, because every time Ansible makes a connection it gathers facts about the machines. Do I wanna keep those facts on my AWX server? More data is never a bad thing, so yeah, I do. So I create it, and once I've created the template, I can click on it, and this is what it looks like. I go to the template, click on launch, it logs into my machines, and what I've set up is a dnf task: I install httpd and mod_ssl, and I think WordPress, make sure those are installed, make sure I start the service, and open ports 80 and 443. At the end, nothing failed, so now my web server, my web tier, is up and running, and sure enough, I can go to host 183 and yep, it was successfully deployed. As part of the playbook, I actually created an index.html that says "successfully deployed".
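Reduced to plain shell, the apache.yaml playbook described here does roughly the following on each web host. The real playbook lives in the speaker's repo; the package names and ports come from the talk, the rest is illustrative.

```bash
# Install the web tier.
sudo dnf -y install httpd mod_ssl

# Start the service now and enable it at boot.
sudo systemctl enable --now httpd

# Open HTTP and HTTPS in the firewall.
sudo firewall-cmd --permanent --add-service=http --add-service=https
sudo firewall-cmd --reload

# Drop in the marker page shown in the demo.
echo 'successfully deployed' | sudo tee /var/www/html/index.html
```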
Now, this is something I do because I'm kind of a geek and I don't like leaving messes. I updated my Git repo so that alongside the Apache install YAML file, I have an Apache remove YAML file. I'm not gonna bore you with it, but basically it goes through and removes httpd and mod_ssl and closes the firewall ports. I can create another template for that, or, I'm sorry, I can go to the existing template, and remember how I talked about that dropdown? I've got my apache.yaml, which installs everything, and then I've got my remove-apache.yaml, which does a dnf autoremove, so it's not just the package but all the dependencies and stuff like that. I created that and ran it, so I can back out what I've done and the machine is exactly like it was before I installed. Now, y'all know that's a pets approach, and we're supposed to be dealing with cattle, but I just thought it was kind of fun. And after it's done, the web server's not there anymore, so hitting it fails. Same thing now for the backend server. I'm gonna go through this super quickly because you've seen it: I get my repo, create the project, make sure it's synchronized, then associate the project with the inventory, which is just my database server. Again, y'all have been through this, so I'm not gonna spend a lot of time on it: define the inventory group, add the machine to it, which is host 120, and once the inventory is defined, I can go and create the template, which again is just another playbook I checked into Git. I go in, define it, associate it with the database servers; right now there's only one, but I could adjust that inventory to be 10 or whatever. Check the box again to make sure we're doing privilege escalation, and once that happens, I run it. It goes out, installs the MariaDB server, starts the service, opens port 3306, and even runs a little shell script that reads a SQL file to create the WordPress database and the database user, all that kind of good stuff. And that brings us to what I'd call the minimum you need to understand to get familiar with AWX. As I said, if you do the Red Hat online learning for this, I think it's like a 48-hour class, so clearly I can't cover anywhere near all of that here; I just wanted to give you a resource so that if you wanna play around with this, you can, okay? So let's do this: what does this look like in real life? You gotta love live demos. So is this the real life, or is it just fantasy? Let's do this. Let me bring up my web browser, and this is gonna be a challenge because I'm gonna have to get my web browser onto your screen. Hang on just a second. Actually, let me see if I can just mirror this; give me a second, y'all. So let's go to settings, then display, and change this so I'm mirroring. This may get ugly, stand by. All right, can you see? Yes you can, all right, cool.
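The database-tier playbook amounts to something like the sketch below. The wp-admin user and the web host it connects from appear later in the demo output; the database name and password are placeholders.

```bash
# Install and start the database tier.
sudo dnf -y install mariadb-server
sudo systemctl enable --now mariadb

# Open the MariaDB port.
sudo firewall-cmd --permanent --add-port=3306/tcp
sudo firewall-cmd --reload

# The demo's SQL file creates the WordPress database and user.
sudo mysql <<'SQL'
CREATE DATABASE IF NOT EXISTS wordpress;
CREATE USER IF NOT EXISTS 'wp-admin'@'host183.virtual.lan' IDENTIFIED BY 'changeme';
GRANT ALL PRIVILEGES ON wordpress.* TO 'wp-admin'@'host183.virtual.lan';
FLUSH PRIVILEGES;
SQL
```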
So, just as a kind of what's-going-on, I'm gonna fire up a couple of shells. Give me just a second; let me make this a little larger so y'all can see it. There we go. I'm gonna do two of them: SSH to host 120 and SSH to host 183. I'm gonna pop this over to another desktop, and we'll come back to it, because this'll make sense in a second. On that other desktop, I'm just going to run a loop: while true, do ps ax. I'm gonna leave those running, because we'll pop back over to this in just a second. All right, so we're just doing a process listing over here. I pop over to here, log into my AWX instance, and like I said, let me make this a little larger so you can see it. Can y'all see that okay? Okay, cool. So again, I go into my templates. I have set up my inventories; this is my inventory for the database servers and the web servers. I've created projects for the Apache installation and the MariaDB installation, and I've created templates for each one of those. So let's say we want to install MariaDB. Now, I only have this defined on one machine, but again, you can run this on as many machines as your system can support. So I go into MariaDB, scroll down a little, and this is how difficult this is: I fire it up, pop over to host 120, and you'll see in just a second, because it's just doing a loop of all the commands. Should... hold on, I love live demos. Hang on just a second, that is obviously not happening. Let me cancel that. Let me make sure I didn't screw up my project, and let me make sure I have network connectivity. So, the MariaDB installation: let me sync this and make sure I've got connectivity. Okay, that worked, let me try this again. Templates, install MariaDB, launch... you gotta be kidding me. What's that? Oh, there it goes, okay, it took a second. This is why live demos are always like this. All right, cool. So I'm gonna pop over to here, and ah, there you go. You can see you've got a bunch of processes running in here, a bunch of stuff gets fired up, and after a few seconds... hold on, where is... oh, I saw it, there we go: MariaDB is running on that machine. And on that machine, because I did run a SQL script on it, let me do this: I'll do mysql, select star from mysql.user, and ooh, look at that. There is the user, right there. So I defined the correct user from the correct machine, wp-admin from that machine, and I can go back over to the UI and see that it completed successfully, no errors, nothing like that. So if I go over to my web browser and go to host... no wait, never mind, that's the database server, there's no web server on there. Let me do the other one then. Now that we've got MariaDB running, I can go back over here to my template, do the install Apache job, click on launch, hop back over to this machine, and hopefully... there we go. See how it just changed? You can see a whole bunch of jobs being run on the back end. In just a second it'll have the web server and the PHP service running, and hopefully... yep, there we go. So I've got php-fpm and I've got Apache httpd, and hopefully if I come back over here and look at my output... boom, there we go, no errors there.
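The ad-hoc checks used while the jobs ran look roughly like this; the sleep is an addition so the loop doesn't flood the terminal.

```bash
# On each target host: a crude live view of the processes the Ansible run spawns.
while true; do ps ax; sleep 1; done

# On the database host afterwards: confirm the SQL script created the user.
sudo mysql -e 'SELECT user, host FROM mysql.user;'
```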
So if I go to host120.virtual.lan... oh, come on... oh, I'm sorry, it's 183; 120 is the database server, sorry guys. Let's do the right thing: host183.virtual.lan, there we go. So now the web server's up and working, and if I go to WordPress, wp-admin, check it out: it's already got the database connection to the machine on the back end, because I deployed the wp-config.php file with the right database connectors using Ansible. So you just saw me set up a database server and a front-end web server. This could have been load balancers; it could have been anything, right? The caveat I have is: when you deploy your machines, whether they're VMs on VMware or on KVM or on EC2 or whatever, just make sure you've set up your SSH key. The beautiful thing is, once you've got that SSH key, obviously you still have to be smart about setting up your firewall rules so that no one can SSH into your environment from the public internet except from your machine. So, oh my gosh, I'm over, guys, I apologize, I'm way over. This would be Q&A, but I've got people over there who are kind of glaring at me. So, was this helpful? Did this walk y'all through it okay? Because I tell y'all what, I beat my head against the wall forever to make this work. Hopefully this is helpful; I apologize, I thought I still had about ten minutes left. So again, if you have any questions, catch me afterwards. My contact information is up there; you are welcome to reach out to me. I don't claim to be the best playbook author, but on the infrastructure side, this stuff I know relatively well. So thank you very much, I appreciate it, and have a great rest of your weekend. Yes, I already created the PDF, and I need to give it to SCALE and also post it on my website, camberntech.com; that'll be within the next hour or so. [The next speaker gets set up.] Yeah, test, test... all right. I didn't think this would be the hardest bit. Does it look stupid? Okay, cool. Good, as long as I'm nice and crisp and fresh for ya. Yeah, these are my buddies, so yes. And that would be with the microphone; I'm assuming I'm not gonna have the headset. Okay, perfect. Oh, I love a dim setting. Looks like Ryan is coming to investigate. Everyone can see these slides all right? Great. I'll just put this right here; I don't have a belt or anything. I think I'm properly hooked up. Yep, okay, now I'll stop fussing with this one. One more fuss, okay. All right, how am I sounding? Everyone hear? Oh yeah, great. I also naturally project, but I'll give the mic some love. All right, why don't we get started, y'all? Hey, so, you're at this talk. Let me ask you this: who does this for a job?
Yes, show of hands, that's what I want. So y'all work with production systems in your job. May I have some job titles, please? Just shout out anybody's job title, the cooler the better. DBA? DBA, I mean, DBA. Systems engineer? Yes, systems engineer. Any performance folks? Any network engineers in the house? No? I mean, mad respect for network engineers. Anyway, glad to have y'all. It sounds like we all have a pretty good idea of, oh, next slide, yeah, a pretty good idea of what production is, so I'm gonna skip that part, and the difference between production and your test, your home, your dev, your lab system, your punching bag. So when you have an issue at home, what do you do? Well, you could restart things; you could restart the whole node if you wanted. You could just putter; it works sometimes. We all need to be cleaning our cases more often, you know? "I don't know what's causing the problem," as you do. Or maybe it is just time for an upgrade anyway. But production? Our job, our data, our livelihood? No, no, no, can't, can't, can't. Wrong. That's gonna get you fired. It would definitely get me fired, because this is not any of my stuff; this is other people's stuff. So: I'm a systems engineer at LinBit. I work with high-availability clients, so they have clusters of highly available storage that they need to fail over. This isn't a talk about DRBD or LinBit, by the way; that's just context for what I do, and it informs my need for knowledge on performance and how to troubleshoot and diagnose these issues, so that I can help out the people who have these systems. They also have their own clients that they make their infrastructure available to, so by the time I'm getting a support request from them, there's some definite zeal for that issue to be resolved. Oh, I guess I did put that up, yeah, so you'd want to know what an HA cluster is; but see me in the booth tomorrow if you want to know more about that. All right, so do you ever look at output like this and think: these are numbers, but are they good? I mean, is that all right? I'm breathing. I'm gonna go way over here because I'm pretty good at projecting. Yeah, I mean, is five good? It's unfortunately a complicated question, because it depends on your hardware, it depends on your network, it depends on your workload itself, your application, your SLAs. There isn't one good answer, so you can't just go Google "is this a good number?" It's too hard a question, but I'm hoping I can give you some direction today towards "is this gonna be good?", or at least "how can I make it less bad?" All right, what do I have next? Ah, yes: I'm going to give us a few rough concepts and methods for how we're going to do this, and then I'm going to dive into six live terminal demos that use those concepts, and we'll go through them. They're imaginary scenarios, but they're based on true stories, things I experience day-to-day as I support HA clients at LinBit. So hopefully we can learn along the way.
Oh yeah, the slide seemed a little empty, so I just needed to put a little graphic there. All right, so: overhead. Overhead is what you have when you've maybe run top or something like that, and it's costing you processing power as well: the performance impact of actually gathering the metric. That can be difficult, because you can't just stop a production system. There are ways you can make it better or make it worse; I can't just come up in here and run a bunch of straces and be like, yeah, I'm gonna get some data. It's not gonna be good. I just wanted some examples, so here's top; one of the slides needed to look good and I circled it. One of the approaches we're going to rely on most for diagnostics on a production system falls under observability, which is observing the systems as they are: you're not simulating anything, you're not applying any test workloads. All those metrics are intrinsic to the system; they exist and you can just peel them out of there. This can also apply to static configuration. If you're just catting the pressure statistics for the CPU, that would be an example; yeah, pressure stall information. All right, but sometimes that isn't enough, and we actually have to do some benchmarking. Microbenchmarking is going to be a lot better than macrobenchmarking: it's taking a component of the system, something within the data path but not the entire application workload, and simulating it. An example would be running some packets across the network. It's going to have lower overhead than simulating whatever your application might be in its entirety and saying, "I don't know what's causing it, but I'm just going to apply that application workload again and again." That's not going to get you any closer to diagnosing the specific thing you need to improve to get rid of the bottleneck. Macrobenchmarking definitely has its place, and sometimes you need to do a mix of the two, but I'd say: if you can get some observability metrics, then go for some targeted microbenchmarking. And yeah, you might still need to do some macrobenchmarking. Ping is technically an example of a microbenchmark, a little lightweight one. And yes, it is a simulation; macrobenchmarking wouldn't be where it's just running in prod and you're observing it. In this case you are applying a workload yourself and then checking how the system performs under it. All right. Well, I wanted to demonstrate a little bit of each one. To do that, sometimes you're going to have to characterize the workload you have, because not all IOPS are created equal. You need to know how your application impacts the system, the qualities it has. This is especially nuanced when it comes to storage; there are so many aspects to storage, which I'm excited to get into later, and they produce immensely different results on the system. Workloads might look similar, but there are a lot of different qualities to them that matter.
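As a concrete example of zero-setup observability, and assuming a reasonably recent kernel with PSI enabled, the pressure stall files can be read directly:

```bash
# Pressure stall information: how long tasks stalled waiting on each resource.
# Pure observation -- no test workload applied, negligible overhead.
cat /proc/pressure/cpu
cat /proc/pressure/memory
cat /proc/pressure/io
```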
Once you can break that down, you can do more effective microbenchmarks that actually reflect the workload you intend to put on the system. All right. I just needed this application to be a little more explicit so that I could decide that I needed that block size, and this one happens to tell me that it writes 512-byte sequential blocks to the disk, so I'm doing a dd with a block size of 512, writing to the disk. Performance tuning is what we'd want to do last with a production system. It's what you'd expect: changing things and seeing if that makes it better. That's something I'd probably do right away at my home lab, but here I'm going to really narrow in on what I'm trying to do. Not only are you impacting your day-to-day money machine, you're probably going to get different results and be confused about the issue if you don't target it first, if you don't have a diagnosis first, a specific component you're zeroing in on. If you just change random things, you're going to confuse everybody. Yeah, I guess I used renice as an example there. So, anyone familiar with the USE method? No? Well, I'm about to tell you what it is. There are a lot of methodologies you can use to diagnose performance bottlenecks. The one used most commonly at LinBit, and my favorite, the one I'm going to zero in on today, is the USE method, developed by Brendan Gregg. He was a performance engineer at Netflix, and he has written some really interesting things. Honestly, the book is great. He also has another book he recently came out with on eBPF that's really exciting; I don't have time to get into eBPF today, but if you want a deeper dive on this, I'd definitely recommend it. He also has some really good utilities on his website. If you just need good checklists, they're super handy to have, and they'll save you a lot of work: in case you wanted to make a checklist, just take what he's already done, it's on his website. USE is utilization, saturation, errors, and I'm going to go into what those all mean right now. All right, well, I guess I'm going to go into what a functional block diagram is first. For every resource on the system, you want to know what you iterate over for each check. This is the one we would use; it's from the DRBD user guide. DRBD is a piece of software that LinBit supports, and this is an example of a functional block diagram: it's not a picture of the system, but rather a picture of the relationships and the components inherent to it. These have drivers on them, but I'm looking at the physical hard disk here; you have the file system, the caches, and then the service layer, so you can tell where everything interacts. If you don't have this for a system, I recommend you draw one just to get started, or ask your client if they have relationship diagrams like that, so you have somewhere to start. This is how you determine the resources you're going to apply the USE method to. Continue.
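A hedged sketch of that matched microbenchmark: the characterized application writes 512-byte sequential blocks, so the dd mirrors it. The target path and count are illustrative, and oflag=direct is an addition to keep the page cache from flattering the numbers.

```bash
# Sequential 512-byte writes, matching the characterized workload.
# Point it at a scratch file, never at a raw production device; direct I/O
# requires the filesystem/device to support 512-byte-aligned writes.
dd if=/dev/zero of=/mnt/scratch/ddtest bs=512 count=100000 \
   oflag=direct status=progress
```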
All right, so utilization, as defined within the USE method, is the percentage of time that a resource, a component like we saw in the functional block diagram, is performing work. Say we have disks and they're totally idle: they're 0% utilized. We have some disks that are churning, spinning 100% of the time, never inactive, never at rest: that's 100% utilized. But I wanna be clear about utilization in this context, because the definition can be a little nuanced: a fully 100%-utilized resource can still potentially accept more work. For some context: say you have a CPU and it's waiting on something; it could get a higher-priority process that comes in, and it's able to move things around. Because it can re-queue things, it could be doing something 100% of the time and still not necessarily be saturated. Just something to keep in mind. Saturation is related, but that is when a resource has more work than it's able to process; it's basically how much work it can't get through. Something to note about the interaction between saturation and utilization: say you have a restaurant that's slow most of the time, but then has a real banging lunch hour where people are waiting outside. You can have overall low utilization and still get peaks. That's just something to consider when you see some scenarios later: those peaks are important because they cascade through the system and cause saturation, with everything else waiting on them, so keep in mind that bursts can make a big impact even if you have overall low utilization. The E of it is a little easier; I don't really need to explain errors, I think we've all encountered one or two in our day. One of the quickest ways to diagnose what's wrong with a system is to ask: what is it telling us? So ask it; look in the logs. Brendan quantifies these as the number of errors. I don't necessarily do that, because I can have a bunch of errors, but what those errors are can often be more important than how chatty the logs are, since you can sort of toggle that. All right. Ah, yes. Now we're getting to some fun stuff: demo number one. Oh, it's getting all dark now, because my terminal is all dark. Great. Okay, so I'm going to set up a hypothetical for you. Yes, probably: Control plus? Oh yeah, that's right, I've got to find the plus sign. I have never done that before. Excellent, thank you. I would not have noticed that, because I can see it just fine. Thank you so much. And if you can't hear me or anything like that, feel free to shout it out, because I'm not going to notice. All right. So this is a hypothetical scenario. Let's say there is a client that I have, and they are running a DRBD cluster, just a few nodes. They've configured the instance, and they've said they've recently seen some performance issues and they believe DRBD is the cause; they would like me to investigate. So first, first... oh yes, yes. Mm-hmm. Ha, ha.
Don't look at this one; I didn't want to run it early because I was afraid I was going to tank everything. All right. So I have my CPU-one terminal up. Whoops, I did not have the other CPU one up. Control plus, right? How's that? Oh, a couple more times. If y'all don't know about Vagrant, by the way, it's super nice if you need to just test some things. We use it all the time at LinBit: you get Vagrant stacks that you set up in advance, then you can use Ansible playbooks on them, so you can quickly configure some VMs for whatever use and abuse you have. Let me see: vagrant ssh into CPU-zero, and I'm in. All right, so let's take a look at what could be happening. Well, don't look at the obvious thing; what I want to point out here is some things about load average. That load average number: is it a good number? It really depends on what your system is. How many cores does your system have? And how do you find out how many cores your system has? Let me see what I've got in here; I've been keeping it in the history. I saw this in the Ceph demo, and I kind of like it, where they just have all of the command history and paste it in for themselves. You know the demos never go right. Yeah, I'm getting to that, I'm getting to it. I just put the command history in so that I don't forget anything. So if we wanted to use nproc, we can. Okay: eight. The load average we saw was kind of high for eight cores, because my stress-ng is being particularly performant right now, which it wasn't earlier, but that's okay. Let's say we wanted to use lstopo. Oh, okay, I didn't install it, but maybe I should install it now; I think I need the hwloc GUI package. Yeah, sure. This is nice because if you have systems with NUMA nodes, lstopo can show you a diagram of them in some cases; otherwise, it'll look like this. It's good to look at the relationships between things and make sense of the system. If you aren't in one system all the time and you look at a lot of systems, it's just really nice to get your head on straight about what the numbers mean and what you're seeing when you dive in. How many cores does it have? What is it? What are we doing with it? Keep that in mind. What else have we got? Oh yeah, you can cat /proc/cpuinfo if you so desire. What do we get for that? A whole bunch, but these are examples of how you can take a look at things. Let's go back to top. Let's see if this has been distributed evenly; maybe we can't really tell from this. Oh man, it's really tanking this guy. Anyway, we wanna see how this breaks down by processor, so I wanna take a look at mpstat; you need the capital P flag, -P ALL. And then we can take a look at... so what we should have seen... my test might have just completed. What we should see: let's do a pidstat. Don't worry, we've got five more of these. Notice all of these are pinned to what one might consider the fourth CPU, which is labeled three here. So let's imagine this isn't stress-ng; let's imagine it's somebody clever, and it's their application on here.
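Collected in one place, the sizing-up commands from this demo; all standard tools, with lstopo coming from the hwloc package.

```bash
nproc                              # logical CPU count
lstopo                             # topology diagram (hwloc); shows NUMA layout
grep -c ^processor /proc/cpuinfo   # same count, straight from the kernel
mpstat -P ALL 1                    # per-CPU utilization, refreshed every second
pidstat 1                          # which PID is churning, and on which CPU
```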
And then DRBD: usually it's pretty smart, and if you don't do anything to it, it's going to pick the CPU it thinks will have the least impact on the system; it doesn't want to mess anything up for you. However, you can decide you know better than it does and say to it: I want a CPU mask of one processor, or a certain subset of the many processors you have. With eight you might be less confused, but say you have 64 of them. That person might have set the mask wrong: they set a set of CPUs that are associated with particular NUMA nodes. Did I tell you what NUMA is? If anyone doesn't know, that's non-uniform memory access, and really simply, oversimplified: a memory module is physically closer to a particular processor, so it's more efficient, just because they're physically closer together. So you'd want to pin things to particular parts. You might think, okay, I want to do this because I want to make the system more efficient based on what I know about it, and in this hypothetical, this person made a mistake. The lesson: you can have an overall load average that is low, but you want to look at that per-processor breakdown to see if there's a particular one with odd use. That's usually not going to happen, but keep those tools in mind to see what you've got. pidstat is a great one, and mpstat is one I like for looking at individual processors and what application is churning on each one. And if you intentionally do things with CPU pinning, just keep that in mind. All right, I'm done with you, CPU. I don't want you doing this anymore. All right. Do I have my history in here? Oh, I need to Control-plus all of these. And: number two, just so we know we're moving on; we've had enough of that. Let's see, cool. So here's our second hypothetical scenario. We have a client; this isn't our machine, this is someone else's. They might be in a different time zone; they might be halfway across the world. This is inconvenient for us, because they write us emails saying the system is slow, the system is not working, and then you shell in during your waking hours and everything is fine for you. What do you do? In this case, they said the system is being intermittently slow: replication speeds for DRBD are dropping at particular points. But what do we do if we take a look at, I don't know, top, and it looks fine? Well, let's set up sar. sar is the system activity reporter. Let me make sure I have it in my history here; I actually think I uninstalled it from this system just so that I could install it. All right, so it's part of the sysstat package, at least on Debian-based systems in this case. You want to make sure you enable it as well: you've installed it, but we do need to enable it. So, let's see where we go to enable it. I think I put it in here as well. Ah, yes, the sysstat defaults file. Okay, right now it's false. I think you all have an idea of what's gonna happen with this: set it to true. You need to do that, at least for Ubuntu. I believe you also need to enable the service, I'm not sure, but we're gonna do it to make sure, since we want historical data.
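On Debian/Ubuntu, the setup just described looks roughly like this, assuming the stock file layout:

```bash
# Install the sysstat package, which provides sar.
sudo apt-get install -y sysstat

# The Debian default ships disabled; flip it so the collector actually runs.
sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat

# Start collection now and keep it across reboots.
sudo systemctl enable --now sysstat
```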
Okay, now we need to restart it. This is the initial setup; if you already have this running, this is just for your first time, and then you have it. Okay, now all we need to do is wait a day, right? You've got some time. Nah, okay, like a cooking show: I've already done it on this one. Oh yeah, and I forgot to say how it's configured, because I didn't even mention that. It's just like cron: if anyone has set up a cron job, it works identically, at least on Debian-based distros. So we go into /etc, and then cron.d, and we've got sysstat in here. It has already dropped a file in; we just installed this, and it already has that file in there. Oh, that's why I didn't put it in my command history: it already has it the way I want it. But I want you to look at it anyway, because it looks just like you'd expect a cron table to. For anyone that needs a refresher, the fields are minute, hour, day of month, month, and day of the week, and this entry runs at minutes 5 through 55, every 10 minutes, all the time. It also rotates the logs nicely for you, so you don't have to remember. Since I don't know more about the particular behavior, I just came into the system, somebody said it's intermittently slow, and they didn't give me a capture of when to focus on, a sample of five minutes' activity every 10 minutes sounds good to me; maybe we can see something from that. So let's keep it at that. And let's see. All right, where did I put that? Yeah, and as much as I already had that prepared, I'd actually love you to see what /var/log/sysstat holds, because that's where everything is kept. You can change how many days are kept. I have however many days since I set it up, and you reference them by the day of the month, in this case. So what I'm doing here is running a particular sar command, in this case a network one, and passing the output to a file. Refer to the man page for more detail, but in the general sense, this is the 12th day of the month, because that's when I set up this VM, and the daily file has all of that information in it. And this is the output I get back; it's real nice, because you can just scroll around and see the 24 hours of whatever you set up. All right, so let's see if we see anything going on here. This p.m. looks fine, looks fine, looks fine. Was there a question? Is there a huge jump? Yeah, are we still going up? Down, the other direction. There was a point in time when I did shut off this VM, so that might be the huge jump you're seeing. Yeah, 4:20, and then 4:20 over here. Oh, the huge jump; I thought you meant there was a skip in the logs, because that would have been obvious, I shut off the VM. So, in this case it does look like there was some pretty heavy utilization during this time. This wasn't part of what I was doing for the demo; I was rerunning some tests at that point, and that's why it showed up.
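The schedule and the daily files, roughly as shown in the demo (Debian layout; the day number is whatever day of the month you're after):

```bash
# The packaged schedule: a plain cron table in /etc/cron.d/sysstat.
# Fields are minute, hour, day of month, month, day of week; the stock
# entry samples at minutes 5-55, every 10 minutes.
cat /etc/cron.d/sysstat

# Daily binary files accumulate under /var/log/sysstat, named by day of month.
ls /var/log/sysstat

# Pull the network statistics for the 12th out into a readable report.
sar -n DEV -f /var/log/sysstat/sa12 > network-report.txt
```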
Okay, but this is what you would look for, because this is a huge amount of utilization here, for a system that had nothing on it before. If someone's saying "I have an intermittently slow resync," why do I have all of this traffic at exactly 4:10 p.m.? In this case, it's because I was running tests on the system; I had a load generator I was working with, so this is what I wanted. The point I'm trying to make is: why is it starting at, say, 2 a.m. exactly? You're seeing a lot of traffic happen at a certain time, and you can look for patterns of behavior, because you have this output. What is the pattern of behavior for the particular thing I'm searching for, network activity on particular interfaces? I don't think there's a way to break it down by interface... or, if anyone knows, let me know, because I don't know how to refine it finer than these flags I have, but this gives me each of the interfaces and how much throughput was going through them at a particular point in time, so: excellent. So, yes, hello. Oh, absolutely, that's a great question, because I have a whole thing on that; you said memory, right? Yes, great. sar is super useful; it's incredibly versatile, and you're able to filter down. Also notice, when I set up sar, I wasn't like, okay, I wanna do a network test. I didn't need to decide that; I just let it collect everything and then filtered for what I wanted on a particular day. I know we're all sort of inundated by information; sometimes it's an information wall. You need to figure out ways to break it down, make it smaller for your human brain to take in, and this is one way you can do it. Hello. Yes, it also is sysstat there; I do not believe it's based on cron everywhere, though. I believe you can configure it slightly differently; I didn't have time to get into both scenarios here, and I wanted to show cron for this setup, but on, I think, Alma you do it through systemd. It's slightly different, but it's still the sysstat package you would use. All right, so why do we have an activity spike at 2 a.m.? I don't know, maybe there's a backup.sh at 2 a.m. Could be; it's something to look for. In a normal situation, an actual day-to-day in my job, I would probably look in cron myself before I set it and forget it. However, let's say this wasn't scheduled directly on the system; maybe it's just traffic hitting that system. This isn't the whole cluster anyway, so not everything's going to be scheduled directly on it. It's just something to keep in mind, especially if you look at multiple days over time and you see the same thing happening at the same time every day: ask, is this a scheduled activity? Why are we seeing this now? Then you can look for processes or other activities happening within the data path that might coincide with it. Just one way you could do it. Yeah, in this case, this was just an iperf3: congest the entire thing, throw traffic at it for a while, just to make a little demo. So that's sar for you, but we're going to keep coming back to sar throughout the rest of this talk, because it is really useful. Now let's go to demo number three. It's not gonna be all like this, don't worry.
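For what it's worth, the demo's artificial spike was generated with iperf3; the exact invocation wasn't shown, but it would have been something like this, with the peer hostname illustrative.

```bash
# On the receiving node:
iperf3 -s

# On the sender: saturate the replication link for ten minutes.
iperf3 -c replication-peer -t 600
```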
Okay, so now we're going to the next scenario. Ah, yes, here we are. This one: Control plus, are we good? Oh, cool, wonderful, because on this one I actually do want you to see the details, so I appreciate you letting me know about this. This is a scenario building on the other scenarios. You go into the system, and I'm not just gonna go through top again; the activity looks fine right now. I'm not seeing a huge amount of utilization anywhere, not a particular process that seems to be hammering on one single processor, and I also looked through sar earlier and didn't see anything historical of note. So why would that be? This is just a particular case you might want to check for. Let's not forget about the E section of the USE method. We wanna look at what we've got over here. I didn't see any heavy CPU utilization, but the machine is running hot. That's an interesting thing to note: why is it so hot? It would show in dmesg; well, not dmesg here, actually. I didn't know what would happen if I just replaced dmesg, so I put it in a different file. Let's say I wanted to look at just the... okay, well, not that one either. I'm gonna go into our log. Oh yeah. Okay. So this one is specific to an application I work with a lot, and that is Corosync. It maintains a heartbeat between nodes: the nodes need to know the others are alive, so they can do what they need to do, avoiding split brain, knowing they need to kill the other one in case it stops responding, things like that. Really important to stay connected. I happen to know this process runs at real-time priority; that's basically the highest priority you can have. It runs at the expense of everything else; it needs to happen. But the log is telling me that this has not been happening: full seconds at a time where it's not being scheduled. But again, why are we not seeing heavy CPU utilization? Why is nothing showing in sar? With all of this, why is the CPU not being hammered? Well, in this case, it's because this is a VM, a VM on a hypervisor. A few things can happen when you've got a guest and you have noisy neighbors: a different tenant, perhaps, uses all the CPU on the hypervisor, and this VM didn't get any. It can't schedule anything, but it also can't really tell you anything about that. "No, I'm dying," it can't say, because it doesn't have that visibility, in most cases. It could also be a case where the VM was paused on the hypervisor; it's not gonna be able to tell you that either. I don't think that's what it was here, because of the heat, the temperature, but it's something to consider. Let's see, what else did I say about that? Yeah, one indicator you can look at is in vmstat. If you do a vmstat, you can see it. Let's see if I had vmstat in here. Do I have a vmstat? Well, let's see what I've got; it's just vmstat. vmstat's one you can use, and I wanted to show this: there are a lot of ways to see the "st" value here.
We see that in top as well, the "st". Do we see it in top? Yeah, we should see it in top somewhere, but "st" is CPU steal time. Steal time is when a VM is waiting on a CPU from the hypervisor, and it's the amount of time it spends waiting. So if you see high steal time, that might be an indication. It's not necessarily always going to show up that way in the situation I'm describing, especially if someone paused a VM to back it up or something else happened, so it's not a guarantee. But it is one metric you definitely want to keep an eye on, especially if you know you're working with a VM. A lot of times in my work, nobody's gonna give me the hypervisor logs. They just say, "well, this cluster is slow." They're not gonna say, "well, that's because I didn't give it resources to run." Another thing is that at the hypervisor level, someone could intentionally under-provision a guest, to rein it in to a certain amount of CPU, or any sort of resource usage, arbitrarily. So it might be capped because it's been told to be, but it doesn't know that. Just something to keep in mind. These are factors to consider, questions to ask when defining a problem statement. One question to ask: is this a VM? Could you possibly give me some information on what's happening with this hypervisor during this time? That would be helpful. If they can, I would love that. You would love that. Everyone would love that. Okay, so, surprise: this one's different. This is demo number four, but I wanted to show you something before we dive into it. Recall that functional block diagram we were talking about earlier, because this is a primer on how things are working here: DRBD does a synchronous write in most cases. You have a primary that gets the writes; something's written to it, and now it needs to make that synchronous copy, and it needs to do that over the LAN. So the network here is oftentimes the weakest point. That's often the one I'm gonna look at if I'm looking for a bottleneck. It's also one that's fairly easy to check, and something that's easy to check is something I wanna check first, because our production systems are fussy, and I'd rather do the easy things first. That's just context for why I checked the network first in this particular scenario, where someone is saying they're not getting the IOPS they were promised on a newly deployed system. They said they did test before; they think the network latency they're seeing is higher than it should be. There could be a bunch of things going on in this case; I'm not sure what I wrote about that. But in any event, they're not getting the IOPS they originally benchmarked. And I wanna check the network. If I were not on a production system, I could do some A/B testing: I could shut off parts of the functionality for DRBD. I could say, just do the reads, or just write to the primary, don't even do the replication to the secondary, because then it's not waiting for the acknowledgement across the network; it's not going across the network at all. I could do that, but they're using this system, so I'm not gonna do that. What do I do? Well, well. Okay. We're gonna go back to the terminal. I'm gonna use Control plus all the time.
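Checking steal time from inside the guest is quick; a sketch:

```bash
# "st", the last CPU column in vmstat, is steal time: how long this guest
# waited for the hypervisor to hand it a physical CPU.
vmstat 1 5

# top reports the same number on its %Cpu(s) summary line, also labeled st.
top -bn1 | head -5
```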
All right, what have we got over here? Oh yeah, all of my stuff. Great, wonderful. Oh yes. So I wanna see how they've configured this. In this case, this is a DRBD-specific command, but it's just so I can see what IPs their traffic is running on, how they're configured. So we see some IP addresses here. For a DRBD setup, we usually say: if you can, put replication on a network specifically dedicated to replication, its own little lane of traffic. Then you don't have to deal with management traffic or other applications happening within your network that are potentially going to congest everything and delay replication. We just want something nice, clean, and dedicated. It's also better for diagnosing particular issues if we're seeing some slowness there, and this client did do that. Oh, close again. Okay, all right. So what did I say we could do now? All right, so I do want to see at this point: what does this configuration look like? What interface is that? What interfaces do we have on this network here? We've got these ones here, and we can take a look at them with ip. I'm sure people use ifconfig; there are some deprecated tools, and ip is, I think, the one we should be using lately. ip link show should give you information on the interfaces, and with statistics, what's specific to the traffic you're seeing on them. Things that could be indicative of issues with the network include a large number of dropped packets, but what we're looking for right here, in this case, is just what interfaces we have and the overall setup of this interface. Let's see, did I do the link show? Yes, I did, and that was with the -s flag, yes. And then I do the ethtool. Ah yes, and you can also cat the interface counters; that's another way to look at it. A lot of these are looking at the same system counters; here's just a different way to get at them. Use what feels good to you, but look in the documentation for what system counter a tool is referencing. That's another good thing to do when you're doing diagnostics, especially if you're using some manner of observability tool: use different tools that do the same thing and see if you get the same number. If you get different numbers from running what you ostensibly thought was the same measurement, well, now you've got reason to look at it: is it maybe not quite what you thought? All right, so that's just another way we can look at this interface, and we can see the number of packets transferred over it and anything potentially associated. Do I have my ethtool command? Yes, I have an ethtool. I wanted to look particularly at the interface that was configured for DRBD in this case. ethtool is great because you can use it to find the static configuration: what is the currently negotiated speed? You can see it's full duplex here. So, I wasn't very creative with this particular demo, because I ran out of inspiration, but what I thought in this case was that this person said: I did some NIC trunking, and I should be getting double this. And I'm just checking behind them, and I said: no, no, no. This is a misconfiguration.
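The interface checks from this demo, collected; the interface name here is illustrative.

```bash
# Per-interface counters: look for drops and errors alongside throughput.
ip -s link show eth1

# The same counters, read straight from the kernel.
cat /proc/net/dev

# Negotiated speed and duplex versus what you assumed you configured.
sudo ethtool eth1
```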
So the reason they're not getting what they expected is that it's not configured correctly, and this is just one way you can check: what are our assumptions versus what is the system telling us? It's quite easy to misconfigure when you're trying to do some NIC trunking. I have seen people who were trying to bond interfaces together and believed they had, but made a mistake somewhere, just because it's difficult to do sometimes. So just something to keep in mind: check behind yourself. ethtool's great for that. Let's see, what else did I say about that? Yeah, sar. We were talking about sar earlier and its other uses; I love what sar does for network statistics. We looked at these earlier, I think. Now, what I wanna say about this one is: if you want to measure throughput, notice I'm running sar every second, so it shows how many packets are transmitted and how many are received per second. You might need that metric, and now you're getting it per second. You can average these out and do whatever math you need if you're not getting a metric where somebody does the math for you; you can do it just by running sar every second. I don't know if I showed this before: you can say how often to repeat it; if you put a one here, it's once every second. You can also use sar in live mode without having to enable historical data collection, which is just a cool thing: it's ready to go in the chamber, you don't need to set anything up in advance for it. I use sar all the time. A couple of other things that don't quite fit into the demo, but I'm gonna do them anyway, just so I can say some things about them. There are a couple of ways to check for latency on the network. Ping is a valid way to measure the round-trip time of an ICMP echo packet, but I just wanna say something about ping: a lot of times firewalls, if they don't outright block ICMP, will put these packets at a lower priority, so you're not necessarily going to get an accurate measure of what your TCP traffic will see. We're gonna shut that off. Same thing for traceroute: it will also get de-prioritized by the firewall, but it shows each particular hop, which might be useful for you. I think I did it against the Google DNS, just 8.8.8.8; might as well, it's easy to remember. But you can use the -T flag on traceroute to switch it to TCP traffic, so if you wanna get around being de-prioritized, just switch it to TCP. You can do that with traceroute; just something to keep in mind. All right, moving right along. Now we're gonna get into, well, it's gonna say what you expect it to say: demo five. Am I doing all right on time? We started a little bit late, but I'm going to kind of speed through these so that we can get y'all to dinner and whatever you need on time. Let's see. So this person said, for DRBD, they extended the disk size. You can do that: say you've extended the backing volume through a variety of methods.
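In shell form, the live-mode sar plus the latency caveats just mentioned; the target address follows the talk, and traceroute -T generally needs root.

```bash
# Live mode: no historical collection required, one-second samples.
sar -n DEV 1

# Round-trip time via ICMP -- remember firewalls may de-prioritize it.
ping -c 10 8.8.8.8

# Per-hop latency; -T switches the probes to TCP to dodge ICMP de-prioritization.
sudo traceroute -T 8.8.8.8
```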
They extended the file system, because you're able to do that with LVM; they made the backing volume bigger and then resized the DRBD volume. You don't need to know exactly how that works, just that it did work for them. They said: it got resized, I can use the space, but now my system is crashing intermittently. What do I do? So what do we look at? I have a hunch; let me show you. Oh, got him. How are we doing? Yeah. Thank you. All right, so sar with the -B flag is going to show us paging; what we're looking for is a large number of page-outs. This one is not showing a lot right now, because I think I did not manage to do what I was expecting with this demo, but that's neither here nor there. Okay, so this is one to look at. Also, about swap: swap is not necessarily used anymore on a lot of systems, so you might say, well, it's not hitting swap; a lot of boxes don't have swap anymore. But if you're looking at paging, a high number of page-outs is something to keep an eye on when you're trying to diagnose a memory issue. Sidebar on memory issues: the free command, a common one you'd use for memory, can over-count your memory, because it counts shared memory, like a library shared between different processes that are all using it, multiple times. Just keep that in mind; it might not be an accurate measurement. I like to use the pressure stall information, so we can cat that over here. Okay, we're not seeing anything here. These are averages over 10 seconds, 60 seconds, and 300 seconds of time where I/O was stalled waiting for memory; if the system doesn't have memory and is waiting for it, this is where it would show. It doesn't show anything right now, and it might not for a number of reasons; it could be a really recent spike. I'd suggest you also look at the totals: if you had a spike in the past that isn't necessarily reflected in these averages yet, take a look at that cumulative total and see if it has increased. So you can catch it even after it's been averaged out. Let's see, yes. Okay, so what was the problem? Well, I didn't actually trigger the problem, because it didn't work the way I wanted, but in this case, why did I suspect memory? Because they resized the DRBD volume. DRBD requires roughly one megabyte of memory per terabyte of storage. They increased the storage, and they did not remember (well, maybe I didn't remember, but let's say they didn't remember) to account for that. Maybe they were right at capacity for memory, then they increased the volume, and now they need that much more for the bitmap it has to maintain. Maybe they had enough before, but not after the increase. So, all right. One last one, and it's my favorite one, so it's last. A client says they have a newly deployed cluster, and it's not performing the way they expected: higher I/O latency than they expected to have. And I said: how did you run the test? Really important, because how you've decided to write your test for disk I/O latency will make a huge difference in how it's represented. So let's say, oh, there's disk two over here. Puff puff puff. All right.
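The memory-side checks from this demo, gathered into one sketch:

```bash
# Paging statistics every second; a sustained climb in page-outs (pgpgout/s)
# and major faults (majflt/s) points at memory pressure.
sar -B 1

# PSI for memory: avg10/avg60/avg300 averages, plus a cumulative total that
# still betrays a past spike after the averages have decayed.
cat /proc/pressure/memory

# "free" is handy, but shared memory can be tallied more than once.
free -m
```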
All right, one last one, and it's my favorite one, so it goes last. A client says they have a newly deployed cluster and it's not performing the way they expected: higher IO latency than they expected to have. And I asked, how did you run the test? That's really important, because how you've decided to write your test for disk IO latency will make a huge difference in how it's represented. So let me pull up disk two over here. All right. There are a few factors that go into this hypothetical. Let's talk about working set size; that's the biggest thing, and that's what this demo is ultimately going to show, though I'm sort of speeding through it. In this case, you have a working set size that is very small. A working set that's super small can fit entirely in the cache, and cache, as we'd expect, memory cache is super fast. That can be a difference of orders of magnitude: you could see latency of less than 100 microseconds out of the cache, whereas if it has to hit the disk, because the working set was too big to fit fully in the cache, that's way different, maybe up to eight milliseconds or even more. What I'm trying to say is that working set size gives you a multimodal distribution: everything that can be served from the cache is going to be super fast, and everything that has to hit the disk is way slower. Just keep that in mind: if someone has a super small working set size for their application, where all the memory it requires fits in the cache, and they've now tested with a different workload that can't do that, it's going to be a much different situation. All right. I want to make sure I respect everybody's time, so I'm going to get back to the slides, but I have everything summarized on there, so I can still talk about it; I'm just not going to be fumbling around on the command line. Don't worry, I'm talking about the same things. Okay, so random versus sequential is super important. Let's do sequential first: sequential is where each write starts where the last write ended, so the offset is exactly at the end of the previous one. As you might expect, for hard disk drives that are magnetic and rotational, a lot of your disk IO time is seek time, the head moving around. That's less of a concern for SSDs, but it matters if you have a platter-based drive. A lot of people will test with dd, which is only going to do a single-threaded sequential write. Well, what if your application isn't writing sequentially to the disk in perfect order? It's going all over the place, as you do. You can't really test for that with dd, but you can with FIO, the flexible IO tester, and that's what you want to use for random writes. I love FIO because not only can you do sequential, you can also do random, you can toggle various settings, and you can fork it into multiple jobs that repeat within the same run. Doing a test multiple times usually makes you more confident in the results than doing it just once. You've also got to think about whether you warmed up the cache, because that can be a factor. Ideally you would clear buffers and caches between runs, especially if it's the same type of write, but we're on a production system, so we can't clear our buffers and cache; that would be rude of us. So we're not going to do that, but keep in mind that a warm cache might give you different results. Overall, if you're benchmarking in general, repeating the benchmark numerous times in numerous different settings will give you more confidence in the result.
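As a rough sketch of that dd versus fio comparison; the file names, sizes, and runtimes here are placeholders, not what was used in the demo:

# Single-threaded sequential write, which is roughly all dd can tell you
$ dd if=/dev/zero of=/mnt/test/ddfile bs=1M count=1024 oflag=direct

# fio doing random 4k writes across four jobs, which is closer to a lot of real workloads
$ fio --name=randwrite --filename=/mnt/test/fiofile --size=1G \
      --rw=randwrite --bs=4k --ioengine=libaio --direct=1 \
      --numjobs=4 --runtime=60 --time_based --group_reporting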
FIO is great because you can just say something like numjobs=4, set it and forget it. All right, let's see. One way you can tell whether you're hitting the cache really fast: run iostat with the -x flag while everything's running. iostat is enabled in the kernel by default and it's super low overhead, so you don't need to worry about running it; just keep it going in a different tab while you do your tests. iostat is a tool that should really be called disk iostat, because it only reports disk IO. So if you see activity there and you expected the test to hit the disk, great. If you didn't expect it to hit the disk, well, now you know. Conversely, if you're doing a test and it never hits the disk, you might want to do a different test if you expect your real workload is going to hit the disks. All right, I'll be real fast to be respectful of folks' time. The ratio of reads to writes: if you have a metric that combines the two, be suspicious of it, because your application has its own ratio of reads to writes in a lot of cases. Figure out what that ratio is and test accordingly; if you lump reads and writes together in one metric, you can't see the different results for each, so you get a less accurate picture of what's happening. Obviously SSDs, as we talked about with seek time, perform differently than magnetic rotational hard drives. Also, I don't know if anyone's seen the YouTube video of Brendan Gregg shouting at a disk array; it's super funny if you ever want to check it out. This was back before they did more sound-proofing, and he had a JBOD in a server room and just yelled at it, and then he could say, oh look, the IO latency went way up, because of the vibration. The disks didn't like it. The RAID configuration can make a big difference as well in how things are written for your application. If you have parity RAID, say RAID 5 where you've got a parity disk, you have to do a lot more parity calculations, which cause a lot more IO operations; it's going to be less performant than, say, a stripe across a couple of disks. Stripe size is another RAID concept that can be configured. The size of your writes can be matched to the stripes, to how the data is laid out across the disks you're combining for increased performance. Match that stripe size to what your workload actually does: find out what the size of your writes is, and then you can pick better stripes. If your writes are bigger than the stripe size on the array, you might have problems, because now it has to move around a lot. And the size of the individual writes matters on its own: a larger write is generally going to get higher throughput, but the per-operation latency goes up. They just behave differently; a small write behaves differently than a larger piece of data that you're writing. All right, I sort of flew through this one.
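A minimal sketch of watching the disks while you test and of testing with your actual read/write ratio; the 70/30 mix and the paths are just examples, not anyone's real workload:

# Keep this running in another tab: extended per-device stats every second
$ iostat -x 1

# fio mixed workload: 70% reads, 30% writes, random 4k IO
$ fio --name=mixed --filename=/mnt/test/fiofile --size=1G \
      --rw=randrw --rwmixread=70 --bs=4k --ioengine=libaio \
      --direct=1 --runtime=60 --time_based --group_reporting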
So I'm trying to see if there's anything else before I go to the last slide, because I know I'm kind of going over. Yeah, okay. You can use the direct option to account for not hitting the file system cache when you're doing your disk IO tests; you can do that for dd as well. Just keep that in mind, though the difference might not be appreciable; most of the IO you're going to see here is disk IO anyway. One way to check that it really is the disks that are the problem, your bottleneck: check your total application latency. Say we're doing a macro benchmark, talking macro benchmarks again. If the latency for the whole thing matches what you're seeing in iostat, it's probably the disks, because you just found it: there it is, it matches. If it doesn't look the same, it might not be them. All right, we're just going to move on. We already talked about iostat. Iowait is interesting because it's a little bit deceptive. Iowait is the measure of the time the CPUs are waiting for disk IO to complete, but keep in mind that how busy your CPUs are is a huge factor in that. Let's say they're performing some super inefficient processing: you're doing bad things to those processors and they're working hard. Well, then they won't notice that they're waiting for anything, because they won't have any idle time in which to wait for the disks. Then you upgrade the system or make a program more efficient, and now you're seeing higher iowait, and you're like, but I upgraded the CPUs, why am I seeing more iowait? Because now they have time to wait. Conversely, iowait can go down only because your CPUs got busier: your problem might have gotten worse, but you won't be able to tell. That relationship is just something you want to consider. Any iowait at all, keep an eye on it, because it's very indicative that there's a bottleneck in the relationship between the CPUs and the disks. And then smartctl or your RAID controller tools: those are very vendor specific, but you can poll them accordingly, and if you need to check and scrutinize individual disks, that's how you do it. As for a CPU profile, I think you could do that with perf. We didn't have time to get to that today; we didn't have time to get to a lot of things, but we certainly didn't have time to get to perf, and we didn't have time to get to eBPF. It's genuinely exciting if you read about it, though. These slides will be available through Scale, so if you have not looked at the extended Berkeley Packet Filter and the extensions thereof, definitely take a look. Thanks.
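Just as a rough sketch of those last checks, with the device name being a placeholder and perf thrown in only because it came up:

# The 'wa' column here is iowait: CPU time spent waiting on disk IO
$ vmstat 1

# Per-disk health and error counters straight from the drive
$ smartctl -a /dev/sda

# Quick live CPU profile, since perf got a mention; we didn't cover it in the talk
$ perf top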
Okay, well, overall, to conclude: I've said a lot of things about DRBD and clusters, but all of the tools I've focused on and mentioned can be run on pretty much any Linux system, and hopefully there are some things I've shown today that, even if they don't show you everything, are directionally correct. Sometimes that's as much as you need when you're just trying to narrow a problem down: you don't have a lot of time and you don't have the overhead to do a full stack trace, and this might give you just enough correctness that you can put out the fire and keep on moving. All right, if anyone has questions, I have my business cards and such, and we can talk over email, but otherwise I'll let y'all go so you can get dinner. Yeah, hey. The CPU metric, are you talking about iowait? Well, we talked about steal time; I'm not sure if I mentioned it, but steal time is the one on the VM side. How do you spell that? S-T-E-A-L, like steal, like you stole it. Steal time. When you look at it in top or vmstat, you're going to see it as "st," but in the man pages you'll see it as steal time. Yes, like I've taken something from you and you did not want me to do that, and you're not happy about it. And where would you see that? At the VM level? Yeah, you can see it in top and you can see it in vmstat as well. A bunch of the tools that show a collection of different performance metrics will show steal time; SAR is one where you can also see it with certain flags. Whatever your tool of choice, you're often going to see steal time, and this is what it's going to refer to universally. Yeah, it's in the top section of top's output. There's a lot at the top of top, so I won't say it's the tip-top, but it is in the top of top, yes. Yeah, of course. I know iostat doesn't report network statistics, but what about NFS, which is kind of a storage piece? Oh man, NFS. I didn't want to get into NFS storage either, because that is a whole different beast. NFS is separate, but I would consider that logical IO, not physical IO, so iostat would not be reporting on it; you would not see it. With iostat you're only seeing physical IO, the IO of the disks actually moving; NFS would fall under logical IO.
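To make that steal time answer concrete, a minimal sketch of where the column shows up; nothing here is specific to any one system, and nfsiostat is only a suggestion for the NFS question:

# vmstat: the 'st' column on the far right is steal time
$ vmstat 1

# sar CPU utilization: the %steal column is the same metric
$ sar -u 1

# top shows it as 'st' in the %Cpu(s) line near the top of the output
$ top

# Per-mount NFS statistics live in a separate tool, for example nfsiostat from nfs-utils
$ nfsiostat 1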