Howdy, howdy, all right, there we go. Thanks, everybody. How's everybody doing this morning? Everybody's good? We're all awake, are you sure? Okay, so I figure I have about 20 minutes until the sugar starts to spike down from the cocoa puff marshmallow things this morning. That was awesome.

All right, so in this talk this morning I'm gonna hit you over the head with a few things, mostly around production systems, which is what I tend to deal with a lot: testing and monitoring, two topics near and dear to my heart in some ways, and organizational boundaries and how they create issues that have to be worked through. A lot of people have a lot of different opinions on monitoring and testing; you can find a certain amount of drama about these things. Let's see if this thing is on... okay, this thing is now on.

So I tend to think of monitoring this way, and this may be controversial. For those of you who are professional QA people, don't take this the wrong way. But I look at monitoring as being like testing in a lot of ways, except you do it in production, where the important things happen. And so I tend to focus more on monitoring as the thing I'm concerned about, and I'll get into why that is. That is essentially the heart of the talk: there are trade-offs to basically everything that we do, right? Every system that you run is a series of trade-offs and decisions that you make. Figuring out the right trade-offs for the systems you're gonna run, that's a big part of what you're gonna do with testing and monitoring, and how that flows all the way from idea into production and keeping customers happy and that kind of thing.

Dolly had a really long intro that he had to read for me. The one thing I will mention: I have been known to avoid social norms that most people tend to adopt. So I put this shirt on for the conference's protection, so that they can disclaim any knowledge of who I am afterwards. But we'll try to keep it friendly and happy and good for a Friday morning.

So, all right, testing. What do I really think of testing? I think it's really important. You actually have to do it, right? It's not a thing that you can ignore. I have seen many a software developer write code without testing it, and I've probably done this myself, although I don't know that I'd admit it. There's that old thing about throwing it over the wall: they don't actually test it, and it gets there, and it's broken, and all that. But I think generally most people do some level of testing, right? You write code and you at least ask: does it compile? Can I actually bring up the thing and see that it ran? That may not be very formal, but there's always some level of testing going on.

So everyone understands there's this baseline of testing, and then in some organizations it's a very rigid procedure with lots of policy documents and all that stuff. But no matter how much testing you do or how much you agree that it's a thing, whether you have a QA department or not, it doesn't matter at some level, right? I've never known an organization that said, "we don't really have to worry about production because we have a really awesome testing situation." Nobody has ever said that, right? Nobody says, "well, we don't have on-call because we use testing." It's not like it's the other choice, right? And I think some people do look at it that way.
And you know, if you go back 20 or 30 years, there was this sort of mentality that if you're shipping software on a CD, you really did need to test it quite a bit, because if you had a problem, it wasn't like you could easily update it. But that's not the world we live in anymore, right? In most cases, anyway; not for everybody. There are certainly areas of government where you can't necessarily just go slap a patch on a submarine that's out in the ocean or whatever. But even that's starting to change. So you've got to have some level of testing, and you have to realize there's gonna be more to it.

There are all different kinds of testing you can do. Different people are fans of different methods; there are entire conferences dedicated to the subject. I would say go with whatever seems to work for you. I don't really have a favorite, and anyone who's run my code will probably tell you that's obviously true. Some testing, I think, is very difficult to do. Certainly as you start to get into really large distributed systems, or running at really high scale, you find that things like performance testing become really difficult. The simple example: if you were Facebook, how would you duplicate production traffic for performance testing? That's a really difficult thing to do, because it's not like you have a spare billion people who can go click on stuff for you, right? And everybody starts to approach that problem if they're successful; it becomes the thing you have to deal with.

And I wanna point out, like I said, testing is necessary, but it's again a trade-off. It's not the thing that's gonna solve all your problems, and I do think people tend to put too much emphasis on that side of the equation; I think it's sort of an old-school mentality. One of the problems that I see in testing, no matter which style you do, whether you're a functional tester or you do test-driven development or whatever: you think that's the answer that gives you working code and working software, but the problem is that testing tends to be very deterministic. You pre-define the tests that you're gonna run. It's really hard to build a test suite that randomizes tests and gives random inputs as it goes, because you need to be able to go back to a failing test and duplicate it. If you can't do that, the testing doesn't really work. And when you have deterministic testing, obviously you're gonna start to create blind spots.

One of the first places I see these issues is the data problem: how much data you're collecting, the type of data you're collecting, the quality of the data that you have. There are folks who dedicate their lives to this kind of stuff: how do we generate test data, how do we build these kinds of things? But it's not as good as the weaponry that users have when they start creating this kind of stuff. And there are known examples. How many of you have heard of Wolfe+585? I didn't see any hands, which doesn't actually surprise me. If you were a professional QA person and that was your jam, and all you did was try to figure out how to break software on purpose (not one of those people where software always seems to break when you touch it), you would know Wolfe+585.
And if you look it up, I'm pretty sure there's a page on Wikipedia about it. The basic idea is there's a name, which I will not attempt to pronounce, a last name that's basically "Wolfe" plus 585 more letters. And this is a legitimate person's name. So when you think, well, I don't know, 100 characters ought to be enough for anybody's last name: that's not necessarily true. And we've all heard those stories, whether it's a foreign language, or somebody has a space in their name, or whatever; there are all these corner cases that people think of, but they probably don't test for this guy, right? And if he shows up on your site, your code's probably gonna break. And I don't necessarily blame you. You could go out and test for this now; you've now heard of it, so you could mock something up if you wanted to. But that's sort of the thing, right? Users are so crafty and creative that no matter how good a prediction method you think you have, they will outdo you. I've never been able to figure out how to out-think my users in ways of breaking systems; they're way better at it than I am. And I have access to the code, so you'd think I could do it better than they do, but no, they'll find a new way.

How many of you have heard of the Corrupted Blood Incident? Okay, a few old-school folks; those are the gamers in the crowd. Even if you haven't, you're probably familiar with the concept of World of Warcraft and these massively multiplayer online games. The Corrupted Blood Incident was a bunch of years ago. Basically, they introduced, I don't know if you'd call it a feature, sort of a game mechanic: they wanted to have a thing that's like a disease that a character could get. And while a character had this infection, they would take damage within the game. The basic ways you would try to get rid of it were to either go find other players in the game and stand next to them or be around them, and it would transfer to those other players; or you could die, and when you died, it would also spread to other players that were nearby. Now, in most cases, dying is not the first option you would pick, so you might try to find other players and do that.

In the beginning, this was sort of an interesting game mechanic: players would end up getting this, and they'd panic a little bit and try to figure out, how do I get rid of it? What do I do? What the designers didn't really account for was that certain players in the game had been playing a really long time. Some of them understand how the game mechanics work far better than the people creating the world itself, and they had characters powerful enough to have abilities the designers didn't really account for, abilities the typical player may not have. Where this started to become a problem: the idea was that this would initially be out in remote areas of the world, in campaigns where you only have a limited number of players. But some of those players would get this, and they were strong enough that they could actually work their way back into the major towns and cities within the online world, right?
And when you work your way back into a town where there are hundreds of thousands of other players, you end up with a situation where new players would join the game, immediately get infected, and then die within minutes, because they didn't have a strong enough character to actually take the damage and stay alive. You end up with a situation that's really not as fun as you'd think it would be, right? And the people who made the game didn't perceive that this could be a thing that would happen. Before they knew it, entire cities were basically just getting wiped out. What they ended up having to do was basically shut the game down: they did rolling restarts of all their servers to clear out this feature, quote unquote, this customer value that they had delivered, so they could start the game over and players could actually play again. And what's amazing is, you know that generally the people writing game software do actually play these games; a lot of them get into it because they like gaming. But they didn't out-think the users in that case. They weren't able to predict where this was gonna go. So it's interesting to see those kinds of cases out there in the world. And again, that's what I say: you can do a lot of testing, but predicting the future still remains hard. We haven't figured that one out yet.

And this is the thing: there are so many factors to consider, and they all pile on top of each other. Trying to have foresight into what people are gonna do, trying to figure out all the use cases. Assumptions can change over time. When you introduce one game mechanic one year and then another one three years later, do you remember the thing you had back then? It's kind of like code, right? Sometimes you change a piece of code and you're like, I'm not really sure all the features this code touches, but, you know, hopefully it'll work. You'll never be able to add enough tests to be comprehensive about all that.

And the thing that's important to realize is that getting tests for everything in the system really is not your goal. The way you need to think about testing is not "we're gonna stop all the bugs," because that's not gonna happen. Testing is really only for one thing: it's a confidence game. It's about knowing that I can make some changes to the system and not be super paranoid, not be freaking out that when I roll this code out to production it's gonna blow up the world. And that's all you really need. If you have super elaborate test suites that take days to run, you probably need to chop those down, because you're never gonna be able to roll out code faster than those test suites run. So figure out: what are the important things here? What are the things I know I need to deliver, and how do I reduce the scope of that testing so I can have strong confidence? And you can actually have multiple channels for testing. You can have a fast lane for "we think most stuff will be fine with this," and a comprehensive test suite that runs offline. If you wanna do those kinds of things, I would say go for it.
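To make that concrete, here's a minimal sketch of one way around the reproducibility problem with randomized inputs from earlier: pick a random seed once, drive all the "random" name generation from it, and print the seed on failure so the exact run can be replayed. The `create_user` function is a hypothetical stand-in for whatever your system actually does with a name, and the Wolfe+585-sized lengths are the point.

```python
import random
import string

def random_name(rng):
    """Generate names of wildly varying lengths -- Wolfe+585 style --
    mixing letters with spaces, hyphens, apostrophes, and a few
    non-ASCII characters."""
    length = rng.choice([1, 8, 32, 100, 255, 600])
    alphabet = string.ascii_letters + " '-" + "åöüß"
    return "".join(rng.choice(alphabet) for _ in range(length))

def run_random_name_tests(create_user, runs=1000, seed=None):
    """Randomized inputs, but reproducible: the seed is chosen once,
    printed on failure, and can be passed back in to replay the run."""
    if seed is None:
        seed = random.randrange(2**32)
    rng = random.Random(seed)
    for _ in range(runs):
        name = random_name(rng)
        try:
            user = create_user(name)  # hypothetical system under test
            assert user.name == name, "name was truncated or mangled"
        except Exception:
            print(f"failed on {name!r} (replay with seed={seed})")
            raise
```

The run is still random day to day, but any failure comes with the seed you need to reproduce it exactly, which is what makes randomized testing workable at all.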
But your goal is really just to understand: how confident am I that when I push this code, I'm not gonna blow everything up? And it doesn't have to be 100%. This is the thing that I think a lot of people get tripped up on, and it's culturally dependent. If your job is on the line whenever you blow up production, you're probably gonna say, well, "reasonably confident" is not a good enough metric for me; I have to be 100%. Find a new job, I would say, because that's not gonna work. At some point you'll make that mistake, and you won't be able to blame it on your coworker like you're gonna try to do, and eventually it's gonna fall on you. So try to understand: what's the level of confidence I need? Make tests for those things. Think about the key features you need to keep running, and ways you can make sure those features keep going. That way you can deal with the known knowns, the things you understand: this is how the system would break if we changed it in certain ways. But again, this is not good for the unknown unknowns. It's not good for the thing you didn't have a good way to test, or the thing you didn't predict. And that is where monitoring comes into the picture. This is the other side of that trade-off.

I hope in this crowd nobody is saying "we shouldn't really monitor things," right? It's sort of like testing: we all think we've gotta do some of it. Hopefully we all think we have to do some monitoring. If you don't think you have to do some monitoring, go down and talk to the monitoring vendors. They'll give you a lot of good reasons to monitor, I'm sure; they must have some. I sum it up this way: software isn't ever gonna be perfect. It's made by humans, and humans aren't perfect. It involves computers, and computers are naturally against us. I mean, the AI might not be here yet, but the computers are already against us. I think we can all get on the same page with that, right? Everybody gets that.

Systems are definitely complex. Most of us really don't understand how computers work anymore; we have different levels of abstraction that we understand, but nobody knows the assembly that goes into their Mac. And when I say nobody, there are probably a few in the crowd who do, but for the most part, most of us don't. There are always external dependencies, and this one is definitely growing. The whole push for microservices, using APIs, bringing more REST integration into your application: all of that starts to create a huge web of external dependencies, and many of those dependencies also have their own external dependencies. So that's a big thing. And then there's trying to figure out how we manage all that in a way that keeps us sane, without losing our minds over the level of complexity and change going on in the system.

And change is the great equalizer of all things, right? Testing is really good, and the development model is really good, because generally those are fairly static things you can wrap your head around; they don't change wildly out from underneath you. But in production?
Especially when your production systems live on the Internet, things do change. They change a lot, and they're beyond your control. And in most cases, you want that change. You want new users to sign up, you want them to buy stuff from your site, you want them to register for your service. Whatever the thing is that you're providing to people, you're looking for that change to happen, and you want it to keep going.

So that's the big question to ask. I've convinced you monitoring is a good thing; that was probably the easy pitch. What is it that I actually need to monitor in a system these days? There's a saying, right? "In God we trust; all others we monitor." There are all these pieces to the systems we deal with: web servers, databases, caching systems, the integration points of APIs. You can have monitors on those. You monitor your side of the API and their side of the API, so you can point to which side is broken. Not for blame, because we don't ever blame anyone for anything, but you need to understand where you're gonna focus your troubleshooting. There's performance, there's user behavior. Think about application features: monitoring whether a feature is still used by a majority of your users, or whether people have stopped using it. Maybe they're using a feature in a way you didn't predict; that happens all the time. You can put monitoring on a lot of things, and then the questions come: how much of it is enough? And there's the other question too. We saw this yesterday; there was an open space on "am I logging too much?" Logging is in that same realm of what you're monitoring and how you're going about it, and maybe you are monitoring too many things. So you've gotta know what is actually important for your business. What does it mean to monitor a thing?

There's an application many of you have probably heard about. I won't name any names, but a lot of people use it as a sort of distributed messaging system; they talk to other people over the internet. And there have been cases where, from one way of looking at that service, you would say everything is fine: servers up and running, you could hit API endpoints and get 200s back, so the status seems to indicate that it's okay. But you would send your message, which may or may not be a tweet, and you'd expect other people to get it. The thing you really care about is: are they actually getting the messages I'm sending? You don't really care whether a ping check comes back and says the server's up. You don't really care if the status is 200. What you really care about is: are my messages getting delivered? Can I go back and see that people are getting this thing? If I look at my history, do my messages show up there as if I've actually sent them? If that's not happening, then something is wrong. So start to think, when you're monitoring a system: how do my users perceive this system to work? And then how do I build monitoring around that idea, so we can see if they're having a bad experience? Again, servers working does not mean the business is working. Servers can be up and fine and it doesn't mean things are working; and the converse of that is also true.
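As a sketch of what that kind of user-centric check might look like in practice: send a uniquely tagged message through the service's own API, then poll the recipient's history until it shows up. Everything here is hypothetical (the endpoints, the canary account); the point is that the check measures delivery, not whether a status endpoint returns 200.

```python
import time
import uuid

import requests  # third-party: pip install requests

API = "https://api.example.com"  # hypothetical messaging service

def check_message_delivery(timeout_s=30.0):
    """Send a uniquely tagged message, then poll the recipient's
    history until it appears. Verifies the thing users actually
    care about, not whether a ping check passes."""
    marker = str(uuid.uuid4())
    resp = requests.post(f"{API}/messages",
                         json={"to": "canary-account", "body": marker},
                         timeout=5)
    resp.raise_for_status()  # the API said "accepted"... but did it deliver?

    deadline = time.time() + timeout_s
    while time.time() < deadline:
        history = requests.get(f"{API}/messages",
                               params={"user": "canary-account"},
                               timeout=5).json()
        if any(m.get("body") == marker for m in history):
            return True  # the message actually arrived
        time.sleep(2)
    return False  # servers may be "up", but delivery is broken
```

Run something like this on a schedule and alert on failures, and you're monitoring the experience rather than the infrastructure.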
One of my favorite customers taught us a very valuable lesson about monitoring things like the temperature of your servers and the temperature in the server rooms. We used to do this automatically: anytime we spun up a machine, we would automatically put a monitor in place to see how hot the box was, because everybody knows, oh my God, if the server gets to 70 degrees, the world's gonna end. And we very succinctly and accurately got told by one of our customers: why do I care how hot the box is? Literally, he said... well, not literally; there's a code of conduct, I'm told. Basically: "I don't give a bleep if the data center is on fire, as long as I'm still making money." The important metric to this guy was his revenue stream. How many orders are getting processed? It really wasn't how hot the server is. And to be honest, he was right. If the server was on fire but that other metric was still going, I mean, it would be nice to know the server's on fire, but it wasn't the primary thing; the business hadn't stopped at that point.

This is a very good point and a very good thing to think about. Many of the things you put monitors on maybe are not business impacting. And if they're not business impacting, then you've really gotta think: how do I want to be alerted about this situation? Should I even monitor it? Do I need to? If I just focus on monitoring business-impacting things, is that enough to get me through my day, as far as making sure the service is up and running? I think most people do focus on the wrong things. Part of it is trying to be proactive, but another part is that we tend to think as technologists: how do I know if the technology is working, not how do I know if my business is working? And that's the mindset you really should start with. It's not just for your benefit; it benefits the organization you're at, which ultimately comes around full circle, because people realize, oh, these are the things that are important, these are the things we should funnel our money and our energy into. Help people monitor the metrics that are about how we're doing on the revenue side, or how we're doing on user registrations.

And I will be honest: I've been doing talks for a long time. I've been doing DevOps, I guess, for a long time, whatever that means. Part of this is my fault, because I've given talks on operations and monitoring and that kind of stuff, and I think part of this is the monitoring vendors' fault too. Everyone has heard that phrase, "monitor all the things," right? We've said it for years within the industry, and I think we're starting to see that conversation change, which is definitely a good thing. It isn't that you should monitor all the things. Literally all the things means every spindle on a machine, every CPU instruction on a VM, and those things really are probably not that important, or at least they're certainly not the place to start. You really need to start with: how do I know my organization is functioning correctly?
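One way to read that lesson as an alerting policy: page a human only when a business-bearing metric degrades, and file everything else for business hours. This is a minimal sketch under assumptions; `get_metric`, `page_oncall`, and `open_ticket` are hypothetical helpers standing in for whatever your monitoring stack actually provides.

```python
from statistics import mean

def evaluate_alerts(get_metric, page_oncall, open_ticket):
    """Gate the 2 a.m. page on business impact, not hardware trivia."""
    # Orders per minute over the last 10 minutes, compared against
    # the same window a week ago as a rough seasonal baseline.
    current = mean(get_metric("orders_per_minute", minutes=10))
    baseline = mean(get_metric("orders_per_minute", minutes=10, offset_days=7))

    if baseline > 0 and current < 0.5 * baseline:
        # The revenue stream is actually impaired: wake somebody up.
        page_oncall(f"orders/min at {current:.0f}, baseline {baseline:.0f}")
        return

    # A hot box (or a full disk) with money still flowing is a
    # ticket for the morning, not an emergency.
    if get_metric("server_temp_celsius", minutes=1)[-1] > 70:
        open_ticket("server running hot; look at it during business hours")
```

The thresholds and metric names are made up; the shape of the decision is the point: business impact pages, everything else waits.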
And, if you like drama (and the internet is great for drama), there's been drama on Twitter, which is short-form drama, as opposed to Facebook, which is long-form drama, about the terms "monitoring" and "observability." I think when we said "monitor all the things," what we actually meant was: we want to observability all the things. We want to have introspection into the systems we're running. We wanna be able to go and debug those systems when there are problems. But that doesn't mean you have to have a monitor on every piece of your system; it means you need to be able to go and get that information when you need it. And "observability all the things" isn't quite as catchy, so I'm sure it's not gonna take off, but at least keep the idea in mind: how do I get introspection while keeping a concise set of monitors and alerts and those types of things? How do I make sure my organization, my business, is actually still functioning? I'm gonna start from that premise: I'm gonna make sure that user registrations are still working, and I'll have a metric on how many users are registering over time, and I'll use that as a baseline for whether there's an actual problem. It's a bit of a top-down approach, but it's not a bad idea.

I always like to give a real-world example when I can, so I'll give you one. An online marketing company we were working with: major e-commerce, 100 million users. Really big systems, lots of code, lots of metrics; we're collecting all of that. And as with most customers, things often start with a call. The customer comes to us and says, hey, I was looking at my graphs. (We expose graphs to everybody in the organization, and if you don't do this, maybe that's the first thing you should do: expose your graphs to everybody in your organization. But we do that.) He says, I'm looking at the revenue graphs and I see there was a dip over the last few days. And if you look at this graph, we see spikes and normal patterns and then this sort of weird dip. And we're like, well, okay, there's a dip in revenue. That's interesting, and maybe a bad thing.

So we overlay a graph on top of that. We use a monitoring system that allows us to do graph overlays, so we can take metrics and put them on top of other metrics. We overlay traffic on top of the revenue graph and say, well, okay, you had a dip in traffic, so maybe that's why you had a dip in revenue. There's a little bit of debate on that, so we say, well, let's keep going through the system. We overlay load time on this, and load time is fine; load time is sort of the bottom line there, and the load on the systems is fine the entire time. So we think, well, it wasn't a server problem. It wasn't a technical issue like that, but we definitely had less traffic. And then one of the app developers pipes up and says, I bet it's the database. Because it's always the database, right? It's the first bet. I don't know why we didn't just start there. But anyways: bet it's the database. Okay, so we overlay the database graphs on top of that, and how does the database look? It looks pretty steady. It's basically just a solid line.
So we look around some more, and again, the system we have lets us build graphs on the fly from any metric we've collected, and that's sort of key. So we ended up building a graph of email bounces. And what we see is: hey, you're doing a lot of marketing, and marketing drives people to the site, and suddenly there was a spike in email bounces. Well, hey, guess what? Spike in email bounces, less traffic to the site. All the systems are working fine, but that leads to less revenue. So then we dug into why the emails were bouncing, and we were able to resolve the issue. But it was that level of introspection that got us there. It didn't seem to be a problem until we knew there was a business impact to it, and once we did, we could go look at different causes around it and see why we might be in trouble. You can imagine: what if email wasn't actually monitored? Luckily we had enough monitoring in place that we were able to generate the graphs and correlate it back to the systems we were working on. I would say, if it hadn't been monitored at that point, you know it would have been monitored going forward. And that's a little tough, because it's tough to predict that stuff. That goes back to this idea of observability and instrumentation: you wanna instrument your system so that you can go back and figure these things out, and pull those metrics out and do interesting things with them.

We actually had this same problem again a few months later: all the metrics were in their normal ranges, but revenue went down. In this particular case, what we ended up doing is we overlaid a graph that showed higher decline rates on credit cards. And this time we thought we were smart, because we figured this one out without the customer being involved. Then we call the customer up and they're like, oh yeah, yeah, we know. We're having a dispute with AmEx right now, so they're blocking our charges. Yeah, that's just that. And it was like, well, darn it. That's what happens the minute you think you're smart and don't involve the customer, right? Get the organization involved. There might be a business issue going on that will impact your production that has nothing to do with technology. And you have to remember: always keep interweaving business and technology. That's the point of this thing. You have an organization, and the whole organization is there to accomplish a thing. If it feels like technology is in there trying to do one thing and the organization another... I mean, okay, maybe some of your organizations are that way, but that's not the goal, and that's not how it should be. And hopefully we can all agree that's not the way to success down the road; that's what leads to us being frustrated.
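The bounce-spike hunt was essentially: overlay candidate series and find the one that moved when revenue did. Here's a minimal sketch of automating the first pass of that, assuming a hypothetical `fetch_series` helper that returns hourly values for a named metric; the z-score spike test is just one simple way to flag "this series suddenly moved," not the method from the story.

```python
from statistics import mean, stdev

def find_suspects(fetch_series, candidates, hours=72, z_threshold=3.0):
    """Flag candidate metrics whose recent values spike well outside
    their own history -- the email-bounce needle in the haystack."""
    suspects = []
    for name in candidates:
        series = fetch_series(name, hours=hours)   # hourly data points
        history, recent = series[:-6], series[-6:] # last 6 hours vs. before
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and any(abs(v - mu) / sigma > z_threshold for v in recent):
            suspects.append(name)
    return suspects

# Hypothetical usage during a revenue-dip investigation:
# find_suspects(fetch_series, ["site_traffic", "page_load_ms",
#                              "db_queries", "email_bounces",
#                              "card_decline_rate"])
```

None of this replaces a human looking at the overlays, or a phone call to the business side, as the AmEx story shows; it just shortens the list of graphs to pull up.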
So, I've got a few minutes left; let me TL;DR this a little bit. Testing and monitoring, not testing or monitoring. I think everybody can agree to that. We need both sides of that equation: some level of testing to gain confidence for production changes, and then monitoring in production, because we understand we can't predict the future well enough to be comprehensive on the testing side.

Again, you've got to understand the organization that you're in: understand why it's there, what its mission is, what the point is of what you're doing. Like the CIO from Baltimore said, and I mean, it's correct: technology for the sake of technology, what's the point of that? It's fun on the weekends, but that's not why we're getting paychecks. Add observability to as many things as you can; try to figure out how you would go into a system and debug it. That doesn't mean you have to keep years of log files, and it doesn't mean you have to stream everything and have a graph on it, but figure out how you will get that information out. And when you're building an application, figure out: how would I know that the application is working and accomplishing the reason I'm building this feature, this application, this microservice? It's there for a reason. It should expose what those reasons are and how to tell whether it's healthy or not. Make sure you're monitoring those impactful things. You usually had a reason why you put a thing out; what is it that you expect it to do? Go back and verify it's doing the thing you expected it to do.

And lastly, only alert on things that are actionable emergencies. Again, you can tie this back to business metrics, and it's the one time the business side will probably be on your side. You don't have to get up at 2 a.m. because the disk is full if no business impact is actually being measured. If you can see that this affects you in no way: the revenue is still fine, user registrations are fine, the carts are happy, the messages are going through; whatever it is that your service delivers, as long as it's still delivering, a full disk doesn't necessarily mean you have to wake up at 2 in the morning. You can deal with that when you get into work the next day, or when you get done shoveling the snow out of the driveway because you can't get into work the next day.

So that's it. Thank you, everybody. I hope you are all now awake. I hope you'll think about these things within your organizations. And if you have questions, I'll be around; there'll be open spaces, all that good stuff. Thanks, everybody.