I think we're just going to go ahead and get started; it does look like there's still a line outside. Just to let you all know who I am: I'm Julie Gunderson, a senior reliability advocate at Gremlin. I'm an organizer of DevOpsDays Boise, I'm an avid mushroom hunter (and yes, those are morel mushrooms), and I arrived in Spain by plane. My name is Kerim Satirli. I'm a senior developer advocate at HashiCorp, a huge fan of aerial photography, or in other words, one of those very annoying guys with drones. I used to be a conference organizer, but gave up on that. Thank you, COVID. I also arrived here by plane. So, I know we all had lunch, but quick show of hands, since movement is very important to help with digestion: who came here by plane? Raise your hand. Wow, jeez. Okay. Oh, wow. 83%? Something like that. Yeah, we'll go with 83%. It doesn't truly matter, because metrics are always a lie. But flying is obviously one of the safest ways of traveling, and yet you still might encounter an incident or two. Sorry, an unscheduled event; my airline does not want to call it an incident. In IT ops, we'd refer to it as an outage, as an incident, but to keep everyone's stress level down, especially after seeing so many hands up, we're going to avoid harmful language like that. At least until you have to get on a plane to go home. So, picture this, and it's purely hypothetical, this never happens: your plane is on a landing approach, like the plane behind us. This is the Amsterdam tower; we tried to find a sky that reflects Valencia's at night, but it's very hard, because Amsterdam is usually very gray. The crew on that flight deck just reconfirmed the correct runway. This is good, because planes do sometimes land on the wrong runway, or actually on a taxiway. They also got the latest relevant metrics: wind direction, temperature, humidity, all the things that make a plane really unhappy and that you, as a passenger, generally don't notice. Short crackle on the mic; sorry, we couldn't afford the sound effect for that. And unbeknownst to the pilots, they find themselves in a NORDO situation. Now wait, Kerim, what is a NORDO situation? When a plane has a NORDO situation, it has a no-radio situation. These acronyms are not very creative; they just remove a couple of letters to make it shorter. Hey, it works. But imagine this: the plane has lost voice comms. This is not a critical failure, but it is not very convenient when you're on a landing approach. And planes, like our multi-cluster, multi-cloud, often obviously multi-continent deployments, are part of massive, massive, truly massive engineering projects. They're usually seriously complex, usually overly complex, and the same holds true for planes. So here we have a pilot and her first officer, and they're not panicking. That's a good thing, at least it is for me. They don't panic because they appreciate that complex systems fail, and they don't panic because they're actually practicing for these scenarios all the time. Now, I will take a step back: when you do see your flight crew all of a sudden start to buckle up, well, maybe that's when you should pay attention a little. But through all of this, they know what to expect in these situations. These scenarios are so expected that they're often referred to as normal accidents. Meanwhile, though, at the ATC tower, a controller has just started a debugging protocol, and we'll join them shortly on that whole process.
But before we really get started: over the past few days, and while some of you were coming into the room, actually, we asked almost 100 attendees one very important question: what does observability mean to you? And thanks to you, we nearly hit 100; we hit 97 at this point. When answers came in as part of a theme, we grouped them together, through the power of Google Sheets and copy-pasting. This is really what metrics work is like. And honestly, we got some pretty amazing answers. I know we had somebody ask what the most outrageous answer was, so I would say, you know what, let's run this BuzzFeed style, because you won't believe number five. I would tell you that number seven is the real killer, but we couldn't afford seven; slides have to be 16 by 9. So, one of you answered that observability is akin to a surveillance state, and we're just going to have to come back to that one in a little bit. Now, more than one person, maybe with the title of sales or account executive, suggested that observability is just an add-on SKU that they sell, and it gives you that automated visibility. So all you have to do is install observability.exe, double-click, and everything is instrumented. Super easy. Not sure how that works in a container context, but it's great. Almost a quarter of you actually took this question very seriously, and we appreciate that: you gave answers that we could group as graphs or traces. I like that, because we started to really get somewhere with those answers. And then almost a third of you know how we're fudging the metrics all the time, making them look good for the story that we need. We had people who basically concluded that "oops, we didn't test that." Not everyone said it this way, but to the one person, who I truly hope is here, who shakily told us that we don't actually test for everything: you are fine, we're all there. If I asked everyone here to raise their hand if they have 100% test coverage, you would not see a single hand. Wait, let's actually ask. Raise your hand if you have 100% coverage. Okay, good. I told you. So, four pretty strong answers, or definitely polarizing ones, we'll go with that. Let's look at them a little bit more. We promised you that we'd get back to the surveillance state, and we were at a little bit of a loss: either somebody was being snarky, or brilliant, or this is just a really inappropriate project codename. We just want to tell you that if you feel like this at your organization, both Gremlin and HashiCorp are hiring, so you can visit our websites or our booths. Yeah, automated visibility is clearly marketing. You can unlock that for 39 bucks per pod per month with a one-year contract. Pay up front? Yeah, of course, I want to lock in that 3%. Okay, good. Yeah, I've got to get that discount. Moving on to the more serious answers: graphs and traces are what we would call the known knowns, the things that we seem to understand and are aware of. Think of this as the 2xx range of answers that your API or service might give, or the 4xx range, things where it's very clear why they are broken. And we use the word "thing" a lot, because this is true for any engineering project, even though the status codes are HTTP-specific. Then there are the moments where you can't do better than respond with a 500, or, if you're that kind of person, a 400, so you can just blame the end user. These are the things that we are aware of but don't fully understand: the known unknowns. So let's talk a little bit about this journey.
Before our plane can lose contact, we first need to have a plane, we need to have communication protocols, and we need to have a tower, optimally speaking. Obviously, if you have a field that you want to ditch your plane in, or a river, like what happened in 2009, that's also very possible. But you are going to go through a very long process, and you want to think about this process as a stand-in for any engineering product. We start building a thing, and when we start to build that thing, the first thing that we need to do is gather the requirements. We want to understand our customers' needs, we meet with those stakeholders, and we document the outcomes. We actually go through quite the process for this, but we just don't have an entire day to talk about it. Once we've done that research and once we've established those timelines, then we start building that thing. So in the case of that plane, we know how many passengers it needs to hold, we know how far it needs to go, and what the fuel requirements are. Those are the things that we are aware of, that we know. Now, if this kind of engineering project is of interest to you, you can find directions on how to build a plane on wikiHow, with pictures, and it will walk you through that whole process. And I know many of us are engineers, and you'd go, hey, building a plane is really, really tricky. There is a lot you can smudge around and make look good; as long as the FAA approves, you're good. So now it's time to take off. It's launch day, you've worked with all of your marketing teams and all those other teams, you're super excited, that thing that you've built is ready to release, and... in the age of full-service ownership, that's where the real problem starts, because you built it, you released it, and now you own it. I mean, I'm not a huge fan, because obviously that means I should have had 100% test coverage. But before you can take part in that service disruption, you're not actually going to experience anything until you know, until you've detected, that there's an outage. And just because you haven't detected an outage, because your tooling hasn't detected that outage, it doesn't mean that your customer hasn't detected it. Remember our pilots? From the second the air traffic control tower lost contact with them, their customer started a countdown: five minutes or less to resolving that issue, or else. Or else, indeed. So we want to think about that, and that's what we're talking about today. With observability, are we able to understand, can we look from the outside in and see, that we are experiencing an incident? The "or else" that Kerim was mentioning is, in the United States, two Air Force-trained observability specialists, otherwise known as fighter jet pilots. If that contact is lost for long enough, two pilots will physically fly up and flank the plane to observe, from the outside in, that it's still there. Not an ideal situation. Also, if you are on a plane and you do see two fighter jets, don't worry, it's actually part of the process, so it's okay. But once you've detected the incident, that's when you can actually start working to restore that service, and that's excellent. So we're going to walk through how we get there. Exactly, and when you see those fighter jets, it's totally okay to switch airplane mode off. I'd just be trying to get those text messages out quick, fast. That's actually why we lost radio contact, by the way: because one of you, or all of you, did not have airplane mode switched on. Come on. It does actually cause problems sometimes; we can talk about that later, after the flight.
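As an aside, that outside-in check has a direct software equivalent: black-box probing, where something outside your own infrastructure watches your service the way a customer (or a fighter jet) would, instead of trusting internal dashboards. Here is a minimal sketch of the idea in Python; the endpoint URL, probe interval, and failure threshold are hypothetical placeholders, not a recommendation of any particular tool:

```python
# Minimal outside-in health probe: detect an incident the way a
# customer would, rather than trusting internal dashboards.
# The URL, interval, and thresholds are hypothetical placeholders.
import time
import urllib.error
import urllib.request

SERVICE_URL = "https://status.example.com/healthz"  # hypothetical endpoint
TIMEOUT_SECONDS = 5
FAILURES_BEFORE_ALERT = 3  # avoid paging on a single blip

def probe_once(url: str) -> bool:
    """Return True if the service answered with a 2xx in time."""
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError):
        # Covers connection failures, timeouts, and HTTP error statuses.
        return False

def watch(url: str) -> None:
    consecutive_failures = 0
    while True:
        if probe_once(url):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= FAILURES_BEFORE_ALERT:
                # In real life: page someone. Here, we just print.
                print(f"ALERT: {url} failed {consecutive_failures} checks in a row")
        time.sleep(30)  # probe interval

if __name__ == "__main__":
    watch(SERVICE_URL)
```

The consecutive-failure counter is there so a single dropped packet doesn't page anyone; how many blips you tolerate is a judgment call tied to your own service level objectives.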
So, remember, we owed you one more answer, and this was the one that we got from most people: observability is really the information you didn't think you needed, but that could actually solve your problem. These are the unknown unknowns. It's easy to build for the happy path, even for the imagined unhappy path, but there's always the stuff that you can't truly anticipate until you've experienced it and it is no longer unknown. Generally speaking, these have been classified into what is known as the three pillars of observability, so we're going to talk through that a little bit. First, we start with logging, and that's an important part of your telemetry toolbox, because log collection and collation can give you more detail about what the code in your application has been doing. It gives you information like calls to remote dependencies, query and request times, errors for bad data, and timeouts. One thing that's important to note here: set some guidelines for your logging, so that everybody is logging in the same way. Exactly. Code is poetry, and your error messages should be haikus that help you understand what's going on. Put the things in there that make sense for you, and keep them consistent. If your system is complex enough, which obviously it is, put in numbers that you can track in terms of error codes: a 400 tells you that the client didn't provide great data, but it doesn't necessarily mean that nothing they provided was good, and the status code alone won't tell you which part failed.
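To make those guidelines concrete, here is a minimal sketch of structured logging with a consistent set of fields and an application-level error code next to the HTTP-style status. The field names, the "checkout" service name, and the ORD-4002 code are all hypothetical; the point is that every team emits the same fields, the same way:

```python
# Minimal structured-logging sketch: one JSON event per line with a
# consistent set of fields, so logs from every service collate cleanly.
# Field names here are hypothetical; agree on your own and keep them stable.
import json
import logging
import time
import uuid

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        event = {
            "ts": time.time(),
            "level": record.levelname,
            "service": "checkout",  # hypothetical service name
            "trace_id": getattr(record, "trace_id", None),
            "error_code": getattr(record, "error_code", None),
            "msg": record.getMessage(),
        }
        return json.dumps(event)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# A 400 alone doesn't tell you *which* client input was bad, so we add
# an application-level error code next to the HTTP status.
logger.warning(
    "rejected order: quantity must be positive",
    extra={"trace_id": str(uuid.uuid4()), "error_code": "ORD-4002"},
)
```

Once every service logs like this, collating logs across a distributed system becomes a query instead of an archaeology project.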
Now, that said, we also have tracing, and tracing helps you deal with the complexity of your distributed systems. Look, there are lots of tracing tools out there, and your tracing tool of choice is going to help you understand where a single user request goes, and the subsequent requests it creates throughout your system. Following a single user through a distributed system with multiple dependencies is incredibly complex, but it gives you insight into how your application is behaving within that web of both upstream and downstream dependencies. Apologies, I was handling a different request there. With that said, going back to the start of that incident you were experiencing: how did you know how severe it was? How did you know if you should wake up Patrick at two o'clock in the morning on a Tuesday, or at three o'clock in the morning on a Saturday after he'd been hanging out at KubeCon for a whole week, or if it was something that could wait until a normal workday? When we talk about measurements, we need to understand how our system is working and how severe those incidents are, but also: if you can measure it, you can manage it, and if you can't manage it or interpret it, you can't harden it. It's tricky. It is very tricky. Trace, log, measure: those are your three pillars of observability. Different products will name them differently, and words do matter, but the gist remains the same. So let's go back and look at that timeline of the incident. If we look at the orange line here, it represents this extended period of time where we didn't even know we were in an incident: the time to detect, or TTD, right? And that could have resulted from many things. Maybe our monitoring solutions weren't working, maybe they weren't picking it up. What we need to know is where our systems are having those interruptions, and that comes back to observability. With that time to detect being so long, we want to shorten it, and we want to shorten it with observability, with monitoring, and then with experimentation, because we can't restore any of our services if we don't know that they're broken first. When we talk about experimentation, we're really talking about science here. Chaos engineering might be a little unfortunately named; people tend to get a little freaked out, especially C-levels. They don't love hearing "chaos", they don't love hearing "failure". But when we talk about chaos engineering, we do follow the scientific method. We start with observation and baseline metrics, because it's important to understand how your systems behave now. Some of you might call that a steady state; I prefer a nominal state, because your systems are going to vary depending on what you're experiencing. If you are a retailer, for example, your systems would be in one state on a normal day and in a different state going into the holidays, or into a Cyber Monday or Black Friday situation. So you really want to understand at least some baseline metrics of your system. And if you don't have those, it's okay; don't let that be a barrier to entry to start practicing experimentation. Just start doing some observation, make a note of it, and then you can use your experimentation to validate the metrics that you're collecting. Moving on: we now know what our system looks like in that nominal state. Based on that, we're going to formulate a hypothesis. We're essentially going to say that this thing works the way we think it does, the way we expect it to. After we've formed that hypothesis, we're going to design an experiment around it. And when we design those experiments, it's really important to understand the blast radius. The blast radius, if you think about it, is how big that experiment is going to be: how much of your system is going to be impacted. Or how much of your weekend. Oh yeah, or how much of your weekend, indeed. When you're starting out, it's okay to start small with chaos engineering. I am absolutely guilty of having said in the past that you could only practice chaos engineering in production, and that's not true. You can practice chaos engineering in development, in staging, and then move to production, because it's like how we work with code: you move up the chain. But if you're going to run, say, a CPU attack, maybe don't run it at 100%. Start with 5% consumption, then build on that with 10%, 15%, because you'll learn more about your systems and you'll be able to control the outcome. One of the other things you want to have in place is abort conditions. This is when you say: oh, we just learned something about our system that we didn't expect, and we'd better stop this experiment. Netflix famously uses stream starts per second as that metric, a metric that is a direct reflection of their customers. When they run experiments and see stream starts per second begin to drop, that's when they halt the experiment.
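What might that look like as an experiment loop? A minimal sketch: ramp the blast radius in small steps and halt the moment a customer-facing metric crosses an abort threshold. The inject_cpu_load and customer_metric functions here are hypothetical stand-ins for whatever your chaos tooling and observability stack actually provide:

```python
# Chaos-experiment loop sketch: ramp CPU load gradually and stop the
# moment a customer-facing metric crosses the abort threshold.
# inject_cpu_load() and customer_metric() are hypothetical stand-ins
# for your chaos tool and your observability stack.
import random
import time

ABORT_THRESHOLD = 0.95   # halt if the metric drops below 95% of baseline
RAMP = [5, 10, 15, 25]   # CPU consumption percentages: start small

def inject_cpu_load(percent: int) -> None:
    """Placeholder for your chaos tooling's CPU attack."""
    print(f"injecting {percent}% CPU load")

def customer_metric() -> float:
    """Placeholder: e.g. stream starts per second, relative to baseline."""
    return random.uniform(0.9, 1.05)

def run_experiment() -> None:
    for percent in RAMP:
        inject_cpu_load(percent)
        time.sleep(1)  # let the system settle; minutes in real life
        observed = customer_metric()
        print(f"at {percent}% load, customer metric at {observed:.2f}x baseline")
        if observed < ABORT_THRESHOLD:
            # In real life, also roll back the injected load here.
            print("abort condition hit; halting experiment")
            return
    print("hypothesis validated at every step; document and share it")

if __name__ == "__main__":
    run_experiment()
```

Halting is only half of the abort condition; pair it with a rollback so that stopping the experiment also removes the injected load.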
And that matters because, while we are all having fun playing around with our systems, we have to remember that there is somebody on the other end who is, ideally, trying to use them. We want to keep the user in mind; this is all about making their experience better. And now that you've run your experiment, you can just walk away, right? You're done. Well, no, because you want to take a look at those results, and there will likely be one of two outcomes. Either you validated your hypothesis, and your systems are reliable or fail gracefully, or you found something in your system: a potential failure point, luckily before your customer did. So analyze those results, because then you can either have confidence that your system works the way you expected it to, or you now know what you need to do to harden your system. Also, please do make sure to document this. Did you want to say something? You definitely want to document everything you're doing, to understand what is happening, how it's happening, and how it differs from the initial hypothesis. And I do promise Kerim will get to talk here in just a little bit. No, not to worry. With chaos engineering, learning is a key outcome, and there are a couple of ways that we learn, which is why it's important to share the results with your organization. First of all, what we learn on our teams, we should share with other teams, so they can also learn how to harden their systems. But we're also learning as an organization, developing a culture where we embrace failure, a culture where we want to learn from each other. The way we do that is we share: we share wins, we share failures, we share learnings. Your learnings might seem small to you, but they might be big to the organization. Many of you might not be early-career engineers or operations folks, and you might not have done an internship in a while, but other people are doing that right now. Help them understand the things that you wish somebody had taught you at that point. Help build that culture, no matter how big your organization is; small contributions like that go a long way. As we learn from failure, we generally want to look at four signals. In Google's Site Reliability Engineering book, these would be called the four golden signals. If you like a different color, totally fine. Before we continue, a word of warning, and we tried to make this stand out a little bit: learning together is awesome; learning because you caused a huge outage, less awesome. So please, please, for the next few slides, understand your recovery plan. If you go, "we don't have one", do not continue. That's okay; start thinking about your recovery plan. Maybe that's just a new deploy and everything gets resolved automatically; maybe it's a little more involved. And then make sure your backups are known to be good. We've all been there: your backup software returns a green check mark, and you think, that's awesome, and then you find the file has zero bytes. It's not actually a backup, but the file name checked out. So let's talk about latency a little bit: the time it takes to serve a request. For example, with our plane: are the voice comms between plane and tower just delayed, or is something else going on? There are a couple of ways you can simulate this; here's the simplest one as a sketch, and then we'll walk through a few more.
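A minimal sketch of that programmatic version, assuming nothing more than the standard library: a decorator that makes any dependency call sleep for a random interval before answering, so you can watch how callers, retries, and timeouts behave. The function names and the jitter range are hypothetical choices:

```python
# Simplest possible latency injection: a wrapper that delays calls to a
# dependency so you can watch how callers and timeouts behave.
# The jitter range is a hypothetical choice; tune it to your SLOs.
import functools
import random
import time

def with_latency(min_ms: int = 100, max_ms: int = 500):
    """Decorator that sleeps a random interval before the real call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = random.uniform(min_ms, max_ms) / 1000.0
            time.sleep(delay)  # the injected delay
            return func(*args, **kwargs)
        return wrapper
    return decorator

@with_latency(min_ms=200, max_ms=2000)
def call_tower():
    """Stand-in for any downstream dependency call."""
    return "roger, runway confirmed"

if __name__ == "__main__":
    start = time.monotonic()
    print(call_tower())
    print(f"took {time.monotonic() - start:.2f}s")
```

The same idea scales down to a one-line sleep in a handler or up to proper fault-injection tooling; the decorator just keeps the delay easy to add and, more importantly, easy to remove.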
Injecting delays programmatically like that is super easy; if you're an early-career engineer, that sometimes comes as part of the job. I've definitely been there. You could build in some sleep commands, or whatever your language of choice offers. You could also change your DNS and network settings, fuzz with the timeouts, black-hole certain connections, and make routing really, really complex. If you're codifying your infrastructure, this is very easy to establish and to make go away again, which is a good thing. And then a super fast way to introduce latency, no pun intended, actually, is switching geographical zones. Let's say you're, by default, deploying to Central Europe, and all of a sudden you're deploying to a zone in Asia. You are definitely going to have some latency. See how your system responds to that, if it responds to that, and go from there. And then of course, as you're investigating how your system behaves, you're going to be looking at error rates. They give you an understanding of how many requests your system fails to complete correctly. Note that I use the word "correctly", because not every request has to come back with a 200 OK; sometimes it's okay if you lose a certain percentage. Error rates are fairly easy to simulate. Terminate a service or two; again, have a recovery plan and understand how to bring them back. Delete or revoke some access credentials; that's a personal favorite. If you have automated software for that, for secrets management, expire those keys early. Change a policy so that tokens are not vended for 15 minutes, but for 15 seconds, and see how your application deals with that. And of course, another great way to create an error is to change the clock. Time is a relative concept; you'll know this if you've enjoyed running Jira and encountered the leap second. This is not a snide remark about the latest outage; this one is from 2013, but it still hurts. And actually, I just want to pause here. This time thing is an error that we would think, okay, our systems can handle that. I just have to let you know that every time we have daylight saving time in the States, my bank card doesn't work. Luckily, I'm too lazy to change all my payment methods on Netflix and Hulu and move to another bank, but we have to remember that our customers can make changes like that. I'm actually glad the European banking system is a little more stable. So yeah, when we talk about traffic, we refer to the demand that's being put on a system. This isn't exclusively HTTP traffic; anything that goes over the wire counts. Maybe you have an Internet of Things setup, maybe all your stuff goes through the ether. All of this is traffic, and eventually it can catch up with your system. So create some spikes. For HTTP traffic, that's super easy: use something like ab, or Newman if you've built APIs. Just spawn tons of instances and see how your APIs respond. See how your web application firewall responds, whether it actually blocks your own requests, and then disable those blocks, because your attackers will do the same thing: they will figure out ways around it and will not be caught by the first level. So don't rely on just a single safety net. If your system uses a load balancer, change the weights you're using. Instead of deploying a setup with three equally weighted zones, shift all the traffic to a single zone. This will give you enough traffic to test with, hopefully. If not, you might want to ask why you have three zones in the first place.
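If you don't have ab or Newman handy, even a few lines of code will produce a useful spike. A minimal sketch, with a hypothetical target URL; point something like this only at systems you own, and have that recovery plan ready:

```python
# Tiny traffic-spike generator: hammer one endpoint with concurrent
# requests and report status-code counts. The target URL is a
# hypothetical placeholder; only aim this at your own systems.
import collections
import concurrent.futures
import urllib.error
import urllib.request

TARGET = "https://api.example.com/v1/ping"  # hypothetical endpoint
WORKERS = 50
REQUESTS = 500

def hit(url: str) -> int:
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code      # e.g. 429 if rate limiting kicks in
    except urllib.error.URLError:
        return -1            # connection-level failure

def main() -> None:
    counts: collections.Counter = collections.Counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as pool:
        for status in pool.map(hit, [TARGET] * REQUESTS):
            counts[status] += 1
    for status, n in sorted(counts.items()):
        print(f"status {status}: {n} responses")

if __name__ == "__main__":
    main()
```

Watching the status-code distribution shift, with 200s giving way to 429s, 503s, or outright connection failures, tells you which safety net engaged first, and whether it was the one you expected.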
Then for saturation, we're looking at system utilization and the constraints our system is experiencing. This is a tricky one, because it is very, very specific to your system. We had somebody come up to us yesterday and ask what a good CPU target is for an architecture they were planning, and honestly, it's a guess, right? It's everyone's guess. The only real common thread across architectures is that you're not only going to experience outages at 100% saturation; you'll usually have problems at 70 or 80%, which is why we usually over-provision. Simulating saturation is actually quite informational. You could start out by altering the scaling logic of your orchestrators: delay scale-outs to create hot zones, not the COVID kind, and force more load onto a single system. Or you could fill up disks with randomly generated data; if you've ever played with Linux, you've probably found ways to do that. And of course, there are plenty of different approaches. For the person who suggested observability.exe: you could use consume.exe to just completely burn through your Windows CPU. Sorry, your CPU on a Windows system; obviously it's the same CPU.
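The disk-fill approach lends itself to a short sketch too: write random data until a target usage level is reached, then clean up. The scratch path and the 85% target are hypothetical choices, and the cleanup step is the built-in recovery plan; run something like this only on a disposable volume:

```python
# Disk-saturation sketch: fill a scratch directory with random data in
# 100 MiB chunks until a target usage level is reached. Run this only
# on a disposable volume; the path is a hypothetical placeholder.
import os
import shutil

SCRATCH_DIR = "/tmp/chaos-fill"   # hypothetical disposable location
CHUNK_BYTES = 100 * 1024 * 1024   # 100 MiB per file
TARGET_USAGE = 0.85               # stop at 85%: problems start before 100%

def disk_usage_fraction(path: str) -> float:
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def fill(path: str) -> None:
    os.makedirs(path, exist_ok=True)
    i = 0
    while disk_usage_fraction(path) < TARGET_USAGE:
        with open(os.path.join(path, f"fill-{i:05d}.bin"), "wb") as f:
            f.write(os.urandom(CHUNK_BYTES))
        i += 1
        print(f"wrote chunk {i}, disk at {disk_usage_fraction(path):.0%}")

def cleanup(path: str) -> None:
    """The recovery plan: delete everything we created."""
    shutil.rmtree(path, ignore_errors=True)

if __name__ == "__main__":
    try:
        fill(SCRATCH_DIR)
    finally:
        cleanup(SCRATCH_DIR)
```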
So: latency, errors, traffic, saturation. Those are the four primary signals you're looking for, and while we present them as individual things, they're very much connected. If you shift your traffic from two instances to a single instance, you're going to get more disk usage, you're going to get more saturation, your disk is going to get fuller, and eventually you're going to get errors, because your application is no longer able to write. We're talking about this in an instance setup, but all of this holds true for pretty much any setup you can imagine. The whole point of this is getting to the point where customers don't even know we're having an issue, because there have probably been many times when you were on a plane, sitting in the back, eating your mushroom-crusted chicken, and you didn't know that there was a communication issue happening. So when we look at that through everything we just talked about, observability and monitoring and chaos engineering experiments, we're looking to reduce that time to detection, because, as I mentioned at the beginning, if you don't know that there is an incident happening, how can you even start to restore the service? A big part of this is understanding our systems better, but also building that culture. One thing that we talk about is game days. With game days, you're practicing chaos engineering experiments, but not just with your team; it's always a good idea to include at least one other team in the organization, because you're going to learn how their systems work, they're going to learn how your systems work, and you're also going to learn how to work together and have that cross-functional communication. So, deep breath, we're nearly at the end, and I want to give you some takeaways. First of all, everything we talked about today is tool-agnostic. It doesn't matter if you're in a container context or not, and it doesn't matter if you're on physical hardware or virtualized hardware, which is just more expensive physical hardware. Complex systems are ultimately built by humans, and those humans are the primary contributor to burning down your error budget. But those same humans can also be the primary contributor to preserving your error budget and allowing you more time to experiment. We gave you this heads-up before, but everything you build should be codified. It is so much easier to look at code, in the broadest sense of the word (a protocol is also codification), and understand how to restore your system. When you codify, you're removing ClickOps, and that's a good thing. Yes, I work at a vendor that has software for that; this is not about that. It's about making sure you can undo things that are broken. And when you test things, employ a proper method. We're not trying to just randomly go unplug things in a data center. Remember, we're walking through scenarios, we're walking through our hypotheses, and we're very careful about how we test our systems so that we can harden them. And finally, none of this is really a technical problem; this is very much a people problem. So remember what we said: teach that new engineer, teach that new person on your team how things work. Create that culture of reliability, that culture where it's okay to fail, where it's okay to ask, and where it's okay to learn together. None of us knows as much as all of us together, and building reliable systems and better systems starts with building a better team. And with that, just remember: words matter, and words like "chaos" are scary. They're about to wave us off the stage, play that exit music, so with that, let's go ahead and tell them where to find our slides and the books that we like. Yeah, we'll definitely do that. Summer is coming, so I know you're ready for some vacation reading. One book that we like to talk about a lot is Normal Accidents, in which Charles Perrow dissects why complex systems constantly inch closer to the next incident. Not fun reading, but very informational, and not your traditional IT book. If you liked The Phoenix Project, Fatal Defect is written in a similar-ish style. Very nice. It also helps you understand complex systems, but very much in an IT sense, and there are some really scary insights there. One of my favorite books, which Gene Kim co-authored but which was led by Dr. Nicole Forsgren, is Accelerate, and it actually talks about the importance of chaos engineering, the importance of culture, and the importance of game days and learning. And then I think we were asked to put this in there, so please enjoy. Definitely take pictures of this. I see you putting your phones down; that's not okay, keep them up. You're going to need them for the next slide, though. Hit us up at our booths. We have socks at the HashiCorp booth; super nice when it's not this balmy. And if you want a copy of these slides, we'll upload them as soon as we get Wi-Fi back. We also want to let you all know that the plane did land safely. Pilots do actually have ways to communicate without a radio, with lights, with tipping the wings, to signal that everything is okay. Oh, and a transponder. So the last learning that we want to give you from that: don't have just a single safety net. Build for failure. Build for systems that will constantly fail, because eventually they will, and you can't always predict where the failure will come from. I'm Julie Gunderson; you can find me at Julie underscore Gund on Twitter, or at the Gremlin booth. And I'm at ksatirli on Twitter; my name is Kerim. If you want to talk about Asian whiskies or baking, I'm happy to talk about that, or otherwise infrastructure as code, I guess. And with that, thank you so much. I don't know if we have time for questions. Thank you.