It was 5 a.m. on a Sunday morning, and I'd been up since midnight. It was my third month at Twilio, my first time on call by myself for my team, and something was wrong with our monitoring system. It said that a third of the hosts in one of our regions were down, but they weren't down. And I watched as team after team logged on, checked their hosts, silenced the alerts, and went back to sleep. But I was cloud operations. I was supposed to know how this worked. I was supposed to fix it. And by that time, I was tired. I was lost. I was confused. I couldn't reason about the network anymore, couldn't reason about the relationships between things. I was done.

Eventually the network got fixed and the system worked again, and I went on to do many, many more on-call shifts. But I didn't forget that feeling of being lost, of being confused, of feeling powerless. I wanted to make sure other people didn't feel that.

I'm James Burns. I'm from Twilio, where I work on the Insight Engineering team, and I help Twilio developers understand their services: hundreds of services, thousands of hosts. I'm here to talk to you about two techniques you can use to empower your developers, or empower yourselves, to understand your distributed systems.

So first, why do we care? Twilio is a cloud communications platform: we let developers interact with their customers over voice, SMS, video, and messaging. These are some of the customers we have. Some of them you might recognize, some not. But these are people who, when their customers want to talk to them, or when they want to talk to their customers, expect that to just work. They expect it to be like the telephone. With the old telephone network, you pick up the handset and you get a dial tone; with the new one, you have signal bars. It's supposed to just work, all the time, five nines.

I mentioned practical distributed systems, so let's talk a little more about what that is. I'm going to propose that a practical distributed system is observable and resilient. Two words; let's dig into what they mean.

For our system to be practical, for it to be usable, it needs to be observable. At a very high level, that means you need to be able to see whether the system is doing the thing you thought it was supposed to do, and if it's not doing that thing, you need to know why and in exactly what way it's failing to do it. As Bruce Wong, one of the chaos engineering thought leaders, says: if you can't see failure, you can't fix failure. And if you can't fix failure easily, we have a problem.

Adrian talked about a few of the ways things are made observable. The usual ways are metrics, logs, and exception reporting; those are how you make a system inspectable.

As a story: one of the systems I work on is a low-latency system where we take data from one place and move it to another, doing some simple transformations along the way. The way we made that system observable was to look at what sorts of metrics we needed to understand whether the system was actually providing the latency it was supposed to. This was a sharded system, so we asked, shard by shard: are all the shards performing correctly? Is the end-to-end latency the same? Are things actually doing what they're supposed to do?
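To make that concrete, here's roughly what per-shard latency instrumentation can look like. This is a minimal sketch in Python with a statsd-style client, not our actual code: the agent address, the metric name, and the transform_and_forward helper are all assumptions for illustration.

```python
import time

import statsd  # pip install statsd; assumes a statsd-style agent on localhost:8125

stats = statsd.StatsClient("localhost", 8125)

def transform_and_forward(record: bytes) -> None:
    pass  # hypothetical stand-in for the simple transform-and-forward step

def process(shard_id: str, record: bytes) -> None:
    start = time.monotonic()
    transform_and_forward(record)
    elapsed_ms = (time.monotonic() - start) * 1000
    # Record latency keyed by shard, so we can ask "are all shards performing
    # the same?" rather than staring at one global average.
    stats.timing(f"pipeline.latency.shard.{shard_id}", elapsed_ms)
```

The point of keying the metric by shard is that a global average hides a single bad shard; per-shard timings make that divergence visible immediately.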
And then the system also needs to be resilient. What that means is that it needs to keep doing the thing you told it to do: it needs to keep working as much as possible, even when the things around it aren't working.

Same system: the other side of it talks to a different API, and that API had availability challenges. One of the things we did, since we had made the system observable, was look at the data and realize that when the API was down, when it wasn't working, any request that took longer than 3.5 seconds was never going to succeed. That request is dead; we should just abandon it and try again. We also noticed that when the API was healthy, only 0.001% of requests would be incorrectly retried under that rule. So we implemented it and put it out in production, and we found we could experience 80% error rates and our system would still work. It was resilient against the system around it not working.
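The real system wasn't Python, but here's a minimal sketch of that abandon-and-retry idea using the requests library. The URL and the retry count are assumptions for the sketch; the 3.5-second deadline is the number we measured.

```python
import requests

DEADLINE_S = 3.5  # measured: requests slower than this never succeeded anyway
MAX_TRIES = 3     # assumption for the sketch

def post_with_retry(url: str, payload: dict) -> requests.Response:
    last_err = None
    for _ in range(MAX_TRIES):
        try:
            # Abandon any attempt that blows the deadline. (requests applies the
            # timeout to the connect and read phases; close enough for a sketch.)
            return requests.post(url, json=payload, timeout=DEADLINE_S)
        except requests.exceptions.RequestException as err:
            last_err = err  # this attempt is dead; try again
    raise last_err
```

Note the trade-off we accepted: retrying this way is only safe because a falsely retried request (the 0.001% case) was harmless downstream. If your writes aren't idempotent, you need deduplication before you add retries.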
Now, these are all properties of regular systems; people have been building practical systems for a long time. Your UNIX process is observable. Your Windows process is observable. So what's the difference between that and a practical distributed system? Why is this a problem so many people have now?

To start answering that question, we have to look at the journey systems take, from simple monoliths to microservices. That journey starts simple: a single service that implements something people need, something people want. And that service is observable in the usual ways. You use top, you use vmstat, you use the system statistics to understand what your service is doing, and you probably also have logs. Is it resilient? Normally you can just use a supervisor pattern: restart the service and it does the same thing it did before. It doesn't have a lot of internal state, so it's really easy to scale it and make it resilient.

If that thing works, after a while it becomes a shared code base. At this point you have multiple developers working on it, and those developers are doing different things. One famous example, which came up in the last talk as well, is log standards: everybody has a log standard. It's the xkcd situation: there are 12 log standards, and then somebody says we need a new log standard, and that's number 13. This happens even in small shared code bases, where different people insist: no, my logs need to be this way. You put these different pieces of code together, and they're still in a single process, still working together, but they're starting to diverge. Is it resilient? In similar ways: you can probably still safely restart it, safely scale it out, but there's starting to be more state wrapped up in that process.

And eventually, if you're successful, it becomes a complex shared code base. You have lots of developers working on the same thing, trying to coordinate; you have merges that take hours, ugly branching situations, and fragility. The system starts to have issues with change, because it has become complex and people can't understand what's actually going on inside it. Is it observable? Sort of. You still have a single process, running on one box, doing all the work in a single place, so system metrics still work. But in terms of resilience, there's a lot of state caught up in it, so you can't just restart it. What was in flight? What's going to happen when it receives new data formatted a different way?

So people go microservices, and microservices are magic, right? All of a sudden you're decoupled, everybody can deploy whenever they want and write in whatever language they want. You're in Nirvana: microservices, the solution to all my problems. Not quite.

Let's look at what happens with observability. Before, you could do something like this to understand your system: this is a flame graph. If you haven't seen one, look at Brendan Gregg's blog posts on it. What it does is tell you, end to end, from the kernel all the way up to your application stack, what's taking time. And it's actually super easy to run; I figured this out just a couple of weeks ago, and it helped me find a serious performance problem. But you can do this when you've got a monolith, when you're running a single process. How would you do a flame graph for microservices? Run flame graphs across ten different boxes, in different places? It just doesn't work.

Metrics might still work, but again, you start having divergence. As all these teams start deploying separately, they start saying: I'm going to collect this kind of metric here, and these kinds of logs there, and it becomes really hard to put together a story about what's happening across all your different services, to answer the important question, which is: how are my customers experiencing my service? Am I meeting their needs or not? That becomes quite hard.

And what happens with resilience? The first thing is that these pieces are now connected by networks, and if you didn't know, I'll tell you a basic truth: networks fail. They fail all the time, and they fail in really subtle ways. And often your application can't see it, because it's talking to a framework, which is talking to a socket, which is talking to the kernel, which is talking to the network. Across all those layers you can't see what's going on, by design, because it's supposed to be an abstraction; but when it fails, that can be really messy. So you end up having a lot of network failures, all the usual distributed-systems sorts of problems. The effect of failure between one system and another is also harder to reason about: if something three steps away from you fails, it can still mean that the particular service you're responsible for just doesn't work anymore. And in terms of scale, if you scale your service up to meet the needs of someone requesting things from you, you can wipe out everybody downstream of you, because they didn't know you were going to throw 10x more traffic at them.

So let's look at the first tool I promised you to help address this: distributed tracing. Distributed tracing, as you saw in the previous talk, is the ability to measure what happened to a particular request: who touched it, for how long, across however many different systems. And you think: well, I've got my answer. This makes it easy, right? Distributed tracing is going to point at what's wrong, and then I'm just going to go fix it. But after you spend some time with it, you realize pretty quickly that distributed tracing provides the what, but not the why. It will tell you what got slow. It will tell you, in broad terms, what went wrong. But it won't tell you why, and it won't tell you about all the systems-level problems that may have contributed.
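To make the mechanics concrete: under the hood, a trace is just an ID that every service forwards on its outbound calls, so a tracing backend can stitch the per-service timings back into one end-to-end picture. Here's a hand-rolled sketch using Zipkin's B3 header names, which are one common convention; in practice you'd use a tracing library rather than doing this yourself, and the downstream URL here is hypothetical.

```python
import uuid

import requests

def handle_request(incoming_headers: dict) -> None:
    # Join the caller's trace if there is one; otherwise start a new trace.
    trace_id = incoming_headers.get("X-B3-TraceId", uuid.uuid4().hex)
    my_span_id = uuid.uuid4().hex[:16]  # this service's own unit of work

    # Forward the trace context on every outbound call, so the tracing backend
    # can connect this hop both to its caller and to the downstream.
    requests.get(
        "http://downstream.example/api",  # hypothetical downstream service
        headers={
            "X-B3-TraceId": trace_id,
            "X-B3-SpanId": uuid.uuid4().hex[:16],  # child span for this call
            "X-B3-ParentSpanId": my_span_id,
        },
    )
```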
And that was one of the real challenges when we introduced this at Twilio. Coming from a systems background, I put this out there, I'm looking at the graphs and sending them to people, saying: look, this is obviously a system-level failure in this tier. And people are like: yeah, I don't know about that. Eventually people do start developing that intuition, the ability to reason about what's going on, but they're not going to become systems experts.

So let me tell you a couple of stories about how we applied this to transform observability in our systems. This is a distributed trace. It's actually from a single system, but the way we designed it, at the top you have the total amount of time, and underneath that you have the individual functions. These are the requests to the downstream API I mentioned before: we take the data and try to post it back out, at really low latency, to the next step. Originally we designed this to trade complexity for performance, so we made the calls serially. That's the waterfall you see there: we make one call to the API, then wait for it to finish before making the next call. That worked at first; it was a good trade-off. But then we kept scaling, and it started getting slow, really slow, as we made more and more of these calls and each one took longer, especially when there were performance problems. But we had instrumented it, we had observability, and we could see exactly why we were getting slow. So we did this: we made all the calls in parallel. The longest call still bounds the total amount of time, because of particularities of the system, but we were able to gain insight, observe the system, and make decisions that made it go much faster.

A second story. We instrumented the edge API at Twilio with distributed tracing; you usually start at the edge and work inward. So we instrumented this, and the API team says: look, this downstream service is slow. The top span is the request from the customer; the two below it are directly related to authenticating the customer. And the API team goes: you know this downstream service, they're always slow, look at how bad they are at their jobs, blah, blah, blah. Not quite that mean. And I go talk to that team, and they say: but we're not slow. We give answers right away. We have the metrics; here they are.

So I developed a tool to instrument our service mesh, to create spans for when the mesh makes a request versus when the application makes a request. And then I found this. I found that the downstream team was telling the truth. The request made by the application didn't represent when that request actually went to the downstream. You can see here that this little tiny sliver is the downstream receiving the request and responding; but it didn't get the request for 800 milliseconds. These are the kinds of things you find when you instrument with distributed tracing: I had made assumptions about how the system worked, and they just weren't true. I still find surprising things every day. It's crazy. Here's another example: the downstream responded in one millisecond, and the response didn't get picked back up for another two seconds. These kinds of things happen in real systems all the time, and you just can't see them.
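That mesh trick is worth spelling out. In the real setup the application span and the mesh span come from different processes; here's a single-process sketch of the same idea using OpenTelemetry's Python API (one option among several; enqueue_for_proxy and send_on_wire are hypothetical stand-ins). The gap between the outer and inner spans is the queueing or threading time that neither side's own metrics will ever show.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("mesh-demo")

def enqueue_for_proxy() -> object:
    return object()  # stand-in: hand the request off to the proxy/mesh layer

def send_on_wire(request: object) -> None:
    pass  # stand-in: the moment the bytes actually leave the host

def call_downstream() -> None:
    with tracer.start_as_current_span("app.request"):       # the app thinks it sent the request here...
        request = enqueue_for_proxy()
        with tracer.start_as_current_span("mesh.request"):  # ...but it actually left here
            send_on_wire(request)
```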
So when you instrument with distributed tracing, you gain this visibility and can start figuring out why that's happening. Is it a problem with the way I'm doing threading? Is it a problem with something else? And you have the pretty graphs from before, but reality is even more complex than that. This is more complex: you have requests coming from all over, going all kinds of different places, and you don't have even instrumentation, because some services have distributed tracing and some don't. But as you use it more and more, as you observe degradation, as you observe failure, you start to gain an intuition. You don't have to be a systems expert anymore. You start seeing: this is how this behaves, and this is what I can do to address that kind of issue. You start enabling your application developers, or yourselves, to reason about distributed systems without having to be distributed-systems experts.

But this process takes a long time, and you're still waiting for failure. You still need to observe the failure, and often that comes at the cost of your customers. Your customers see failure, you see failure, you fix it, and then you wait for the next failure to validate the fix.

So the other tool I'd like to introduce you to is chaos engineering. Principlesofchaos.org states that chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production. In other words, we break things on purpose. To quote the CTO of Amazon Web Services, Werner Vogels: everything fails, all the time. So failure happens. Hardware fails, networks partition, bad software gets deployed, SaaS fails, and humans err. People make mistakes: they click the wrong button, they enter the wrong number, and suddenly you've taken down all of S3. Hopefully not you personally, but it happens.

And again, it's even more complicated, unfortunately. When you're running one of these complex services in reality, it's not just software you've written. You're often integrating with cloud SaaS, like S3 or Dynamo, and you're also integrating open-source software. These aren't things you control. They aren't things you can instrument, and you obviously can't control when they fail. So if you see a failure and design something to address it, you're then waiting for that failure to happen again. You go: oh, there was an outage, I'm going to design my system to be resilient to it. And then you wait two years to see it fail again, and because of the rate of change, your system has probably changed so much in the meantime that the fix is irrelevant, and you're going to fail the same way again when that SaaS you depend on goes down.

The difference is that chaos allows you to experience failure on your own terms. It allows you to validate your changes, to make sure that what you're doing is going to solve the problem before it becomes a problem for your customers.

Now, everybody sees the Netflix talks on chaos and goes: yeah, I'm not going to run Chaos Kong. I'm not even multi-region; I'm not going to fail regions. This is crazy. So here's how our team got started with chaos, and it's called the chaos game day. In this process, which you usually run in stage, you have one person, the master of disaster, who understands the system and the different ways it behaves, and who is going to cause a failure on purpose.
And then you have the team, in their standard incident-response mode. They're across the table, or in a different room if you really want to separate people, and they're going to use their standard incident response, chat ops or whatever else, to respond to it.

So we did this, and my manager asked me to be master of disaster, which was super exciting for me, because I'd seen systems fail in all kinds of crazy ways. I wanted to sit there typing: network partition! Prime-numbered packets dropped! Whatever. And he said: no, let's start simpler.

So we started with this. We start the incident, I'm sitting across from the team, and I run sudo halt. I shut down one server out of the large number of servers that make up this service. And I'm watching the metrics, the team's watching the metrics, and nothing changes. Nobody notices that the capacity is gone. They're sitting there, and because they know I've done something, they're furiously trying to figure out what, and they just can't. So I do it again, and again, and again, until every server in that tier is shut down, and then they go: oh, it's a total outage, everything's gone. So they bring the capacity back up, restart the service in stage, and then we do a post-mortem: why didn't we see it, what happened, why was this a problem? And out of that came betterments: concrete improvements based on actual failures, which over time built us a system that let us validate the resilience we'd put in.

And this is the secret of chaos game days: they allow you to build operational expertise without waiting for failure. Often people will say: you've got to have five years of experience to do on-call in production, because it's super hard, and you have to have lived through all these different sorts of failures to be effective. What chaos game days do is let you accelerate that process dramatically. We had a new college grad on our team, and in her first chaos game day she's like: this is sort of crazy. But by the third, she was just knocking stuff out. She would see the problem, map in her head what sorts of things depended on that service, find it, and fix it. And I was throwing crazy stuff at her, the stuff I'd wanted to throw from the beginning, and she nailed it every single time. You can dramatically accelerate your engineers' ability to participate in on-call, especially your younger engineers, by following this process.

Fault tolerance is an ideal we shoot for: the idea that your systems can be resilient, that they can stand up to all these failures. But there's a trap there. There's a trap in believing that you can be done, that at some point your system is going to be fault tolerant. Because the whole point of this microservices movement, this whole DevOps, this whole Agile, is to make change faster, and because you're always changing things, you're going to keep introducing risk and new failure modes. So instead of believing you're going to create a fault-tolerant system once, what you need to do is engage in this as an ongoing process. We do chaos game days every week on my team. Every single week, before we plan for that week, we do a chaos game day and we ask: is our system still resilient? If it's not, we prioritize those betterments, so we can make sure it stays resilient, before we start adding 20 new features or whatever.

So how do you get started? Here are a few tools. sudo halt, one of my favorites. There's also tc and iptables, and you need to look up exactly how to use these, because you can do bad things, like locking yourself out of an instance. A few iptables rules to block everything, and you're like: great, I blocked all traffic. And then you can't SSH back in to roll it back. That instance is gone. There's also Gremlin Inc., a commercial SaaS tool that provides a nice GUI on top of this, plus scheduling, those sorts of things. And then of course there's the Netflix open source, with Chaos Monkey.
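One pattern worth stealing regardless of which tool you pick: build the rollback in before you inject the fault. Here's a minimal sketch of a latency-injection experiment using tc's netem; the interface name, delay, and duration are assumptions, and you'd want to try something like this in stage first.

```python
import subprocess
import time

IFACE = "eth0"    # assumption: adjust for your host
DELAY = "200ms"   # assumption: the fault we want to observe
DURATION_S = 60

def run(cmd: list[str]) -> None:
    subprocess.run(cmd, check=True)

# Inject latency on all egress traffic for this interface.
run(["sudo", "tc", "qdisc", "add", "dev", IFACE, "root", "netem", "delay", DELAY])
try:
    time.sleep(DURATION_S)  # watch your dashboards and alerts while the fault is live
finally:
    # Always roll back, even if the experiment crashes; a forgotten rule is
    # exactly how you lock yourself out of an instance.
    run(["sudo", "tc", "qdisc", "del", "dev", IFACE, "root", "netem"])
```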
So, to quote Ben Sigelman, co-founder of LightStep and one of the authors of the Dapper paper: operationally independent services conspire to run the business. If you can't understand, trace, and model how that conspiracy sometimes breaks down, you are, in a word, in trouble. There's this illusion of independence, this illusion that we're all just going to deploy all the time, that it's going to be super fast and agile. But our customers are relying on that conspiracy, that hidden relationship between all these things, to make sure they can actually do the thing they need from us: make that call, make that request, share their pictures with their friends. Whatever it is, they need the thing the conspiracy creates.

So we've got distributed tracing and chaos engineering. Distributed tracing makes our services observable, and chaos engineering proves our services' resilience. Together, they unlock the power of microservices. We go from a simple system to one that's complex, but now we've made it powerfully observable, powerfully resilient, and we can start to unlock that velocity without it costing our customers.

I told you about some of our other customers at the beginning, but these are three of our customers from Twilio.org, our non-profit arm, and they're also why we do this, why it matters that our systems are available. The Polaris Project helps people escape human trafficking. They run an SMS line, because people in that situation are often deprived of ways to communicate with anyone beyond the people controlling them, but they still have a cell phone. It lets them talk to someone who can help them, who can help change their lives. Crisis Text Line helps people who are experiencing crisis, contemplating suicide or other kinds of harm. It gives them an outlet to talk about what's going on, how they can be helped, and what will change their lives. And Trek Medics serves the four to five billion people who don't have access to emergency medical services, coordinating local 9-1-1-style emergency ambulance dispatch in developing countries. All of these need to run. They need to be up. They need to just work. So that's why we do what we do. Thank you.

Thank you. If you've got any more questions, do send them through. I want to start with: what are the particularly tricky problems you've seen using these tools?

So the one I showed, with requests just disappearing on localhost, was one of the trickiest. There are others I've seen where we'll see a failure that correlates across a whole bunch of unrelated systems, and you have to figure out why those correlated.
I can't, unfortunately, go into the details of what that correlation meant, but you will find things deep inside the systems you depend on that can fail in ways you didn't anticipate, ways you wouldn't see unless you were doing tracing.

And what do you recommend for teams that are focused on product? What kind of timeline for each chaos game day?

So, what was that? What do I recommend for teams focusing on product, and what kind of timeline for each chaos game day? The most important thing is to get a regular cadence of running the tests, of running the chaos validation. Using chaos to validate your current system's resilience will guide where you need to develop observability. And when you're continually causing failure on purpose in your system, and that's driving your development of solutions, it sort of works as a system itself, without anything else needing to drive it: you'll start to feel the motivation to fix your system when you realize you can't see failure, and that's not okay. And having product people involved in that process especially helps.

Cool, that's it for now. Thank you very much. Thank you. Thank you.