Welcome to DevOpsDays Baltimore. This is my first DevOpsDays Baltimore, but not my first DevOpsDays. I always joke that Nathan Harvey and I both live in Maryland, right outside of Baltimore, and yet we only ever meet in other countries, at DevOpsDays, which is almost always true. This is one of the first times we've collided in our home state. When I was originally on the program, it was for an hour-long slot, so hopefully you can empathize with the challenge of delivering this in thirty minutes.

I'm here to talk to you about reanimating DevOps, and I have a lot of harsh words, just to wake you up in the morning. I'll try to keep my potty mouth under control, but that's not going to happen.

The most interesting thing to me that's happened in DevOps is actually CI/CD. First, a survey: how many people here were in the computing industry before DevOps became popular? Okay, that's a great number. How many came from the software engineering, software development lifecycle side? And how many came from the operations side? Wow, it's like we can have some sort of pit fight later. It's going to be great.

Now for some controversial statements. I was around when DevOps was coined, and I've participated in the DevOps community, a little more passively than actively, since then. I have a very specific view of how this thing played out, and how I wish it had played out instead. When I think of DevOps, it was really putting dev into ops. What I saw was a somewhat condescending view from software engineers coming in and saying: these poor operations people, their life is so tough, they don't understand how to automate anything, everything they do is just rote, repetitive crap, so I'm going to take my software magic and put it into your operations. The two things that came out of that that were really fantastic were CI and CD: the idea of automating software development, release engineering, release management, and even deployment, which overlaps a lot with ops. It was really amazing. It was a way for organizations to move faster, and it required a tremendous amount of collaboration between development and operations.

This talk is about reanimating DevOps from a corpse. "Carcass" may be a bit harsh, but then again, it may not be. So what does that look like? The reason CI/CD became so important: has anybody here ever deployed software? Okay, keep your hands up if it worked. Right. We keep deploying the software, and the software is junk. The interesting part of all of this, and let me make sure this is the right slide, the one on the left, is what really happened: software engineers had this idea that operations people struggled so much trying to deploy software, that they couldn't automate things, that they really needed help. And operations people were just saying: can you please not give me shit software? That's all I'm asking for, just don't give me stuff that doesn't work. The empathizing with the target of this effort didn't work very well. Software engineers identified problems that operations people had, but it wasn't their primary problem. So what do operations people need?
What I want to leave you with is the part of DevOps that never really came to fruition. We put a lot of software engineering practices into operations. What we didn't do is put the operations blood, sweat, and tears, the operational practices and techniques, into software engineering. That's what's missing.

Operations desperately thrives on observability. When you deploy something, you have to be able to ask yourself: is it working? If you can't answer that question, you have serious, serious problems. The first thing we always do is nothing. How many people here have launched software and had no idea whether it was working? Hopefully fewer today than ten years ago, but ten years ago it was standard practice: the box is on, I can ping it, the process is running, my job as operations is done. And the reason it was done that way was that you had no porthole into that software to understand whether or not it was working.

Then what do more sophisticated software engineers do? They give you logs. So now you're paying Splunk a lot of money, you're looking at logs, and you're doing some sort of red-light/green-light monitoring to make sure the system is working correctly. But just because systems are up and on doesn't mean they're actually servicing their users well.

The third phase is metrics analysis: taking measurements out of the software that act as surrogate indicators to tell you things are actually working the way they're supposed to. It turns out that's still not quite enough, because when those metrics go wrong and you know something isn't working, you're left with all the pieces but you don't understand why. So the last step is really understanding what the behavior of the software is. How many people here have deployed a piece of software, had it break, and no matter what question you wanted to ask of the software, it had no answer for you? Why is this slow? What are you doing? Why did that database operation fail? Why did the service crash? What is the stack trace? Those are the types of things that software should tell an operator. It shouldn't be something they have to divine.
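To make that metrics phase concrete, here's a minimal sketch in TypeScript of surrogate indicators a service could expose to its operators. The names (`recordRequest`, `snapshot`) are mine for illustration, not from any particular library:

```typescript
// Sketch: surrogate metrics a running process can answer questions with.
interface Snapshot {
  requests: number;
  errors: number;
  p99LatencyMs: number;
}

const latenciesMs: number[] = [];
let requests = 0;
let errors = 0;

export function recordRequest(latencyMs: number, ok: boolean): void {
  requests++;
  if (!ok) errors++;
  latenciesMs.push(latencyMs);
  if (latenciesMs.length > 10_000) latenciesMs.shift(); // keep memory bounded
}

// Wire this to a status endpoint so "is it working?" has a real answer.
export function snapshot(): Snapshot {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  return {
    requests,
    errors,
    p99LatencyMs: sorted[Math.floor(sorted.length * 0.99)] ?? 0,
  };
}
```

The particular shape doesn't matter; what matters is that an operator can ask the running process whether it's working and get numbers back instead of divining.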
A really quick detour: we actually solved this problem on a single system about twelve, maybe fifteen, years ago with DTrace. Does anybody here know what DTrace is? A couple of people. How many people here run Linux? Okay, how many run Windows? Those are the two platforms that don't do DTrace; pretty much every other platform in the world does, from QNX to FreeBSD to Mac OS X. It was originally developed at Sun Microsystems on the Solaris platform, in Solaris 10. It came out of the frustration of software engineers who wrote this beautiful piece of software, which of course didn't work; they deployed it, the customer said "this is slow," they got on the production system, asked "why is it slow?", and they didn't have the tools to answer that question.

I don't know how many software engineers have told you, when you're operating something and say "this doesn't work right": "well, can you reproduce it in development?" And the answer is: no. No. The problem is here; go fix the problem. The idea of being able to interrogate a production system when it isn't working is the holy grail. I don't want to have to reproduce the problem in a development environment, because I may not reproduce the same problem, I may struggle to reproduce it at all, and at the end of the day, the one production problem I can observe is the real problem.

DTrace solved essentially all of that. It lets you instrument basically anything in the system: system calls, host bus adapters, everything from the hardware performance counters all the way up to lines of Ruby code and stack traces, all in one simple language that looks a little bit like awk. You can say: when I reach out over this TCP socket and write some data that looks like this, give me a stack trace and tell me how much memory I just allocated. Weird questions, the kind where someone asks "why would you ever ask that?" and an operations person says "don't judge me." You have a problem; you need answers to your questions. Recently, eBPF landed in Linux, and we actually have a bright future now: eBPF is plumbing inside the Linux kernel that allows for building tools like DTrace. These are pretty exciting times, and we are at the beginning of them.

Now, really quick, a story about web monitoring. Does anybody here run a website? You use the web, at least? Okay, cool, you're not all app-nation. Has anybody used Google Analytics or some other tool like that? Okay. Rewind to 1997, 1998: did anybody actually run a website in 1998? Only three or four hands. All right, we're old. I'm going to give you an insight that will make the blood drain out of your face: how did you know your site was up in 1998? Well, 95% of people didn't know they had to care. The other 5% of us paid inordinate sums of money to companies like Gomez and Keynote to ping our sites every fifteen minutes to make sure they were working. And when the report came back green, you could do the math: four checks an hour, 24 hours a day, 96 checks, and you said "one hundred percent uptime, I'm perfect, this is great." You're all laughing, and you should be. Checking that your site works every fifteen minutes should give you about zero confidence that it is available, accessible, and performing. That is the dumbest thing we could do with today's technology.

So what do we do instead? We embed analytics into the web page, and every single person who actually visits the site ends up reporting things to you, like performance data and availability data. Obviously you don't know who didn't show up: if someone tries to access the site and it completely fails, they report nothing, and that's a little hard to track, which is why we still ping our sites, usually every one to five minutes. But most of the data we have about the performance and availability of the system comes from real user monitoring. Real user monitoring for the web is obvious now.
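As a sketch of what that embedding looks like, here's a minimal RUM beacon in TypeScript. The browser APIs (`PerformanceNavigationTiming`, `navigator.sendBeacon`) are real; the `/rum` collection endpoint is hypothetical:

```typescript
// Minimal RUM sketch: every real visitor reports load performance back.
window.addEventListener("load", () => {
  const [nav] = performance.getEntriesByType(
    "navigation",
  ) as PerformanceNavigationTiming[];
  if (!nav) return;
  const payload = JSON.stringify({
    url: location.pathname,
    // Time from navigation start to full page load, in milliseconds.
    loadMs: nav.loadEventEnd - nav.startTime,
    // Time to first byte: a rough latency/availability signal.
    ttfbMs: nav.responseStart - nav.requestStart,
  });
  // sendBeacon survives page unload more reliably than fetch().
  navigator.sendBeacon("/rum", payload);
});
```

Note the gap mentioned above: a visitor whose page never loads reports nothing, which is exactly why synthetic pings still run alongside RUM.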
Now I'm going to paint a picture of today. How many people here run infrastructure at all, any systems, a Linux box, a Windows box, or, if you're happy and have DTrace, a FreeBSD box? And how many of you ping it every ten seconds or every minute to ask it questions like "is your disk full?" or "are you on?" Right. That seems about as interestingly dumb as the system we ran fifteen or twenty years ago. The computer is actually doing stuff; that's its whole point. It's running software, servicing users, making TCP connections, sending packets, all of that, and every one of those things you could potentially measure to see how it's doing. That's where this is going: observing the entire stack and all of the behavior that happens on the system.

So when we talk about data explosion: we used to ping boxes every five minutes, then every one minute, and now some people say "well, I ping it every ten seconds, that's fast enough." Okay. Do you realize that a box with 72 cores running at 2.6 GHz does a lot in ten seconds? A lot, a lot, a lot. Billions and billions of things in ten seconds, and any one of them can have an aberrant latency that causes problems for users; any one of them can go off the rails.

The problem is an information-scale problem. Moore's law says technology tends to double every eighteen months. If you do that math out from 2000 to 2017, that's a 2,500-fold increase, 2,500 times the size of data ingest. We're starting to look at things like an exabyte a day coming into architectures if you observe all of this, and people say: what? An exabyte? That's insane, no one's going to do an exabyte a day. So I'll rewind twenty years: today we have a terabyte of RAM in boxes, and if you'd told that to somebody twenty years ago, they'd have said "that will never happen, ever, that's crazy; there's one box in the world that needs that, who cares, everyone else will scale out." Now we have racks and racks of boxes with a terabyte of memory each. This is coming. Fifteen or twenty years from now, the idea of people tossing around their dollar-a-month exabyte of cloud storage is totally going to happen. The data you'll put in there isn't necessarily information overload, but you definitely have to have tooling for making sense of it all. These are the problems that are coming: what do you do with all of that?
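For the skeptics, the arithmetic behind that 2,500-fold figure is just compounding, one doubling every eighteen months from 2000 to 2017:

$$
2^{(2017-2000)/1.5} = 2^{17/1.5} \approx 2^{11.3} \approx 2{,}560 \approx 2{,}500\times
$$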
So, back to systems that are broken. There are two methods for dealing with them, because, as in the previous slide, we don't have an exabyte of storage. If I observed every single thing that happened on any of my systems, I would explode my information space; I could never store it all, and I couldn't really make sense of it. So there are two things I can do: I can store it all in aggregate, throwing away critically important details, or I can sample it. Anybody who tells you that one of those is right is completely wrong. They're both right; they're two entirely separate approaches.

With sampling, every 100,000th transaction, or every transaction with a known anomaly, you trace in really, really detailed fashion: everything you can think of about that transaction, and you keep it. That's really useful for understanding the nature of transactions in the system and for debugging problems. But then there are a hundred thousand other things that happened in that system that you don't know about. What you want is to store everything, but you can't, so instead you store behavioral data about it, like how long those operations tend to take, and you aggregate that data.

Where does that go? On the tracing side, you have technologies like eBPF and DTrace and other event-driven tooling. Honeycomb is starting to look at this, and the product is really cool; I don't really understand the marketing, because they highlight the tracing side as if it's the only one that matters. It is super important, and those sorts of tools are really interesting: they help you debug systems and ask "what is going on?" But without generalized behavioral information, you're left with only one side of the story. We need that generalized behavior side too. That's what we do at our company; we track that data.
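A toy sketch of the two approaches side by side; the thresholds and the log2 bucketing are invented here for illustration, not anyone's product:

```typescript
// Sketch: aggregate everything cheaply, sample a few transactions in detail.
const SAMPLE_EVERY = 100_000; // head sampling: keep every Nth transaction
const ANOMALY_MS = 2_000;     // ...plus anything anomalously slow

let seen = 0;
const histogram = new Map<number, number>(); // log2 latency buckets

function observe(tx: { id: string; latencyMs: number; detail: object }): void {
  // Aggregation path: behavioral data about *every* transaction,
  // at the cost of throwing away per-transaction detail.
  const bucket = Math.ceil(Math.log2(Math.max(tx.latencyMs, 1)));
  histogram.set(bucket, (histogram.get(bucket) ?? 0) + 1);

  // Sampling path: full detail, but only 1-in-100,000 or known anomalies.
  if (++seen % SAMPLE_EVERY === 0 || tx.latencyMs > ANOMALY_MS) {
    console.log("trace", tx.id, tx.detail); // stand-in for a real trace sink
  }
}
```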
Now I want to dive into things you can take back to the software engineering organization that will make their software run better in production. Our entire goal is not just to ship shit fast; it's to ship software fast. We really don't want those two things to be synonymous. So how do you make your software better?

The first and most important thing we can learn, and this is much more important now that we have distributed systems everywhere: I studied at Johns Hopkins, in the Center for Networking and Distributed Systems. I ended up leaving sans PhD after spending way too many years there; get me drunk enough and I'll tell you the story. I left saying "I will never debug distributed systems again," and here we are, so that did not work out how I planned. The most important thing you can do in building a system, particularly a distributed system, is understand how to fail with grace. You need to fail quickly and safely. The airline industry is a great source of anecdotes here: engines fall off of planes, and the planes still land, and people survive. When things go wrong, you don't want to just keep trying. The engine falls off the plane; you don't say "we might make it." You say: this is the procedure, we're going to shut this down, we're going to turn it off as quickly as possible. The way that happens in distributed systems is much more complicated, because all of the adjoining systems need to accommodate it. So when people say new software is really designed for failure, that is a layered cake of truths and pains: any part of your system can fail at any time, and being able to react to those components and their states of uncooperative nature is really important. There are a lot of techniques for doing this; I don't have time to dive into them.

The second rule: autopsies are not just for medicine. The days of old IT, when the computer froze, the mouse didn't work, and the IT person came in and said "if you just restart it, everything will be fine": that's not an answer. When you run a hundred thousand transactions through these things, you're going to hit that failure again and again and again, and you're going to screw more and more users as your system grows. It is not acceptable to have a failure in your architecture, especially one that repeats, that you cannot explain. You have to get to the bottom of the problems. Computers are so much, well, this is being recorded, so this is not how I want to say it, computers are so much easier than children. I go to work and I tell a computer what to do, and it pretty much always follows my directions, aside from a couple of microcode bugs, and even then it's following someone else's directions. There are very few scenarios in which a computer malfunctions in a non-deterministic way; my children are entirely non-deterministic. Because the system always follows your instructions, when it screws up, it did what it was told, and it's very likely to do the same thing again. The days of "did you turn it off and on again?" really need to end, even more so in large-scale deployed distributed systems.

The third rule, which you've probably heard at any software engineering conference, is the technique of circuit breakers, and it's about failing gracefully. There is a very, very big difference between being electrocuted and being shocked. When something is about to fail, trying harder is usually not the right answer; you actually want the circuit breaker to flip. You need to design tolerances into your systems so that when a component or an interaction starts to go wrong, or goes too slow, the system doesn't just keep piling on more work.

Our traffic systems are a great example, and everybody here loves traffic, right? Anybody come from DC? Yeah, there we go, enjoy going home. Out west, the on-ramps have red-light/green-light indicators to flow-control traffic onto the highways. I know of maybe one road around here that does that, and the Beltway certainly doesn't. What happens is you get three-and-a-half-hour backups, because when the road slows down, there is no way for the underlying system to stop putting traffic into the problem; it just adds more traffic to the problem, and that really, really doesn't work well. If you look at some research on self-driving cars, they tend to go a little slower and leave more cushion between cars, and they tend not to cause traffic jams because of that: there are no jerks, they're all driving the same way.

By putting circuit breakers into the system, you can actually control the behavior of the system better. It's much better to have 50% of your traffic turned away and 50% of it served well than 100% of it served poorly. And when I say served poorly, I mean your web page doesn't load, the images don't all load, you can't check out. It's great to go shop and then not be able to pay; what do you do? Anybody know what you do? You go to Amazon. They don't have that problem. So now you've lost business and Amazon's got more business. There's a reason Amazon designs their systems the way they do. Has anybody ever had a service failure on Amazon? Has anybody ever had a really frustratingly long load experience on Amazon, where you thought it might work but you just couldn't get it to? Because I have never had one of those. When Amazon doesn't work, it doesn't work: "we're a little overloaded, sorry," and that's it. The reason is that they are gating traffic in. They only want to let in as many customers as they know they can serve; let too many in, and you degrade service for everyone, to the point where they abandon.
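Here's a minimal circuit-breaker sketch in TypeScript. The thresholds and timings are invented for illustration; real implementations add per-endpoint state, half-open probing policies, and metrics:

```typescript
// Sketch: fail fast instead of piling more work onto a struggling backend.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private maxFailures = 5,     // trip after this many consecutive failures
    private cooldownMs = 10_000, // shed load for this long once tripped
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.failures >= this.maxFailures) {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        // Circuit is open: turn this request away immediately.
        throw new Error("circuit open: try again later");
      }
      this.failures = 0; // half-open: let one attempt probe the backend
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      if (++this.failures >= this.maxFailures) this.openedAt = Date.now();
      throw err;
    }
  }
}
```

Turning callers away immediately is the "serve 50% well" move: the backend gets room to recover instead of getting more traffic added to the problem.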
Rule four: you cannot understand what you cannot measure. So many times in software, there are data structures and algorithms and searches running, and everything you do has a performance. If you launch an API service, people are hitting that API, and it takes a certain number of microseconds to service each call. I don't know how many times I've gone into an organization and asked "how long does that API call take?" and been told "it's pretty fast." What does that mean? "Pretty fast" is not a speed. This is not the Ricky Bobby of API services.

You really need to measure everything, and even if you don't store it all, err on the side of exposing everything you can in a piece of software. The software should be instrumented so that someone can ask a question later. You understand by measuring. And, for your job's sake: the best way to do an awesome job and not get a promotion or a bonus is to improve the system and not be able to describe how much you improved it. "I made it faster." "I want to go fast." That's great. But if you can come in and say "I dropped user-experienced service time from 700 milliseconds down to 250 milliseconds, and there's this Shopify report from about eight years ago that says why that's awesome; look, our revenues went up, our shopping-cart completions went up, our users are happier": if you are measuring, you can see the impact of your work and your colleagues' work. There is something worse than being successful and not being able to show it, and that's being horribly unsuccessful and being fired, but it's very frustrating nonetheless. You measure to understand.

One of the reasons we measure, and this is a predicate for it: one of the reasons I railed against operations culture for a very long time is that in operations, people just expect you to keep the website up. With phrasing like that, if it's not up, it's down; that's one hundred percent; that's perfection. Aiming at one hundred percent uptime and one hundred percent service quality is the dumbest thing any organization can do, because you can never exceed expectations. You should never set up your environment that way. The culture in the company needs to measure things and use those measurements to build what we call failure budgets. You need to have a budget for failure. You need to be able to say: maybe my site only needs to be up 99.9 percent of the time. That's actually a pretty loose constraint, given that my site is down whenever Amazon is down, and Amazon is only up 99.9 percent of the time. And if my site is up 99.99 percent of the time going into the end of the month, I can afford to go down. Why would I not go down? I'd still meet all my constraints. Why don't I take those risks? Why don't I launch those features marked "slightly unstable" and test them in production, where I get faster, more valuable feedback, and still deliver on my quality-of-service requirements and my service-level agreements? By measuring everything and defining what your uptime and your performance need to be, when you do better than that, when you are more stable than that, you can introduce the risk that allows us all to go faster. DevOps in general is pitched as moving organizations faster, but that's not actually true: it's about de-risking the speed, and this failure-budget methodology is what lets you de-risk speed.
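For concreteness, here's what those budgets work out to over a 30-day month:

$$
30 \times 24 \times 60 \times (1 - 0.999) = 43.2 \text{ minutes of allowed downtime}
$$

$$
30 \times 24 \times 60 \times (1 - 0.9999) \approx 4.3 \text{ minutes}
$$

If you're running at four nines against a three-nines target going into the end of the month, the roughly 39 minutes in between is budget you can spend on risk.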
And my last rule: while justice should be blind, operations should not be blind at all. It's a sad thing. I love this picture; it's my favorite. Don't show it to little kids; it's kind of freaky, Kermit the Frog and the hand. What did they do to him?

When you are in production, the only failure that matters is the one you're experiencing right now. So it's absolutely critical that everything you do in developing software, and all the practices you apply in operating that software, lead you toward diagnosing that failure in reality, as it is in production. Reproducing failures is both highly error-prone and very elusive. I've seen software architectures where a bug report comes in from a customer and an engineering group of three or four people spends three weeks trying to reproduce the error in development so that they can fix it. Go into production and just figure it out. And if you wreck a train car in there, you're allowed to in your environment, if you have an error budget. You can actually go and look at these things live.

I do say that GDPR and other things like it make this a little more difficult, because when you go into production, you as an engineer are exposed to information you may not be allowed to be exposed to. This complicates things, but there are ways to handle it using canary systems: production environments that don't take normal production traffic, but are in production and use all the production systems. When you have a problem, it's a lot easier to divert a problematic user to that canary system or infrastructure than it is to reproduce the entire problem in a development environment. So there are methodologies and techniques we can use when deploying production systems that give us easy access to isolating existing production problems so that we can solve them.

And that is it. If you do those things, you will be much happier people, because your shit won't be broken all the time. Thank you very much.