And we will now have Christine Yen on the floor.

Hello, thank you. I'm Christine, co-founder of honeycomb.io, but the content of this talk comes from my time spent as an engineer building tools at Parse, Facebook, and Honeycomb. A disclaimer up front: this content is vendor agnostic, tool agnostic. I don't care what tools you use, so long as you have something that lets you practice this in your development process.

I want to share today some lessons learned building tools in a space that most people think of as just an ops world. Ops folks tend to think one way, and we all know developers tend to think another, but there's so much good that comes from each side looking at the other's world and seeing how they can work together. And the service we provide to our users is what unifies us, right? That's why we're here. Theoretically we're on the same team, so we're in this together, and we should be working together to provide those services to our users.

For years, people have been talking about the DevOps movement: teaching ops folks to code and to automate their work. That's great, it's a good idea. But what about the other direction? What about "ops dev" — teaching developers to own their code in production, to feel comfortable digging around and asking questions of production, almost as if it were their development machines? I'm here to propose that observability is that bridge. It's the thing that helps developers go from understanding how their code runs on their machines to understanding how it runs in production. Observability is all about being able to ask and answer questions about your system, holistically. And that ability is as valuable for the developers writing and testing the code as it is for the operators deploying it and managing the machines.

Lots of people hear "observability" and think monitoring and ops. Part of that is fair — otherwise I probably would have had a much harder time getting into this room — and certainly a lot of the tools and resources in this space are focused on ops concerns. But what if those tools and techniques spoke a language that developers are already using and thinking about in their daily lives — things like build IDs and customer IDs? Okay, you're telling me latency is up, or throughput is down: which part of the service, which part of my API, which part of my code is being affected? When dev throws code over the wall to ops, it's a black box, and when ops looks at that black box and says something's wrong, what do developers need to figure out why and how to fix it? Help me isolate the problem, help me reproduce the problem, so that I can figure out what's going on and fix it. The more we bridge this gap — not "what ops sees" versus "what dev sees," but simply what is happening in production — the better off we all are.
There's more that dev and ops folks can do to share responsibility for what's happening in production. These days, software development can be super process-focused, right? There are all these great practices: design docs, architecture reviews, TDD, more tests, code reviews, CI/CD. And then we ship, and we celebrate, and we're like, awesome, my code is done. And then we just sit around waiting for customers, or whoever's on call, to tell us something is wrong. What happened to that diligence? What happened to all that curiosity up front, when we were making sure all these cases were covered, exploring the edge cases and capturing them in tests? Well, we know why: it's a lot easier to isolate and check for these cases on our own machines than in production.

But during the development process we have all these questions about our system that don't smell like production monitoring questions, yet can only be answered in production. They aren't all problems or anomalies; they aren't all things that will pop up on their own. Instead they're about hypotheticals, about customer segments: if I make this change to the code, if I optimize this thing over here, will it actually improve things the way I think it will? Often the question is even just, what does "normal" mean? Okay, this is what I think is happening, and I want to make it a little better — but am I sure? Are we sure that what people are actually seeing matches what my test cases say it should be? By being curious and empowered to use production data to explore what our services are doing, we can have that data inform not only what we build or fix, but how to build it, how to scope it, how to make sure it's working, and how to roll it out.

So here are some very traditional, standard ops-style production questions — and it doesn't take much to turn them into questions that developers do care about: things that affect the business, things that are relevant to our users. The answers to these only live in production. You're not going to be able to simulate them on your laptop without going and looking at what is actually happening — who's doing what, at what rates. Our tests are only as good as the cases we can think to write ahead of time. Benchmarks can be run, but they're isolated, synthetic things. And exceptions aren't enough, because exceptional behavior doesn't always result in exceptions.

So what can we do to extend these great development practices into production? We can't just wait for the ops folks to tell us what's going on in that later part of the development cycle. And if it's the customers who tell us, we've missed an opportunity to be amazing — relying on your customers to be your QA sucks for the brand. We should be taking the initiative to get the same visibility into production that we have in development.

Let me tell you some stories and see how you feel. This is Bea. She's a software engineer on a small team, and she's been tasked with improving how her team enforces rate limits. Until now, rate limits have just lived in an in-process cache — small startup, resource constraints — where each server tries to approximate the rate limit individually, based on some global number and how many servers there are.
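As a rough mental model, that in-process approach might look something like this minimal Go sketch; the names and structure here are hypothetical, not the actual system:

```go
// Each server assumes it sees an even share of a customer's traffic and
// enforces globalLimit / serverCount against its own local counters.
package ratelimit

import (
	"sync"
	"time"
)

type localLimiter struct {
	mu          sync.Mutex
	counts      map[string]int // requests seen per customer in the current window
	windowStart time.Time
	window      time.Duration
	globalLimit int // allowed requests per customer per window, fleet-wide
	serverCount int // how many servers we believe are splitting the traffic
}

func newLocalLimiter(globalLimit, serverCount int, window time.Duration) *localLimiter {
	return &localLimiter{
		counts:      map[string]int{},
		windowStart: time.Now(),
		window:      window,
		globalLimit: globalLimit,
		serverCount: serverCount,
	}
}

// allow reports whether this server's share of the global limit still has
// room for the given customer in the current window.
func (l *localLimiter) allow(customerID string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	if time.Since(l.windowStart) > l.window {
		l.counts = map[string]int{} // new window, reset local counts
		l.windowStart = time.Now()
	}

	perServerLimit := l.globalLimit / l.serverCount // the approximation in question
	l.counts[customerID]++
	return l.counts[customerID] <= perServerLimit
}
```

The weakness is right there in perServerLimit: if a customer's traffic isn't spread evenly across servers, they get limited earlier or later than the global number promises.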
While it worked well enough, it hadn't kept up as the service grew. There was this uncomfortable correlation between a customer's effective limit and which servers their traffic happened to land on, and how loaded those servers were that day; customers have been rate limited more, or less, than they should be in some cases. So it got to the point where it was time to stand up a shared cache and clean up this logic.

But instead of just making the change, Bea said: I want to be careful about changing what customers see. I can't just change the promises we make, push the code, and hope for the best. She wanted to know what data to gather to build confidence that the new rate limiting algorithm would work correctly and not screw over too many current customers. By bringing data into the decision-making process, she could simulate the change — see what the new behavior would be before flipping the switch and making those promises real.

So what she did: alongside the logic in her API server that calculated the rate limit decision for a request, she also added a bit of instrumentation that tracked whether that request would or would not be rate limited under the new algorithm. That let her visualize: this is what's currently happening, and this is what would be happening in this brand new world — the new code isn't actually enforced yet, we're just simulating it. And she could see: cool, the new rate limit algorithm would have kicked in in a couple of places — that's where it should, that's what we expect. She and her team were able to examine each case individually and ask: this customer, this use case — is this what we expect to happen? This is taking test cases out of development and into production, where the test cases are actual users, actual workloads, actual machines. What's so special about my machine anyway? It's so much more informative.

Well — obviously on your own machine you have the ability to attach a debugger, set breakpoints, and poke around without screwing over actual customers. But that's essentially what we just did, right? That's how she got all the graphs on the previous slide: she captured this metadata. These are our debug statements. They're lightweight, they help us validate hypotheses, they carry metadata specific to the business, and they describe the execution of our software. They sound a lot like what we want out of a debugging process.
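Concretely, that "simulate the change" instrumentation might look something like this hedged sketch: ask both the current algorithm and the proposed one on every request, enforce only the current answer, and emit both decisions as fields on the event. The names here (decider, emit, the request header) are illustrative, not Honeycomb's actual code.

```go
package api

import (
	"encoding/json"
	"log"
	"net/http"
)

// decider abstracts "may this customer make another request right now?" so
// the current and proposed algorithms can be consulted side by side.
type decider interface {
	allow(customerID string) bool
}

var (
	currentLimiter  decider // the per-server approximation enforced today
	proposedLimiter decider // the new shared-cache algorithm, running in shadow mode
)

// emit sends one structured event to whatever observability tool is in use;
// here it simply writes a JSON log line.
func emit(fields map[string]interface{}) {
	b, _ := json.Marshal(fields)
	log.Println(string(b))
}

// handleRequest enforces only the current algorithm's decision, but records
// what the proposed algorithm would have done so the two can be compared.
func handleRequest(w http.ResponseWriter, r *http.Request) {
	customerID := r.Header.Get("X-Customer-ID") // illustrative header

	limited := !currentLimiter.allow(customerID)     // enforced
	wouldLimit := !proposedLimiter.allow(customerID) // simulated only

	emit(map[string]interface{}{
		"endpoint":              r.URL.Path,
		"customer_id":           customerID,
		"rate_limited":          limited,
		"would_be_rate_limited": wouldLimit,
	})

	if limited {
		http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
		return
	}
	w.WriteHeader(http.StatusOK) // ...then handle the request as usual
}
```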
All right, so Bea has leveled up. She's got her debug statements in her pocket, she knows how to do that, and she's ready to do a little more. Now she's looking at the storage layer, where she's been working on some functionality that changes how data gets laid out on disk. She knows it will impact a small set of customers a lot, but shouldn't have a hugely visible impact on everyone else. It's a very performance-sensitive system, though. She doesn't want to just make the change without being sure about it; she wants to see what will happen, again, with a production workload on production machines. All the existing tests, and the new tests she wrote along the way, pass — but as we said earlier, tests are not equivalent to testing in production. You all know where this is going.

The advantage of local tests is a controlled environment where you can construct state and execute tests — but you can do much the same in production by using something like a private test segment, adding that dimension to your observability solution of choice, and watching closely to see what happens. So she rolled out her code behind a feature flag and turned the new code on for a couple of customers, on a couple of machines. She was able to watch the top-level performance metrics she normally cares about, and also to directly compare that very small segment of folks against the control group. As confidence in the new storage layer increased, she rolled it out to 10%, then 20%, then the rest of the cluster, all while keeping a close eye on the performance metrics: is everything what we expect to see? In her graphs she could actually see a very slight increase in latency as a result of the change — but it was expected, it was within an acceptable threshold, and, more than anything else, because she could attribute that change to this code path and this set of customers and isolate it from everything else, she could feel confident: yes, this is expected, this is correct, this is doing what I want.

What she did here is similar to what a lot of us do with canary deploys — tying a new build to a single host or set of hosts and comparing — but that's thinking about things from an ops perspective. If instead we think about things from a business perspective, where some customers are write-heavy and some are read-heavy, then the choice of individual servers is arbitrary and possibly dangerous. Feature flags are like a smarter, more controlled canary that developers can drive, and being able to incorporate them into our observability tooling gives us the ability to ask and answer these ephemeral questions about our code in a way that developers can work into their normal process.

Sometimes, though, even with the best intentions, we manage to release code with bugs or other unintended consequences. And the question that always comes up when you're reacting to a customer report is: when did this start happening? Not necessarily because anyone is concerned with the exact time of day — although that can matter too — but mostly because it's the simplest way to get back to: which commit, which pull request, which new code might have caused this? As any ops person will tell you, the biggest source of chaos in a system is usually humans — usually humans pushing out new code. So Bea got tired of this backwards timeline dance and got religious about tagging her data with build IDs — something that tied this code, this build, to the things she saw on her graphs. Once she got that going — one extra piece of metadata attached to her events — she could go straight from "everything kind of went up around this time, not really sure why, someone should go find out what happened" to "okay, it was this build; let's go track that down."
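Putting those two ideas together — gate the new code path behind a flag for a chosen segment, and stamp every event with the build that produced it — a hedged sketch might look like this; buildID, newStorageLayerEnabled, and the placeholder write functions are illustrative assumptions, not the real system:

```go
package storage

import (
	"encoding/json"
	"log"
	"os"
	"time"
)

// buildID would typically be injected at compile time
// (e.g. go build -ldflags "-X pkg.buildID=$(git rev-parse HEAD)")
// or, as here, read from the environment at startup.
var buildID = os.Getenv("BUILD_ID")

// newStorageLayerEnabled stands in for a real feature flag check:
// only the chosen test segment gets the new on-disk layout.
func newStorageLayerEnabled(customerID string) bool {
	return customerID == "internal-dogfood" // start with ourselves
}

// emit stamps every event with the build that produced it before sending.
func emit(fields map[string]interface{}) {
	fields["build_id"] = buildID
	b, _ := json.Marshal(fields)
	log.Println(string(b))
}

// writeRecord routes a write through the old or new layout and records
// which path was taken and how long it took.
func writeRecord(customerID string, payload []byte) error {
	start := time.Now()
	useNew := newStorageLayerEnabled(customerID)

	var err error
	if useNew {
		err = writeNewLayout(payload)
	} else {
		err = writeOldLayout(payload)
	}

	emit(map[string]interface{}{
		"customer_id":      customerID,
		"new_storage_path": useNew,
		"duration_ms":      time.Since(start).Milliseconds(),
		"error":            err != nil,
	})
	return err
}

func writeOldLayout(p []byte) error { return nil } // placeholder
func writeNewLayout(p []byte) error { return nil } // placeholder
```

With build_id and new_storage_path on every event, "when did this start happening" turns directly into "which build, which code path" in the graphs.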
By making her observability tools talk about her system in terms she dealt with day to day, she could go from the "ship it, celebrate, and be done" version of the process to filling in the middle: thinking about testing in production, investigating outliers herself — in developer terms as well as ops terms — and really communicating about and exploring production, hopefully finding and identifying problems before her customers or the ops team do.

That's how we bridge this dev and ops gap: developers really start thinking about what happens after they ship. Instrumentation and monitoring aren't just things bolted on at the end of the dev cycle, once everything "works" and we're afraid to touch it; they should be sprinkled in and continually checked, even while you're still figuring out "I think this is right, but I want to test it, I want to see the impact." By capturing this more lightweight, transient information before we actually ship the code to everyone, we can make better-informed decisions and deliver better experiences to our users. This can be part of our development process: rather than just writing a plan, writing the code, and shipping it in isolation, we're actually forming a hypothesis about what's happening in production and checking whether the code we're writing does what we think it does. We've all played out the path where we have a great idea, we ship it, and then it turns out it doesn't actually do anything useful in production. What if we could cut out that chunk of time spent writing code for nothing? We could validate our hypothesis up front and make sure we're spending our time where it matters.

All right, you're saying: great, this all sounds nice and wonderful, but how do we get there? What can we actually walk away with, beyond high-level idealistic stuff? I think a talk a little later today will cover a lot of the same ground, but as we saw in the stories, there's a lot to be gained by just capturing something about every HTTP request your server handles: which endpoint, which method, how long it took. With that alone you start to build a really high-level view of what's happening, and it's even great for questions like: would anyone care if we deprecated this old version of our API? Is anyone actually still using it?

From those standard HTTP fields, start sprinkling in some business-relevant identifiers — things that actually describe what the workload looks like at that point — as well as infrastructure characteristics, like which partition you're reading from or writing to. Thread all of that in there, because at some point you're going to be looking at something strange, you're going to want to break it down, and you're going to want to discover that it's that one customer, or that one host, having problems again. Those first two steps set you up for the sort of ad hoc, ephemeral instrumentation you saw in the two stories: as you find yourself asking questions — what am I changing, what does success look like — you'll already be set up to capture the information necessary to answer them, before, during, and after the change, and to understand how what you're doing affects actual measures of health, as in the rate limiting and storage examples we saw earlier. None of this has to be added all at once up front; the simpler and easier you make it for developers to add these things on the fly, the more they'll be empowered to pull that data out and actually use it.
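As a starting point, that baseline per-request instrumentation could be as simple as this hedged middleware sketch; the specific headers and field names (X-Customer-ID, api_version, and so on) are assumptions for illustration:

```go
package httpinstr

import (
	"encoding/json"
	"log"
	"net/http"
	"os"
	"time"
)

// statusRecorder captures the response status code written by the handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// emit sends one structured event; here, a JSON log line.
func emit(fields map[string]interface{}) {
	b, _ := json.Marshal(fields)
	log.Println(string(b))
}

// Instrument wraps an http.Handler and emits one event per request.
func Instrument(next http.Handler) http.Handler {
	hostname, _ := os.Hostname()
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}

		next.ServeHTTP(rec, r)

		emit(map[string]interface{}{
			// the plain HTTP basics
			"endpoint":    r.URL.Path,
			"method":      r.Method,
			"status":      rec.status,
			"duration_ms": time.Since(start).Milliseconds(),
			// business-relevant identifiers (illustrative)
			"customer_id": r.Header.Get("X-Customer-ID"),
			"api_version": r.Header.Get("X-API-Version"),
			// infrastructure characteristics
			"hostname": hostname,
		})
	})
}
```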
Part of this is just cleanup and hygiene. Earlier today we talked about things like too many values and too many time series, and sometimes you do have to tidy things up, especially if you're generating a lot of ephemeral fields. A few things keep this data easy to work with. Think about something like Go's context as the way to thread "this is relevant to this request" through your various microservices, picking up whatever attributes along the way characterize that request. And agree on a common set of nouns and naming conventions — I can't tell you how many folks we've worked with who ended up with logs where some systems say app ID, some say application ID, some say app-dash-ID. Help out future you.

In the end, look at what you need, what your developers need, what makes sense for your system. Say you have a service where you care a lot about read performance and analytics — "I really need to optimize these queries" — then a lot of what you care about is the shape of each read request, so that you can go in later, slice and dice, and see how read requests with particular characteristics perform. On the other hand, say you've optimized the heck out of the write path. Even if your instrumentation library is super optimized, super low overhead, you still probably want to be sensitive to performance there and careful about what you record — maybe you don't want to emit an additional event for every single write you process, so you sample instead. Let the use case dictate. Again, this requires the folks doing the instrumentation to understand these trade-offs — to talk across that dev and ops divide.
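Pulling the context-threading and naming points together, here's a hedged sketch of one way to do it in Go; the package and function names (Start, AddField, Emit) are illustrative, not any particular library's API:

```go
package requestevent

import (
	"context"
	"encoding/json"
	"log"
)

type ctxKey struct{}

// Start attaches an empty field map to the context for this request.
func Start(ctx context.Context) context.Context {
	return context.WithValue(ctx, ctxKey{}, map[string]interface{}{})
}

// AddField records one attribute on the request's event, from any layer that
// has the context: handlers, storage code, downstream client wrappers.
func AddField(ctx context.Context, key string, value interface{}) {
	if fields, ok := ctx.Value(ctxKey{}).(map[string]interface{}); ok {
		fields[key] = value // not safe across goroutines without a lock
	}
}

// Emit sends the accumulated fields as one event; here, a JSON log line.
func Emit(ctx context.Context) {
	if fields, ok := ctx.Value(ctxKey{}).(map[string]interface{}); ok {
		b, _ := json.Marshal(fields)
		log.Println(string(b))
	}
}
```

A handler would call Start at the top of the request, sprinkle AddField("customer_id", ...) or AddField("partition", ...) wherever that information becomes available — using one agreed-upon spelling per field — and call Emit once at the end.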
And because I always like to show folks what this looks like in practice: this is how Honeycomb's own API server schema evolved. We started off with very simple, high-level HTTP parameters for each request. Then we started adding things: timers, fields specific to the subroutines being run, things we were trying to debug in the API server at the time. And we kind of just kept going — kept adding new things, keeping the ones that proved useful — building up this giant corpus of fields we could break down by, measure, and look at. It's a virtuous cycle: the more we could see, the more we wanted to know, and the more we could build smartly with the smallest amount of code.

So, in conclusion: again, I don't care what tools you use, so long as you're thinking about incorporating this sort of production data into your development process. The folks involved in building and shipping software should understand how that software behaves once they throw it over the wall. Software developers should be owning observability, because we have the most to gain. Observability should be a core part of how we understand what to build, how to build it, and who we're building it for — of evaluating our hypotheses before we decide, cool, that code is definitely worth writing. We have the power to make our users happier by building better software, and by making our observability tools reflect the reality that developers live in — build IDs, feature flags, customer IDs, not just CPU and memory — we can all be better engineers and ship better software.

By the way, all the stories I told are real; you can find them on the blog. Thank you all — I'll be around, so come find me at lunch.