Test, mic. I'm nicely mic'd. Didn't have an AV person either, so I wired myself. It's 11:30, but my track... I'm not sure. Unless any one of you is the track owner. No. And I haven't talked to anybody about AV, so I don't know if this thing's recording. So let's give it just a minute or two. I'll most likely get through all the material and I'm happy to stick around. If nobody shows up in a couple of minutes, I'll go ahead and get started. And if anyone would know, Google would, right? Because they said everything was going to be recorded. Huh. Unless I somehow missed some check-in. I'm in the right place at the right time. Oh, it is being recorded? Oh, we're live? In that case, I better get going. Okay. I'll do this. Oh, the stream is about five minutes behind. So... I'm just going to do it this other way. Okay, good morning. I'm Matt Ingenthron. I'm going to talk to you a bit this morning about the tyranny of the nines, solved. If you don't know what the tyranny of the nines is yet, you'll find out momentarily. Maybe that's what you came to find out about. And mainly what I'm going to talk about is Open Tracing. Open Tracing is one of the CNCF projects; I'll get into that a little bit, and how it can help you identify where problems are in your system. So first I'll define the problem a little bit. It's kind of interesting to me, and it keeps resonating with me over the years as I've seen all of us start to build out larger and larger systems, larger distributed systems, larger cloud systems. There was a paper some years ago from Jeff Dean and Luiz Barroso called The Tail at Scale. And it really was about what that latency tail looks like at Google's scale. They said it's challenging for service providers to keep the tail of the latency distribution short for interactive services as the size, complexity, and use of the system increase. Temporary high-latency episodes tend to dominate. And they walk through why that is, a little bit of the math behind it, which I'm going to lay out here in a pictorial representation, and a series of things that you can do to solve that problem. It turns out there are a number of techniques that you can use on a large distributed system to really try to minimize the effect of those occasional slow operations. Specifically, what they lay out is this: imagine that you define, well, Google does things as service level objectives, which is a little different from some companies that try to come up with a service level agreement, but it's roughly the same thing, just a question of how you measure against it. But imagine for a moment that you have a system, and in that system you have some sort of front end that is making calls to various different systems, and then those various different systems have to fan out a request across a large number of systems. Regardless of the scale, this probably looks like a lot of applications that you have built. Especially these days, especially in this world of Kubernetes, where I have a bunch of pods that are going to run this particular app, or in something like Couchbase, where we're a distributed database and we're going to spread data out and then have to access a lot of data. The world has been trending this way for a while.
Now if you assume that you have a fan-out to a hundred servers, probably still a lot for most people, depending on whether you're talking about stateful or stateless, but assume that you fan out to a hundred servers, something that has to touch a hundred servers, so in something like a Couchbase deployment, and the median is 10 milliseconds but your 99th percentile response time is one second, right? That's not too bad, right? Except with that fan out, if you have to touch every server, it turns out if you run the math on it, 63 percent of your users are going to experience requests that take more than a second, because that one slow node is going to dominate, tend to dominate, that response time (the arithmetic is spelled out just after this paragraph). You have to get all of that data back, possibly merge-sort it, aggregate it. To maybe put this in perspective, I know a number of people from Ticketmaster, and I know Amazon has a similar architecture. You have this kind of situation where you have front end nodes that will call different services. You might have a recommendation service. That recommendation service might have to call a bunch of different databases. If you're using MongoDB you're almost always touching all nodes; in Couchbase, we have some ways of making sure that we do that efficiently. That sort of thing is normal, right? You'll have this microservices architecture where you're going to have a front end, go call different services, and those different services call back end services. What they laid out there is that it turns out, with that math, and again this is from 2014, so some five years ago, if your 99th percentile is one second, you're in pretty bad shape as you distribute out. Now, there's a guy named Rick Hudson who extended this a little further. He called it the tyranny of the nines, and that's actually a reference to something earlier that I can't remember off the top of my head. What Rick said was, all of these workarounds, referring to what Jeff Dean and Luiz Barroso had documented, things like making requests to multiple nodes at the same time and cancelling one if you get the response back from another, they had techniques like that; they also had cancellation from the node that receives the request first to another node that's going to handle it. So they had all these techniques to minimize the latency. He said all of these workarounds come from very clever people with very real problems, but they didn't attack the root of the problem, which was garbage collection latency. At Google scale, we had to tackle the root problem. Why? Redundancy for Google was not going to scale. So here he's referring to the redundancy of making multiple requests, which is what The Tail at Scale prescribes: that you can use a number of redundancy techniques, sort of fault tolerance techniques, to reduce latency, which is a really interesting idea. But redundancy was not going to scale for Google. It would cost new server farms. And then he had this, and this is directly out of his presentation from the International Symposium on Memory Management: at scale, redundancy costs a lot. If every time you make a request you're going to fan it out to multiple servers, and yeah, you're going to cancel one of them, and yeah, you'll sometimes run them twice, and so forth, it still costs a lot.
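To make that fan-out arithmetic concrete, here's my own back-of-the-envelope restatement of the math implied above (not a formula quoted from the paper): if each of the 100 servers independently answers within one second 99 percent of the time, then

$$P(\text{request takes} > 1\,\text{s}) \;=\; 1 - 0.99^{100} \;\approx\; 1 - 0.366 \;\approx\; 0.63,$$

so roughly 63 percent of fanned-out requests blow past what is only the 99th percentile on a single server.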
So this was in 2018, some four years after the original Tail at Scale paper. Rick referred to the tyranny of the nines, and the fact that at scale you still need to think about that response time percentile, and he lays out a prescription: you have to do everything that Jeff Dean and Luiz Barroso were talking about, and you have to work on reducing the latency in the tail. You can't target the 99th percentile. You probably have to target the 99.99th percentile in order to make that work, in order to reduce cost and be able to keep these things scaling. Now for me, at Couchbase, and I didn't really introduce myself particularly well: I'm a software engineering manager, a senior director of engineering at a company named Couchbase. I have a couple of teams that I work with, that I lead. One is the cloud engineering team, where we do things like the Kubernetes and CNCF kinds of integration. We have a Kubernetes operator for deploying Couchbase onto Kubernetes environments. And I should probably describe Couchbase really briefly for those who don't know. Couchbase is one of the leading NoSQL document databases, so we're a JSON document store: throw a bunch of JSON in, you can query it and so forth, and you'll see some of that in a demonstration in a bit. The other team I have is really the team that does all the developer front end, so SDKs and connectors and that sort of thing. So one of the frequent things that team will get is one of our users, or sometimes somebody with an enterprise support agreement who opens a support case, will come to us and say, hey, I'm having this problem with this app. The app's running fine most of the time and then every once in a while there's a timeout exception. I don't know how many people are Java developers, but it almost doesn't matter, right? In a big distributed system you start seeing timeouts. And so the user says, hey, I've got this timeout, what does this mean? For me, putting a little bit of humor into it, I guess: what does this mean? Well, it means more time went by than you allowed for in the operation. We started a stopwatch, we threw the request off, and then we didn't get the response in time, so we returned control to your application through an exception, right, in Java. And of course then the users are like, so should I turn the timeout higher? How do I fix this? And the answer is no, no, no, the default timeout's already two and a half seconds. It's plenty high to allow for garbage collection issues, different connection blips, that sort of thing, so you shouldn't need to turn that timeout up. So then the user would be like, so should I retry when I get a timeout exception? And the answer is no, no, no, no. Imagine that bit of code where every time you get a timeout you're just going to automatically retry, and then someday something really bad is going to happen, and now you're responding to a system that's not behaving well by making it do more work. And that's a race-to-the-cliff kind of pathological behavior. It's going to end up in a bad place. So my quip would be, no, you need to find the problem. And then frequently the question would be, how do I do that? And then it gets difficult, right? Frequently in the past a lot of the discussion was, well, that's kind of about correlation. Is the CPU high? Is there a lot of garbage collection going on? What kind of environment are you running in? What changed in the environment recently?
That sort of thing. Because it turns out, especially with something like a Java application and the SDK that we're running, it's this little library that's running on a thing called a virtual machine. Virtual basically means hide the details from the higher level, right? So it doesn't really know much about what's happening underneath. It's probably running in a container, on a public network, on an oversubscribed network, so there's just a lot of stuff, and it's very hard for that little library to know what the cause is underneath. Really all the library knows is that too much time went by, and so we return control to your app. So with that, let me talk about how we start to solve this. And my own personal opinion is that you just have to ask yourself: in the future, are things going to be more distributed or less distributed? And it looks like they're going to be more distributed. So all of these kinds of issues, what Jeff Dean was talking about in 2014 at Google, this fan-out to 100 servers, something like that is probably going to be relatively commonplace with even moderate system architectures going forward. So you really need another class of tools to be able to understand that, is my belief. So what is distributed tracing? If you think about it, monitoring what's happening inside a deployment is a pretty basic problem, right? We ask computers to do things for us. The computers should always respond correctly, but sometimes they don't. Or sometimes they respond correctly but they're kind of slow about it. And the problem is that from the outside it's a mystery, right? So we have things like alerts, so we can respond to changes in the environment. We might want to analyze events in aggregate, right? Go understand metrics so we can do things like capacity planning, understand where the workload is going to come from, maybe provision additional resources for certain times of day, something like that. And frequently we want to understand the sequence of events, which you do with logs, right? No! Don't use logs. Who here uses Splunk? Okay. Do you feel really efficient every time you go into Splunk and you're trying to trace something through that system? You're trying to figure out, okay, this happened here and then this happened over here and these things might be related, or maybe not, right? And so the reality is it's very hard to understand that sequence of events. Historically we would do things through a monolith, right? This kind of square up here on the left. You could imagine the request comes in here, goes through these different modules, and then it pops out somewhere over here. And if I did logging from this monolith, it all ends up in one nice little log, in order. But now in our distributed environment or microservices environment, I'm probably calling multiple of these services. In some ways this is not a very good diagram, because there's actually an indeterminate order in which that may occur. I guess that's true in a single process as well, but more often than not things are going to happen in an indeterminate order, and so it's going to be really hard to understand what happened. So every request touches a bunch of computers. Does anyone know the Leslie Lamport quote about this? It's one of my favorites. So Leslie Lamport, one of the people behind distributed computing, has a great quote.
A distributed system is one in which the failure of a computer that you did not even know existed can render your own computer unusable. So it's very true, right? One system fails and, hey, why am I getting error 500s here? I have no idea, because something way back in the back failed. So that's a distributed system. Anyway, every request in the modern world touches a bunch of computers. Somehow you have to glue all that stuff together. Once you're doing that gluing, you're basically tracing. So whether you're doing it with your Splunk logs or you're using a tool that's designed for tracing, you are tracing. So what do we really want with tracing? Really what we want is, assume the diagram here on the left hand side. You have a client making a request that's hitting some, we'll just call it a web server. And then that's going to call multiple services: auth, billing, database, right? And you're going to have a series of things that are going to occur. You have to go through authentication: okay, is this request valid? Who's it from? And then I have to go through the billing service, and I might do a couple of different things on the billing service, like verify it and process it. And then I may actually have to record that transaction back to some database or something like that, right? So this diagram on the left is kind of the logical block diagram of what we're calling, but really what we want to see from a user perspective, when we're tracing, with the trace visualization, is something closer to what's here on the right. So the topmost is: here's how long it took from the client's perspective. Here's how long it took from the web server's perspective. Here's how long the auth component took. Here's how long billing took. The billing service itself might actually have a couple of components to it, and we want to trace how long each one of those components took. So we kind of want this x-axis-of-time kind of visualization when I'm tracing what's happening inside a system. But I might want more than that. I might want a number of other bits and pieces to understand what's happening inside that trace. And so that's where Open Tracing comes in. So let me go through a little bit of an introduction to Open Tracing. Open Tracing is one of the top level CNCF projects, the Cloud Native Computing Foundation from the Linux Foundation, right? It is a vendor neutral open standard for distributed tracing. It defines an open source API, and I'll talk a little bit more about that in a bit, to standardize logical constructs and terminology. And then there are a number of different implementations. It turns out in CNCF there's also an open source implementation, Jaeger tracing, but there's Zipkin, there's LightStep, and anybody can go ahead and implement their own tracer. It's not a hobby project, right? Like many of these things, Google one day tossed a white paper over the wall and somebody went and re-implemented it in the real world. So the Google paper on Google's internal tracing is on something called Dapper. That came out around 2010. Open Tracing was first launched around 2016, and it's been built by a set of experts to try to fix this common problem. So it provides multiple language APIs for tracing and spans across app code.
So, instrumenting libraries: with Open Tracing, going back to the microservices example, frequently you'll have a number of different users who might have different operational styles. Maybe this team and this team have chosen different tools to understand what's happening inside their system. That can be a problem, right? If I'm the person running that grey box web server, I might want to trace across those two different teams. So really what I want to be able to do is have everybody operate to Open Tracing standards, and that way all of that data is accessible, and all I have to do is plug in the appropriate tracer in the appropriate place to be able to pull data together. There's also a fair amount of standardization around what certain things should look like. For example, HTTP requests should look like HTTP requests. Database requests should look like database requests. And then there's the W3C trace context, so the Open Tracing specification council and the W3C are working together to define some of the headers and parameters that would be passed across, whether it's in Open Tracing or in the W3C headers. So going a little further into it, you saw that idea earlier, and I've already used the word span, so here's a mental model. You can imagine what's happening here is that Service A is making a request to Service B, and Service A has a series of things that it's going to do, and we've got a few different concepts. So we have the concept of a trace. A trace is really just that recording of something as it moves through a distributed system, and it's represented as a DAG, a directed acyclic graph, of spans. So we've got a set of spans, and together they make a trace. A span is a named, timed representation of a piece of the workflow, but spans also have different things like tags and logs, and some of this has changed a little bit over time, so in some of the Open Tracing tracers you can have things like baggage. But tags are, for example, things like: which database am I accessing? Which user am I doing this on behalf of? So a span is really just a named operation with a begin time and an end time, and then I might have additional information that I tag onto that span. And then the span context is a set of trace identifiers injected in there so that the next service can understand it and tie it back to the previous service. So as you're seeing here in the diagram, you might have a parent span that then has a child span, and that in turn has a child span. So within this service, the service is maybe making internal calls, and maybe some of those internal calls are doing things like reading disk, or reading a cache, or something like that, but it may also make a network call across to another service, and it wants to pass that span context across the network to the span on the other side. And then there's certainly logging that may occur at different levels. In some cases, and we do this with Couchbase, you may actually glue the tracing back together from logs and from log context if you don't necessarily have a way to pass the span context from parent to child. So let's look a little further at that. Each span has an operation name, and that's something that the library author or you as the developer would define; a start timestamp; a finish timestamp; any references to other spans, typically to a parent, so you'll have a parent span context that's passed in; and a set of tags and a set of logs.
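To make those concepts concrete, here's a minimal sketch using the OpenTracing Java API. The operation names and tag values are purely illustrative, not Couchbase's actual instrumentation:

```java
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

public class SpanExample {
    public static void main(String[] args) {
        // Whichever tracer implementation has been registered (Jaeger, Zipkin, etc.).
        Tracer tracer = GlobalTracer.get();

        // A parent span: a named operation with a start and finish timestamp.
        Span parent = tracer.buildSpan("handle_request").start();
        parent.setTag("user.id", "matt");            // tags: key/value metadata on the span

        // A child span, causally related to the parent via a child_of reference,
        // carrying the parent's span context along.
        Span child = tracer.buildSpan("db_query")
                .asChildOf(parent)
                .withTag("db.instance", "customers")
                .start();
        child.log("rows fetched");                     // logs: timestamped events within the span
        child.finish();

        parent.finish();
    }
}
```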
So in this case, really what you're seeing, and I really should change this one day so that it's not MySQL but instead Couchbase, is a database query. In this case, it's a database query. What database instance are you using? In this case, the customers database. What is the statement? So those are the kinds of things that go into the span. And then you may have a message logged if there's something associated with that particular span. And then a start time and a finish time, and those are just the recorder's wall clock timestamps. Span relationships: Open Tracing currently has a couple of different kinds of references for causal relationships. One is basically who is the parent, and this is done as child of. So when you start a span, you may have a parent span identifier when you start that new span, and you're identifying yourself as the child of, you know, you've been invoked by this parent span. You might also have a follows from, which is where the parent span doesn't depend at all on the result of the child span. So you can imagine those kinds of things where you're going to make a call to a child, but you don't care about the response, and we want to be able to keep track of that, and any tracer might want to be able to keep track of that. Any questions on Open Tracing? Does that make sense so far? Okay, cool. And I'm going to do some demonstration here in a little bit. But let me shift gears really briefly and show you how we've used this inside Couchbase. At Couchbase, we contribute back to the open source community, so some of my engineers have contributed back to the Open Tracing projects. And then we've also taken Open Tracing and included it. We've instrumented most of the library right now, the client side, but we're starting to instrument some of the server side as well, to be able to understand how things are happening inside the system. So we've taken Open Tracing and integrated it into Couchbase, and then on top of that, we've built a feature that we call response time observability. So what is response time observability? It's a feature that is on by default with sensible defaults. Couchbase has a key value API, really just gets and sets of JSON documents, so we set a threshold for that, 500 milliseconds. Then we have N1QL, views, FTS, and analytics. N1QL is a superset of SQL, so you can query against this document store with SQL. There are views, which is an incremental map-reduce feature, and full text search. So we set thresholds for all of these things. What we do is, every 10 seconds, we collect up anything that's gone above that threshold, and then we take the top 10 of those and we log them at info level. Now that sounds like a lot, but obviously if you don't pass the threshold, there's nothing for us to log. Why do we log at info level? The reason we ended up doing that is, frequently what would happen, going back to the earlier conversation, is somebody would say, hey, I'm getting a timeout exception. And then we'd ask, well, what happened just prior to that, and what was happening inside the system? Do you have any telemetry? And always the answer was, oh, well, I haven't added any instrumentation. Or, actually, sometimes it's funny, they'll say, here are my Splunk logs. You figure it out. Oh, yeah, this is what I wanted. I have written tools to parse Splunk logs, because we don't have Splunk, and Splunk comes across to me very weird. It's like backwards, and JSON wrapped with some XML.
I don't understand it. Maybe somebody knows. But if you don't have the Splunk tooling, you basically have to build your own parser. So we wanted this to be on by default with a default sample size, so you don't have to do anything other than have a logger. I hope you have a logger at info level. Yeah, question. Yes, correct. So the question was, is this all in the client? For Couchbase, this is all directly in the client. What we also do is, the server will send us a duration, the execution time on the server side, and we pack that into the protocol, the KV protocol, which is really tight, very intentionally efficient. We pack that into just two bytes, so we can get a sense of how long something took on the server side, and then we have a way of doing additional correlation. Then we record all of those things into our response time observability. So to give you an example, threshold logging for us: if you were to look at the log, you would see something like this. Every 10 seconds you might see something like the KV service or the N1QL service. Obviously here the thresholds have been turned down a bit, or the count has been turned up a bit. So you'll see the top N for each service. We'll record things like: it was a get operation, what was the last operation ID, which host was it going to, what's the last local ID, and all of these are things that we get out of Open Tracing through things like tags. What's the encoding time, the dispatch time, the decoding time, anything having to do with server duration. Technically you won't have both encoding and decoding on the same op, but just to describe this really briefly: you can imagine that in that library the end user passes us, I'll use a different example, maybe it's Node.js, you take a JavaScript object and you pass it to the API. The API has to turn that into JSON. Yeah, not a lot of work, but there is some work, right? You have to basically go through a transcoder to convert that from one to the other. In something like .NET or Python there can actually be a lot more work, especially if that object is very complicated. So we want to record the amount of time taken on encoding and decoding. Decoding is when we receive it from the system, basically the get; encoding is when we're sending it to the system. Dispatch time, that's really the time it took on the network; and the server duration, that's the server's perception of the time from when it read it off the buffer to when it responded. Now, there is a little bit of a challenge there, because it can actually technically be on that buffer for a little while before the thread gets an opportunity to get to it. It may have been received by the host OS and not put into the buffer, not have the event called by the server side using libevent. So sometimes you'll see that there's some time in the span that may seem to not be accounted for, because there are things that we don't necessarily have span context to account for entirely, things like time on the network, right? But this is an example of what we'll see. We were also able to take this and go a step further and do what we call orphan logging. So you can imagine you make a request to the network and then that timeout occurs, so we return control to the app. But maybe the response comes in right after the timeout, right?
So that would be a good thing to know, because we still want to look at that duration from the server's perspective to understand how long it actually took on the server side. So this also is on by default. This is not actually directly built on top of Open Tracing, because Open Tracing isn't really intended for this, but we can get a certain amount of information from it. So we might see, as an example, a lot of orphaned requests coming from one particular node. So here's an example of what orphan logging might look like. In this case it's the KV service, we have GETs, so this is a GET type operation, and we record a little bit of context. Just to break this down: when we started this process, we were saying we'll use IP address and port number to do correlation between client and server. Except then sometimes there's a network address translation in the middle, and there's maybe a difference in how the host name is looked up, and all kinds of things that can make it a little confusing. So really what we did is we created a correlation ID, which is the C here. The portion before the slash — in the last presentation, the last time I presented this, somebody said, you don't really need that slash because they're fixed size. Yeah, I guess that's true. We could have saved one byte. But there's a correlation ID, and this breaks down to the instance of the client, because on a given host you may actually have multiple instances of applications. You may have multiple processes, or you may have multiple instances of libraries in a single process, so we wanted to be able to correlate to that, as well as correlate to a connection ID on the cluster side. So if something goes wrong on the server, as an example — some of this was inspired by a particular user who uses Couchbase very heavily, and they wore out their SSDs. I kid you not. Their SSDs were wear leveling. They got to this place where suddenly SSD latency was a little higher, but it was only on particular nodes, and it was on a large deployment, and so it was very, very hard to find, because you can imagine these requests come in and they might touch three different systems on a 100 node cluster, and it's pretty random whether or not you're going to hit that particular slow node and that worn SSD, the worn section of the SSD. So it was very, very difficult to find. Things like this correlation ID make it really easy for us: on the server side we can log additional details about that slow I/O operation, and then we can correlate that back to a client request. So I'm actually going to go past this super quickly, but it enables us to do things like this: we'll see patterns like, if we see timeouts but there are no orphaned responses, and then the connections eventually drop, that means there's some sort of network issue. So we can discern and intuit a lot of the cause pretty easily from what's happening inside just that single log. If we see timeouts mapping up to an orphaned response, and all orphaned responses are from one node, then that tells us that node is probably slow, right, that there's something going on on that node, like the SSD wear leveling case I mentioned. If only some timeouts map up to an orphaned response, but orphaned responses are coming in from multiple nodes, and the orphaned responses all have very small durations relative to the timeout value, that probably means there's something going on on the client side.
It could be — we've certainly seen situations where people overcommit virtual machines, and I remember one end user, this was a Java environment, and looking at the Java garbage collection logs you'd see ParNew took five seconds. ParNew should never be going for five seconds. That basically meant that something else was not letting that JVM be scheduled on a CPU. So we can see that very easily. Maybe only some map up to an orphaned response; that could be a different kind of environmental problem, like transparent huge pages being enabled on the server. So, a little bit more about tracing specifically, and I'm going to do a couple of demonstrations to show you how this works with both an Open Tracing implementation, Jaeger tracing, and our internal response time observability. In the first demo I'm going to try to discern which operations are slow just from the regular tracing, and in the second one I'm going to try to trace a much more complex interaction on a system. So with that, I'm going to drop into a live demo here. What I have here is a simple Spring Boot application. It's kind of noisy here. Let's see, how do we clear this? Anyone know which? Not that one, not that one. Well, we can clear it like this. We'll restart it. So this is just IntelliJ. I could have done this with Node.js; we have Open Tracing support across Python, Node.js, Ruby, C#, all of our official SDKs. Some of those were actually sort of complicated, because our Python one, as an example, actually embeds libcouchbase, the C SDK. So for us, spans will start in Python and they finish in Python, but we want to trace the span down into the C library, into the time spent in the C library. So this is just a simple Spring Boot application. If I were to show you what's going on in the app: it has the RestController annotation on the class. What that means is that I can just take any method, specify some request parameters, and map them to a URI. So there's a greeting, there's an itineraries, there's airport info; airport info, as you can see, is really boring. Itineraries is probably the most interesting one. And somewhere in here is a login, I think. Am I missing it? Or maybe that is the greeting. Request mapping, post, login, request method, param name, yeah. No, here it is, yeah, login. Oh, and this one, we didn't specify a URI, it just maps straight to login because the name of the method was login. So what I want to do is generate a workload. In order to generate a workload, I'm going to do the thing that every typical developer will do, which is just pop into curl and do a bash while loop. So, while true, do curl, name equals, so I'm passing a parameter, the name of the user. And what you'll see here, of course, is it's going to take that name, it's going to do a get, it's going to retrieve that user from the documents. It's just doing, you know, building a string. Then it's going to stick the login time in there, in seconds, I guess. And then we're going to upsert that; we're going to rewrite that document back to the system. So actually before I do that, I'll just do a single one. So you'll see it came back true, and that gives me an opportunity to pop over to Couchbase. So if we were to go over to Couchbase, this is the Couchbase UI. If you install Couchbase, you would have this. So you can see we did one operation, or two operations: we did a get and a set. So two ops in the last minute.
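As a rough sketch of what that login handler is doing under the hood, here's my reconstruction using Spring and the Couchbase Java SDK 2.x API. It is not the actual demo code; the document key and field names, and how the Bucket gets injected, are assumptions for illustration:

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;
import org.springframework.web.bind.annotation.*;

@RestController
public class LoginController {

    private final Bucket bucket; // assumed to be wired up elsewhere, connected to travel-sample

    public LoginController(Bucket bucket) {
        this.bucket = bucket;
    }

    @RequestMapping(value = "/login", method = RequestMethod.POST)
    public boolean login(@RequestParam("name") String name) {
        // Get: retrieve the user document by key (key format is hypothetical).
        JsonDocument doc = bucket.get("user::" + name);
        JsonObject content = doc.content();

        // Record the login time in seconds, then upsert the document back.
        content.put("loginTime", System.currentTimeMillis() / 1000);
        bucket.upsert(JsonDocument.create(doc.id(), content));
        return true;
    }
}
```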
If I were to go to the buckets and look in travel sample and documents, there's a user. There it is, login time. Not super exciting. It's a simple little document. So if I were to go back and update it again, then I should be able to re-retrieve it and it should be updated. It didn't actually show it as updated, but yeah, time passed. So that's what we're going to do. And what I've done here, to ensure that we hit the response time observability, you'll see there's a little bit here where I wire in our particular tracer. If you don't do anything, out of the box it automatically has that. One of the things that we contributed to Open Tracing is the ability for libraries to have a default tracer and still be able to swap out to another tracer. And then we'll swap that out for Jaeger and we'll show Jaeger. So really briefly: while true... That looks a little funky. Yeah, there you go. While true... done. Okay, we got lots of trues. We should also have, over here, because I've turned the threshold very low, if things are working well... I've got the info level on. Wait a minute. I've messed up already. Anyone figure out what I did wrong there? The tracer I'm using is Jaeger, and I really want to use response time observability. So, sorry about that. So let's run that now. Because, you know, I did both demos before I started this session to make sure that they ran correctly. So, oh, I bet the terminal is really unhappy right now. It's function 12. Yeah, okay. There are probably a bunch of faults in there at some point. Oh, connection refused for a little while. Not surprising. But now we should have, if we look at the bucket, bucket travel sample, we should be able to look at the stats. Doing 70-some ops a second, 75 ops a second. Not too bad, because we've got some terminal IO here. But more interestingly, we should have, because we've turned the threshold way down... Yep, sure enough. We've got ops over threshold. So here we're seeing the top N operations. Here the op name is upsert. The server time was 76 microseconds. But the total time, where is it? Total time was just over one millisecond, because in the app I went ahead and tuned the threshold all the way down to a milli, here, right there. Yep. So you can kind of see what's happening here in the threshold log tracer, that's what we call it internally: turned it down to a milli, just good for demonstration purposes. So now we know that that works. But really, the idea here, again, is that if you install Couchbase and you do nothing, we give you some information just out of the box. But really what we want to be able to do is... we're going to change the terminal. Good. That's good. So now we should... sure, our load dropped off. So now I want to change the tracer type. We're going to go from response time observability to Jaeger. Jaeger, in my case, is running locally on the laptop. I'm running Couchbase locally on the laptop, Jaeger locally on the laptop. Jaeger is running in a Docker container. Jaeger is an Open Tracing tracer. So in the select tracer method, there's just a little bit of configuration saying to write to Jaeger; it's roughly the kind of thing sketched below. And I could do a whole presentation on Jaeger tracing, which is not what I'm aiming to do in this particular environment. And we're going to hit that same URI. So this time, instead of doing the internal logging to the in-process tracer, it's going to send it to Jaeger. So I shouldn't have closed the terminal.
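For a rough idea of what that "little bit of configuration" looks like, here's a minimal sketch using the jaeger-client-java Configuration API. Only the service name comes from the talk; the sampler and reporter settings are my assumptions for a demo setup:

```java
import io.jaegertracing.Configuration;
import io.jaegertracing.Configuration.ReporterConfiguration;
import io.jaegertracing.Configuration.SamplerConfiguration;
import io.opentracing.Tracer;

public class SelectTracer {
    static Tracer jaegerTracer() {
        // Sample every trace and log spans as they are reported (handy for a demo).
        SamplerConfiguration sampler = SamplerConfiguration.fromEnv()
                .withType("const")
                .withParam(1);
        ReporterConfiguration reporter = ReporterConfiguration.fromEnv()
                .withLogSpans(true);
        // "simple-read-write" is the service name that shows up in the Jaeger UI.
        return new Configuration("simple-read-write")
                .withSampler(sampler)
                .withReporter(reporter)
                .getTracer();
    }
}
```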
While true, do, let's see... I'm going to cheat, I always forget how to do a post. Yeah, that's it. curl, data, done. There we go. Okay. So now again, if I pop back over to Couchbase, I should have a workload around the same number of ops a second. But this time, I pop over to the Jaeger UI. So this is one thing I didn't demonstrate: in the app over here, in trace me, you'll see that when we configured Jaeger, we gave it the service name simple read write. It was just a name; you've got to have a name for the service. And with that service name, in this case, it's doing log spans. I should probably turn that off, now that I think about it. I did that from a debugging perspective. And so here in the op itself, in login, you're seeing that we're not actually doing anything, which is kind of interesting. That's because Couchbase underneath is going to do all of that tracing. It'll create a span if there is no parent span. But you as an application developer can actually pass in your own parent span if you're doing more work. So if you're going to go grab a couple of things from Couchbase, do some work with them, and then send them back to the user, or something more complicated, you can specify your own parent span and pass that into Couchbase. I'll do that in the next demo. So now if I pop over to Jaeger. Jaeger, like I said, I could do a whole presentation on Jaeger. Jaeger is built on Open Tracing and has a number of tracers, has a front end, has different storage back ends. For purposes of this, maybe we want to look at things that have a minimum duration of one millisecond and we want to look at 20 results. And if I do a search, what you see is, sure enough, earlier this morning — sort by most recent — things are off a little time-wise, aren't they? Oh, 12:10 today. Okay, six minutes ago. So that's right. Why is it showing... what's that? Oh, yeah, should be fine. But what I'm confused by is it seems to have put these in UTC time or something. That doesn't make any sense to me. Anyway, point being, you have a set of traces, and you can see the duration here on this y-axis, so I can look at, hey, why was this one slower than the others? This one was over three milliseconds, actually really quite high. In this case, it's probably quite high just because I'm running everything here locally on a laptop, so it helps for demo purposes. But I can see, from the client's perspective, the dispatch to server: the client's perspective of how long it was on the network, which it may not have actually been on the network that long; it might have been in buffers, might have been in a Netty buffer, it might have been waiting for CPU time or something like that. And then the server's perception of how long it took to actually service that request. So very, very quick. And if I were to drill into one of these, dispatch to server: this has a set of tags. I can tell which version of the client is being used, what's the host name, all that fun stuff. And then the response decoding; I should have a set of information there as well, the Jaeger version that's in place. But let's do something a little more complicated. So that gives you a quick sense of what you can do with Jaeger. Jaeger has a pretty neat architecture. You run a little sidecar local to the processes. It can actually push filters down. It's very efficient. Jaeger's been developed at Uber. Yuri, who's one of the guys behind Jaeger tracing, is also on the board of CNCF.
We've worked with him a fair amount on actually trying to get Couchbase integrated with Jaeger. If I do another search, sure enough, we've got a much slower one out there. Now this one's kind of interesting, right? It took 5.65 millis. Only a portion of that was request encoding and dispatch to server, and it doesn't add up to the whole thing. So I can intuit from that that I probably got kicked off a CPU and put back on, or maybe it's time in a buffer or something like that. Not a big deal, because it was only one outlier, but you can drill into those kinds of things. So now let's do something more complicated. We're done with the trues. There's another one called itineraries. Just to walk through this for a moment: in itineraries, what you see here is the request mapping is to this URI, itineraries. Itineraries is going to create a new tracer, and then it's going to create a new span called query and fetch, and right here is where it declares the starting point. It also has this parameter that says to finish the span on close, if you don't do anything else with the scope from that one. Then I'm going to do something kind of complicated. So inside Couchbase, there's a service called N1QL, pronounced nickel. N1QL is a query service. With a N1QL query, I can issue something that looks like a SQL query. It's actually a superset of SQL, but because I'm going for maximum performance, really what I've done here is I've created a really simple query that is just retrieving the key from the travel sample bucket for any routes, limit 1000. Then, based on each one of those, I flatMap in a lambda, and within that flatMapped lambda, I'm going to go do a get, and I create another span here to do the get of that ID, and then map it into a JSON document class, and all of that is going to execute in parallel. Because we have a streaming parser, as the rows come back from the cluster, I'm going to start key value fetching the individual items, and I'm creating spans for each one of these, and then the parent span gets passed into the Couchbase library, and the Couchbase library creates other spans for encoding and decoding and for the dispatch time. So if all of that works well... oh, and then at the end I close it and I return the results. So really what I want to do is curl, I guess I can just do localhost 8080, itin-er-ar-ies. There we go. Okay, so that worked. So now if I pop over to Jaeger, I should have something more interesting here, because I'm not only going to have the simple read write, but I have jaeger-query, so I can find traces. And if I have a look... oops, sorry, that's Jaeger's own internal one. I think I need to refresh this and I'll have the operation, give it a couple. Service is still simple read write. Oh, that's correct. What I should have is simple read write, upsert. What did I call this thing? I should have query and fetch as an operation. Find traces. Simple read write, operation all. Still using Jaeger, right? I'm just going to have a quick look at... if I look at... ooh, now it's getting... oh no, uh oh, uh oh. It may have overwhelmed the machine. I was too zealous with my running in a loop. So IntelliJ is not liking life right now. Yeah, IntelliJ has gone out to lunch. I should have had a throttle on that. And I can't get the window back. Hmm, we'll try this. See how long it takes to quit. Let me see if I can get an older one from earlier.
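While the IDE recovers: for reference, here's a rough sketch of what that itineraries handler looks like, reconstructed with the OpenTracing API and the Couchbase Java SDK 2.x async (RxJava) API. The query text, span names, and error handling are simplifications of the demo code I'm describing, not a copy of it:

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.query.N1qlQuery;
import io.opentracing.Scope;
import io.opentracing.Span;
import io.opentracing.Tracer;
import java.util.List;

public class Itineraries {

    static List<JsonDocument> queryAndFetch(Tracer tracer, Bucket bucket) {
        // Outer span for the whole operation; finished automatically when the scope closes.
        try (Scope scope = tracer.buildSpan("query_and_fetch").startActive(true)) {
            return bucket.async()
                    // N1QL query that just returns the document keys for routes.
                    .query(N1qlQuery.simple(
                            "SELECT META().id AS id FROM `travel-sample` WHERE type = 'route' LIMIT 1000"))
                    .flatMap(result -> result.rows())
                    // As rows stream back, kick off a KV get for each one in parallel,
                    // with its own child span under the query_and_fetch parent.
                    .flatMap(row -> {
                        String id = row.value().getString("id");
                        Span fetch = tracer.buildSpan("fetch").asChildOf(scope.span()).start();
                        return bucket.async().get(id).doOnTerminate(fetch::finish);
                    })
                    .toList()
                    .toBlocking()
                    .single();
        }
    }
}
```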
So if I were to say... shoot, I don't think I'm going to have an older one, because I restarted it. So here's what you would have seen. We'll see if we can get it back. But what you would have seen is an outer span that is the query, and then you see that even before the query span ends, we start sending out the gets. And so you see this kind of staircase pattern of fetching a thousand documents underneath, and then as the responses come back, I can see the span of each one of those. Since it has to... oh, thank you. Thank you, Oracle. I will skip this version. So while I try to get that back, let me try to finish up the slides really quickly. A couple of things. Where are we going from here? The Open Tracing community is working with the W3C on trace context. Adding more back ends to Jaeger tracing, that's the one that we at Couchbase are contributing to right now. With Open Tracing, the versioning is also a little confusing. Right now at Couchbase, we're updating to Open Tracing 1.1. So there's the Open Tracing specification, which has been 1.0, 1.1, now 1.2. 1.2 adds an additional correlation ID if you can't directly pass trace spans. And so now some of the tracers are getting updated; the Open Tracing API gets updated, and then the tracers take on those features. The reason I mention that is, if you were to go grab the Java version, I'd have to look today, but it was something like 0.9.3 that actually implemented Open Tracing 1.1. So you have to look at the version numbers a little carefully. And we're doing some work in that area. My request to you folks is, well, one thing is, if you want to join the community: OpenTracing.io, the Gitter discussion. The Gitter discussion is quite open, quite good. My thanks to Ted Young of LightStep for a set of the slides here from the Nauskan workshop. My thanks to Ted, Yuri, and the Open Tracing community, and then Mike Goldsmith and Charlie Dixon of Couchbase who have been doing some of this engineering work. So I'm happy to take any questions, and I will try to do questions and retrieve the demo at the same time. So, questions. Yeah. So the question was, there was a lot of additional code I had to add to pass the trace information along; is there anything being done to reduce that? Actually, if you look at the login one, there's none. The only thing that we had to do was initialize the tracer, and the only reason I did that, in the case of the response time observability, was to turn down that threshold. Obviously if you have another tracer, you do need to pass that in. Now, that's actually a really excellent question, because sometimes you want to trace into something that you don't control. So there's, in fact, Ted told me to mention this — Ted, if you're watching, sorry, I forgot to mention it in the slides — there's one thing that's being worked on which takes the Java built-in tracing and allows you to hook that into existing apps. It's called the special agent, the Java special agent. So you can basically take that, plug it in, and even if you have something that, unlike Couchbase, doesn't have top-level API support for tracing, you can trace into it. But like in my login case, my greeting case rather, there is actually no Open Tracing code. The only reason I had to do anything was to change the threshold. Other questions? Platform? Yeah, so the question is, what in Java provides this? In this... how old? I'd have to check. I believe Java 8.
I know I'm using Java 8 in this case. In this case, I'm using Spring Boot just for something to use. It doesn't have to be Spring Boot. The only real restriction, of course, is if you are using Open Tracing, you're going to have to have all of the permissions set to be... It doesn't matter if you're using the JRE or the JDK. Should be fine. So let's see if things are still running. Okay, curl itineraries. And then let me check really briefly. Come on, IntelliJ. View, tool windows, where's my... Oh, now it's in a weird state. It's like half in presentation mode, half not in presentation mode. Let's see, where's the run window? Yeah, check that out. Very weird. Okay, let's see if that worked though. Pop up to the Jaeger UI. Refresh. Everything's good. I should be able to find... and I cannot. Let's pop over here and run this guy. Yeah, it's failing to run because there's another process out there. Which one? Probably... I'm not sure which one that is. Sure. Yeah, the question is, can I talk a little bit about why I chose Jaeger versus Zipkin and LightStep? Mainly, it's open source and the fastest to set up in a demo as well. I really like the stuff the folks at LightStep are doing. For example, they even allow you to trace down into what's happening from a mobile device all the way back to the edge server, all the way back to the database and so forth. So you can actually start a trace all the way down there. There's a ton of functionality that those guys have, and they've been driving the Open Tracing community forward. Jaeger has a number of people behind it. There's actually a relationship between Jaeger and OpenZipkin. The Jaeger folks started with Zipkin, but they ended up rewriting kind of the back end to deal with some scalability limitations, and then they ended up rewriting the front end, and so it turned into Jaeger after that. And meanwhile, the Zipkin community has taken it from Zipkin to OpenZipkin and done a fair amount of work. So there's a little bit of a software fork there. Datadog also has an implementation; we've worked with the Datadog folks to make sure that they could interpret what's coming out of the Couchbase Open Tracing tags. So a lot of good stuff going on there. No particular reason other than I could docker run it. LightStep, I don't know what their source license is, but I don't believe it's fully open source. But they've definitely been contributing to Open Tracing as far as open source goes and driving that part of the community. Yeah, question. What are the requirements for running Jaeger? So the... there it is. Kill 7190. Still running. So the requirements for running Jaeger: it can run with an in-memory back end, but obviously that's not going to do anything terribly interesting for you. It currently supports both Cassandra and Elasticsearch back ends, so you would have to have that infrastructure. And then there's a Jaeger UI component, and then there's the Jaeger sidecar component. The sidecar is something that, say, if you're running in a cloud environment, would run on the pod. So what will happen is the application uses UDP to talk to the Jaeger agent locally, and then that gets forwarded up to the back end, and then any analysis reads out of the back end. We're actually working on getting Couchbase as part of that as well. So there's some plug-in work happening inside Jaeger so that you can plug in different storage back ends. So then it's going to be basically, you know, Jaeger is a Go app that has different back ends.
So the requirements are pretty modest in that case. From my perspective, we've found it really useful to run it even in development environments, to be able to pull up the in-memory back end and do tracing when we're just doing things at development time. Of course, as Couchbase, unfortunately we don't have a really interesting production deployment of our own. We get to use yours. So I look forward to getting more out of those. Yeah, so the question, I'll simplify it a little bit. I think the question is: there are things like application performance management suites, so New Relic, AppDynamics, those sorts of things, and then there's Open Tracing; how do these overlap, I guess, would be the quick way to put it. The agent-based APMs provide a lot of value add. They can instrument a lot of what's going on inside the JVM, and as you recall, earlier I mentioned that correlation is actually an important part of what's happening. So, my own personal opinion: I think it makes sense for there to be a blend of both. Some of the APM vendors, I won't name names, have felt a little threatened by Open Tracing, an open interface that others can play within. Yeah. And hey, there are open source implementations. But I don't think it has to be that way. I think there's room for them to use Open Tracing. Their argument would be that there's a problem with Open Tracing in that it's API-based and not network-protocol-based, and that makes it hard from an integration perspective, because you can't actually do multiple things at the same time. But then my counter argument to that is, agent-based is even harder, because now I have to shove in an agent, and if I have a third-party library like Couchbase, I have to have an agent for Couchbase, versus having an open standard. So I think it's going to be a while until we figure out how all that stuff maps out, but the reality of the matter is that, as things get more distributed, we're going to need something like this. So let's give this one more shot here, even though I know we're at time, apologies. What? Option function 12. There we go. Okay, so that's working. Spans are getting logged. So if everything works well, then I should have over here... aha! I got the demo back. Find traces. So query and fetch, 3,003 spans. This one's pretty cool, right? What you see here is this outer span, this is the query and fetch. This is the N1QL query, and you'll see that the dispatch-to-server time of the N1QL query is fairly short, but then it takes a lot longer because we're doing 3,003 of these spans. Why 3,003? For each row we have the get operation itself, then we have the dispatch to the server and the response decoding. So it's three Open Tracing spans for each of the thousand rows that come back from N1QL, plus a few for the query itself. But because it's a streaming parser, we're able to start the gets themselves underneath, and so you'll see that it's going to Couchbase, it's a KV client, what the op ID is, what the key I'm retrieving is, etc. So, let me make it a little more interesting but not fork bomb my laptop this time. Option... So I always try to run in presentation mode in IntelliJ, but to open some of the terminal windows you have to remember the function key thing, so sorry about that. So, while true, do, sleep... sleep one... sleep three, done.
Okay, now we should have a regular workload going on, hopefully not fork bombing. Now if I pop up here, I should be able to do find traces, and I should have, sure enough, some other really interesting ones already; it took a lot longer for that particular operation to run. A bunch of these are the fetches underneath, so these are the fairly quick ones, under 200 millis, and then this one is a lot larger. Should still be running, right? So with that, I'm over on time. Thank you very much. I'm happy to stick around and answer any other questions. Okay guys, this is working. Guys, give me a moment, I'm trying to figure out how to get this going. The laptop recognizes it, but this doesn't. I think I'll grab somebody for the AV setup. Okay, while we're waiting for the whole AV setup to happen, I'm just curious: how many of you are interested in network telemetry as a specific problem, or are you more interested in telemetry as a general problem? How many of you are interested in network telemetry as a specific problem? Okay, a couple... oh, most of you, that's interesting. Okay, then that's good, because then I will know to focus on the network-telemetry-specific aspects. That's an option if you guys want to do that, but I don't know if everybody... everyone's pointing out you could just bring up the presentation on your... oh yeah, okay, on the schedule. Is that something we want to do to get started? Bring up the presentation on your device and then we can get started. Okay, we'll improvise. Okay, now the one small thing is the slides you will see are slightly different, because I edited them at the last moment. Not too different, I hope, but you know. Should we go for this? Just bring up the slides on your devices; improvise, adapt. Yeah, my laptop recognizes the projector, but the projector does not recognize my laptop. My laptop is showing that I have two displays, but the projector wants to show a default display. Yeah, I mean my laptop sees the projector. It says there's a VGA display, but I don't know, the projector doesn't seem to like my laptop. Something high resolution maybe, that's a good point. Actually, I've selected default for the display. I think the projector is actually switched off right now. Thank you, thank you. Awesome. Can you all hear me, guys? Almost? Yes, I think so. Yes, the mic's fine. Okay. Awesome. Thank you. Okay, guys, let's finally get started. So, I'm going to go a bit fast because we've lost 11 minutes. Okay, good afternoon, everybody. My name's Varun Varma. I'm here to talk about Panoptes, which is an open source network telemetry ecosystem. And before we get into the details, let's talk about the problem that we were trying to solve. It's a fairly straightforward problem: we want to collect, store, analyze, and visualize network telemetry. Now, I took a quick poll of people who are specifically interested in network telemetry, but for the audience at home or other people who are not familiar with the network telemetry space, here's the 10 second primer. Network telemetry is essentially the collection of metrics and state from network devices. The dominant protocol is something called SNMP, Simple Network Management Protocol; v2, which is one of the most common versions deployed and operational in production, was defined in 1993, so it's coming up on 26 years, and it's unencrypted, transmitted over UDP. It's old. But that's what it is.
However, the industry is actually moving towards network devices exposing APIs, or even allowing you to run agents on the network device itself. And one of the newest things coming up in the industry is streaming telemetry, where the devices have the ability to send telemetry to a collection system themselves. Now, a lot of you might be wondering: how is this problem different? Network devices have been around for a long time, SNMP has been around for 26 years, so what makes this problem different? To address that, I'm going to take a slightly different approach. I'm going to talk about some of the criteria we had while choosing a network telemetry system, and basically why, based on those criteria, we could not use any of the many great existing products which are already out there. So, in our specific network we actually have a very high rate of change. What that means is that we're adding and removing devices every few minutes, and what that means for us operationally is that we cannot choose a system which requires any sort of manual configuration. We cannot go with a system which requires you to write a config file, even if there's some automation, some Ansible or Chef thing that can rewrite the file, if the system needs a restart or a push every time a device comes in and out of production. The other thing is there are some primitives which tend to be unique to network telemetry. A couple of examples: in network telemetry you very often come across the fact that you are collecting counters from network devices. Counters tend to be monotonically increasing numbers; one example is the number of bytes transferred over an interface. The network device reports it as an ever-increasing counter, but for it to be intelligible and meaningful, what you actually have to do is convert it to a rate. You want to know the bytes per second being transmitted over an interface, not an increasing number of bytes. Also, there is this idea of enrichments; I'm going to talk about enrichments in quite a bit of detail later. Those are a couple of examples of problems which I would not say are specific to network telemetry, but are more prevalent in network telemetry. The other couple of things are actually fairly generic. We wanted complete decoupling of collection, processing and storage. Some of the systems we looked at out there, in an attempt to be an easy-to-use, out-of-the-box system, actually couple all of these together, and that doesn't work for us, because we need to be able to scale each of these layers independently, or do something interesting with the data at each of these layers. And finally, and this is going to sound a bit religious, we wanted this to be in Python. There was a very specific reason for this: Python is essentially emerging as the lingua franca of the network operations world. A lot of people who come from a network operations background are most comfortable beginning their first scripting journey with Python, and we always wanted people to be able to contribute to this platform. So we wanted something that has a lower barrier to entry than a strongly typed language.
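To make the counter-to-rate primitive concrete, here's a minimal sketch of turning two successive counter samples into a bytes-per-second rate. This is not Panoptes code; the function name and the 64-bit wrap constant are assumptions for illustration.

```python
# Illustrative only: converting two successive SNMP counter samples into a rate,
# the way the talk describes (bytes in/out reported as ever-increasing counters).
# Assumes a 64-bit counter for the wrap handling.

COUNTER64_MAX = 2 ** 64

def counter_to_rate(prev_value, prev_time, curr_value, curr_time):
    """Return bytes/sec between two counter samples, handling a single wrap."""
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing timestamps")
    delta = curr_value - prev_value
    if delta < 0:                      # the counter wrapped (or the device rebooted)
        delta += COUNTER64_MAX
    return delta / elapsed

# e.g. two polls 60 seconds apart:
rate_bps = counter_to_rate(1_000_000, 0, 7_000_000, 60)   # -> 100000.0 bytes/sec
```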
In addition to all of this, there were some other additional complexities that we had. So, in the telemetry world, the thinking is that being able to push metrics scales much better than having to pull metrics. And that's actually true: you can build much more scalable systems if you distribute all your agents and then those agents push their metrics to a high-performance async-IO or event-based collection system. But the problem is that in the network industry, or the network telemetry world, we are limited by the fact that we have to pull. We actually have to go out and collect data from these devices, because a lot of the devices we have simply cannot run agents, or do not support streaming telemetry out. The other aspect to this is that polling these devices is very expensive on network devices. What tends to happen with network devices is they have these dedicated chips which are very good at forwarding packets very fast, but they have fairly small control plane processors, and telemetry typically goes through the control plane processors. The industry is trying to change that, trying to make it so that the telemetry comes from the data plane rather than the control plane, but frankly, as of right now, that is the reality. Now, when you try and poll these fairly small processors for network telemetry, it's actually really expensive on the device. So you have to really optimize this, and by expensive I mean to the extent that it can potentially cause an outage on the device itself. So you have to figure out a way to optimize that as well. Ours is a very, very large network, and the network's been around for a long time. We keep refreshing our gear periodically, but the reality is we have a lot of diversity in terms of vendors, in terms of platforms, in terms of OSs. That's another problem that we had to solve for. And then finally, just scale: scale in terms of the number of devices, the number of interfaces that we have, the number of regions that we actually have to cover. So when we looked at existing solutions, both open source and commercial, and we evaluated all these criteria, for us each one of them fell short on one or more of these criteria, and we decided to build our own. And that's where Panoptes comes in. What Panoptes is, is a green-field development that we did. It's Python-based; we built it at Yahoo, which is now part of Verizon Media, and it essentially provides real-time telemetry collection and analytics, and it does it by implementing different mechanisms of discovery, enrichment and polling; it has a distribution bus and it has consumers. I'm going to go into the detail of each one of these during this talk. So with that, let's talk about what the architecture of Panoptes looks like. But before we get into the detailed architecture, let's talk about the system requirements that we had. Based on all the problems that I mentioned earlier, here were our system requirements. We wanted a system that was able to collect telemetry using multiple mechanisms, SNMP of course being a key player, but we wanted something that is capable of collecting through APIs or through the CLI or through streaming, whatever is most appropriate for the class of device that we're trying to collect telemetry from. The next one was very important: horizontal scalability. This, surprisingly, was a litmus test on which a lot of tools or existing systems actually failed for us. The idea was very simple.
We wanted to be able to put in more Panoptes boxes, and the more boxes we put in, the more capacity we should get. It's as simple as that. But behind that very simple idea, what we needed to ensure was that we actually have load balancing and failover within the system itself, within Panoptes that is, and that we don't have any single point of failure: no golden host or special host performing functions such that, if it goes down, the system stops until it comes back, either through manual intervention or automation. Those were two very specific design criteria that we had. The other thing, and I mentioned this earlier, is that a lot of tools, to simplify this problem, integrate everything. They integrate the collection, the analytics and the visualization into one solution, and that's good, because it does make deployment and adoption easier. But the thing with us was, we did not know all the ways we would want to consume the data that we collect. We had a couple of use cases for sure, but what we did know is that as time goes on, we would want to use the same data in different ways. So that was a criterion from day one: we did not want to have a fixed set of inputs and outputs. And then finally, surviving network partitions. It's a network telemetry system, so there is a lot of value in sending telemetry to a centralized system, because that's where you can do the richest reporting and analytics, when you have a global view of your network. But the reality is there are network partitions. There are times when a data center is not reachable and it just cannot talk to the centralized telemetry collection system, and our system had to be designed to operate even in that mode. So with those design goals, this is the architecture that we came up with. At the foundation you will see a bunch of well-known open source technologies. Celery is a very popular distributed task management system for Python. We have Redis, which we use for various purposes: we use it for caching, we use it as a general key-value store, and we even use it to back our task queues. ZooKeeper we use for a couple of things: distributed locking and leader election. And then Kafka is our data distribution bus. So we began with these fundamental infrastructure components, and on top of that we built a plugin framework, because, remember, our aim is to be able to collect data in different ways. There are three plugin classes that we have: discovery plugins, enrichment plugins and polling plugins, and I'm going to talk about these in detail. Then on top of that we have device-specific plugins, because talking to, for example, a Cisco device is different from talking to a Juniper device, and even within Cisco, different platforms or different OS versions require us to talk to the device differently. And then what you see on the side is some adjacent systems that we depend on which are not part of Panoptes. There is a time series database; we actually integrate with two different time series databases at this time. There is a CMDB, the configuration management database, which for us is the source of truth of what to actually monitor. And then configuration management to set all of this up.
Now, in the Panoptes world you bring your own for these three infrastructure services: if you already have a company-wide time series database, you can feed that; if you have a CMDB, if you have a configuration management system, you bring your own, essentially. Here are some of the detailed framework requirements we had when building this platform. What did the platform need to be able to handle? Configuration parsing, because unless you can handle configuration you can't get the system up and running; logging; plugin management; work queues; a message bus; distributed locking; and a few things like that. We then chose a set of technologies to address each of these problems. Python was our language of choice. For configuration parsing we use something called ConfigObj, which is a popular Python module that does schema validation for configuration. So you write your configuration file, you write a schema which says what a valid configuration file looks like, and you don't have to write code to ensure that the config you're getting is actually valid. For logging we just depend on the built-in logging facility that Python provides, which has a fairly industry-standard, log4j-style configuration. For plugin management we use something called Yapsy, which is yet another Python plugin framework. It's simple, but it worked for us very well. For work queue management, Celery, and I'll have a bit more to say about Celery later. For the message bus we use Kafka, with the kafka-python module to actually talk to Kafka. For distributed locking and leader election we use ZooKeeper and Kazoo. Kazoo is a Python library for ZooKeeper, and a word of caution here: it's sometimes tempting to talk to ZooKeeper yourself and do locking on your own. Don't do that. Use a well-understood library, because there are a lot of corner cases in leader election and distributed locking which you don't want to handle; really smart people have already taken care of this. For persistence we integrate with two time series databases, OpenTSDB and InfluxDB, and we also store point-in-time data in MySQL. We don't store time series data in MySQL; we store the last known value of a data point in MySQL, and I'll talk about the use case for that later. For caching we use Redis, with redis-py, which is a Python module to interact with Redis, and for federation we use MySQL and Django, and I will talk about federation in detail. Here are some of the core concepts that make up Panoptes. First and foremost is this idea of plugins. Plugins are essentially Python classes: there's a well-defined interface, you inherit from a class, and you're expected to produce a data structure at the end of a run. A plugin can do anything it wants, so it's really extensible; there's no limit to what a plugin can do. Their function typically is to collect and process data, and there are all these data sources that plugins can get data from. There are three types of plugins: discovery, enrichment and metrics.
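A quick aside on the Kazoo point from a moment ago: here's a minimal sketch of what distributed locking and leader election look like with the library, rather than hand-rolling the ZooKeeper recipes. The ZooKeeper address, znode paths and worker identifiers below are placeholders, not the paths Panoptes actually uses.

```python
# A minimal sketch of distributed locking and leader election with Kazoo.
# The ZooKeeper address and znode paths are placeholders, not Panoptes paths.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zookeeper.example.com:2181")
zk.start()

# Distributed lock: only one worker at a time polls this device.
lock = zk.Lock("/demo/locks/device-switch01", identifier="worker-1")
with lock:
    print("holding the lock, safe to poll the device")

# Leader election: the elected worker runs the callback while it holds leadership.
def lead():
    print("I am the leader; scheduling discovery runs")

election = zk.Election("/demo/election/discovery", identifier="worker-1")
election.run(lead)   # blocks; runs lead() whenever this client wins the election

zk.stop()
```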
The other core concept in Panoptes is that of a resource. A resource is essentially an abstract representation of something that has to be monitored. It could, for example, be an API endpoint; in the specific context of network telemetry, resources tend to be network devices, literally the name of the network device that has to be monitored. Resources are discovered, and they're discovered using discovery plugins. We actually discover our resources by querying our configuration management database, which contains a list of all the network devices that should be on our network. However, with some security caveats, we could also discover resources by doing topology walks of the network. There are some security aspects to that: you don't want to just notice a neighbor of a network device and randomly start polling it, handing information to a rogue device if that's somebody's intention. Resources have metadata associated with them. This again is a fairly powerful idea. For example, one of the metadata fields that we have consistently across all resources is the site, the physical site the resource is located in, and that simple piece of information lets us generate a lot of reports, like, for example, show me all the traffic in North America, things like that. And they have an ID, which is unique, and a few details like that. Now, because resources are a core concept and we have a wide variety and diversity of resources, we also implemented a DSL to select and filter resources. This DSL is essentially a subset of SQL, and as this example shows, we're selecting all Cisco devices which, in our network, are configured as switches and are running a specific OS version. The reason we have to do this is because, like I said, plugins have to be written to talk to devices differently if the OS version is different or the platform is different. Next up is the idea of metrics. A metric is essentially a number that can be measured and plotted; it's as simple as that. Metrics are typically fast-changing, which means between two polling cycles you would ordinarily expect the number to change. It might not change, and that's valid too; for example, error counters: that's a number you do not want changing between two poll cycles. But the point about metrics is that they have the potential to change fairly quickly, and again, you can query metrics through any of these different mechanisms. Then we have this idea of enrichments, and this is actually interesting. Enrichments are metadata, in addition to metrics. In the example I gave for resources, one of the metadata pieces we add is the site. So now when I collect a metric, for example the bytes in and bytes out from an interface, I actually annotate it with an enrichment saying which site the device that interface belongs to is in, and therefore it lets me do an aggregation query, for example: show me the traffic across all the interfaces in this site or in this geolocation. Enrichments can be of any data type. Metrics, remember, are just numbers; enrichments can be strings, dictionaries, lists, or any other even more complex data type if you'd like.
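To make these concepts a little more tangible, here are some purely illustrative data shapes for a resource, a resource filter, and an enriched metric. These are not the actual Panoptes classes, and every field name here is an assumption for illustration only.

```python
# Purely illustrative data shapes for the concepts just described; not Panoptes code.
resource = {
    "resource_id": "switch01.dc5.example.com",
    "resource_class": "network",
    "resource_type": "cisco",
    "metadata": {"site": "dc5", "role": "switch", "os_version": "6.4.2"},
}

# A SQL-like resource filter in the spirit of the DSL mentioned in the talk:
resource_filter = (
    'resource_type = "cisco" AND metadata.role = "switch" '
    'AND metadata.os_version = "6.4.2"'
)

# A metric annotated with enrichments so it can be aggregated by site later:
metric = {
    "resource_id": "switch01.dc5.example.com",
    "metric": "interface.bytes_in",
    "value": 1_234_567,
    "timestamp": 1546300800,
    "enrichments": {"site": "dc5", "interface_name": "TenGigE0/0/0/1"},
}
```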
The other interesting thing about enrichments is that they don't necessarily come from the device we're polling. In the example of geolocation, when you poll a device, the device might not know which geo-region it is in. Another example, which is fairly network-specific, is ASNs, autonomous system numbers. When you poll a device you get a number; now, there is a mapping from that number to a human-readable name, but that's not stored on the device, that comes from a different source. So enrichments can go beyond just data collected from the device and actually stitch in data from different sources. Because enrichments can be of different types, they tend to be more expensive to process than metrics. Even something like a string, a simple enrichment which is a string, will typically have much more processing overhead than a number. They also typically involve more complex transformations: numbers you would typically pick off the device and report, whereas with an enrichment you would do some sort of processing. So what we do is we collect enrichments at a lower rate than we collect metrics. One example of this: when we're collecting bytes in and bytes out from an interface, we collect those metrics once every 60 seconds, but the name of the interface we collect once every 30 minutes, and the reason is that the name of the interface is not a fast-changing quantity. Once you name an interface it can change, but it doesn't change every minute. And we cache the enrichments. So essentially we have this enrichment cache, we poll our metrics frequently, and we stitch them together to produce a complete time series. Overall this lets us be far more efficient and scale much further. A word about data encoding and distribution. Panoptes was designed from day one as a distributed system. The three major subsystems we have, discovery, enrichment and polling, are decoupled, and that again lets us scale each one of them individually and introspect each one of them individually. We use Kafka and Redis, one of those two, sometimes both, to pass data between these different subsystems; think of those as our IPC mechanism. And again, like I said, this helps us troubleshoot and pinpoint problems in the workflow, because we can introspect the data at every stage. Now, we use JSON to encode all the data that is passed between these different systems. It's extremely inefficient, but it's developer-friendly, and I actually have quite a bit to say about performance in the upcoming slides. With all of that context, here is what the workflow essentially looks like. We collect data. We might or might not do some post-processing on it; most of the time we do. We place it on the message bus, and then once it's on the message bus we actually have two different flows. We have one flow which sends the data to our time series database, and we have multiple use cases hanging off that: we generate graphs, we do our alerting, and in our specific case we also send that data to a Hadoop grid and use that for analytics and reporting, non-real-time essentially. The other flow is that we also store data in an RDBMS, we build purpose-specific APIs, and I'll give examples of that later, and we hang custom UIs and CLIs off of those APIs.
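Going back to the enrichment cache for a second, here's a toy sketch of the idea: metrics are polled every minute, enrichments are refreshed rarely and cached, and the two are stitched together at emit time. The in-memory dict stands in for Redis, and the poll functions and interval values are illustrative, not the real plugin code.

```python
# A toy sketch of the enrichment-cache idea described above. Not Panoptes code.
import time

ENRICHMENT_TTL = 30 * 60          # refresh interface names every 30 minutes
_enrichment_cache = {}            # {resource_id: (timestamp, enrichment_dict)}

def poll_interface_names(resource_id):
    return {"ifIndex-1": "TenGigE0/0/0/1"}     # stand-in for a slow SNMP walk

def poll_interface_counters(resource_id):
    return {"ifIndex-1": {"bytes_in": 123, "bytes_out": 456}}   # fast 60s poll

def get_enrichment(resource_id):
    cached = _enrichment_cache.get(resource_id)
    if cached is None or time.time() - cached[0] > ENRICHMENT_TTL:
        _enrichment_cache[resource_id] = (time.time(), poll_interface_names(resource_id))
    return _enrichment_cache[resource_id][1]

def poll_cycle(resource_id):
    names = get_enrichment(resource_id)                    # usually a cache hit
    for if_index, counters in poll_interface_counters(resource_id).items():
        yield {"interface": names.get(if_index, if_index), **counters}
```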
So what about scaling and operations? This slide shows our numbers in terms of orders of magnitude. We can't share actual numbers, but in terms of what the system currently handles in production: we handle tens of millions of time series, hundreds of thousands of network interfaces, from tens of thousands of network devices across hundreds of sites, and we do this once every 60 seconds. Some of the scaling issues we faced: basically, we knew that we would have scaling issues. That was not a surprise, not a "hey, we didn't realize this." So we did build this to be horizontally scalable from day one. In my opinion, performance and high availability are choices you should make as early in the design cycle as possible, because they tend to be difficult to add on later. Python is developer-friendly and very, very slow. So one of the techniques we used was to delegate high-throughput actions to C extension modules. One example: the actual act of talking SNMP to a network device is done through a C library, not through a pure Python implementation. JSON is just as slow, but again, it is developer- and operator-friendly; anybody can hook up a Kafka console consumer and actually see the data flowing through and troubleshoot what's going on. At our scale, we kind of broke everything. We put everything in production and each of these systems would break. Redis by default only allows 10,000 connections, and we've hit that limit many times over. That's just a reality. Now, a word about federation. One of the key system requirements that I mentioned was surviving network partitions. So while we feed our data to a centralized network telemetry system, a centralized time series database, we also need the ability to have colo-local data in case of a network partition, in case one data center cannot reach the other. We still need to be able to give engineers at least a minimum amount of data that they can troubleshoot any production issues with. So the way we do this is that in each of these data centers we actually have a full MySQL cluster, and we store the last known metrics there, a subset of them, the interesting ones which we know are needed for troubleshooting. We store those in each data center, and what we replicate across data centers is not the data but the metadata. What that means is you could query the API servers in data center one for data in, say, data center five. Data center one will tell you, hey, I don't have this data, but I know who does, and it will give you back a sort of referral to where you can actually query for data center five's information. The client can follow this referral and complete the query. Now, in the normal course of operations, to the end user this looks like one big, gigantic data source, because the follows and redirects are transparent to the user. However, in the case of a network partition, you will get the referral but you will not be able to follow it, because you might not be able to reach DC five from DC one. And that's where our engineers have the ability to go into DC five through an out-of-band network and query the last known data point. So this is essentially how we tackle both surviving network partitions and the fact that replicating full sets of data across all data centers is a very expensive operation.
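Here is a minimal sketch of that referral flow: an API server either returns the locally stored last-known value or a referral URL pointing at the data center that owns the data. The endpoint paths, field names and the metadata index below are hypothetical, not the actual Panoptes API.

```python
# A minimal sketch of the federation referral flow just described. Hypothetical names.
import requests

LOCAL_DC = "dc1"
METADATA_INDEX = {"switch01.dc5.example.com": "dc5"}   # replicated metadata, not data

def read_local_mysql(device, metric):
    return 42   # stand-in for the last-known value stored in the colo-local MySQL

def lookup(device, metric):
    owner = METADATA_INDEX.get(device)
    if owner == LOCAL_DC:
        return {"status": "ok", "value": read_local_mysql(device, metric)}
    # Not ours: hand back a referral instead of proxying the (possibly unreachable) DC.
    return {
        "status": "referral",
        "url": f"https://telemetry-api.{owner}.example.com/v1/metrics/{device}/{metric}",
    }

def client_get(device, metric):
    answer = lookup(device, metric)
    if answer["status"] == "referral":
        # In a partition this request fails and the engineer falls back to the
        # out-of-band network; in normal operation the redirect is transparent.
        return requests.get(answer["url"], timeout=5).json()
    return answer
```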
This is the list of devices and platforms that we cover. I think the list is complete, but I don't know, there might be more. So we have interface metrics for these platforms, and we also have generic interface metrics, so if you point the interface polling plugin at a device not in this list, it will still give you some reasonable amount of data; but if it's one of these devices, it will annotate and enrich the data more. We have system metrics for these platforms, and we have functional metrics. Overall, I want to say we have about 30-odd plugins already. Here are some of the operational experiences we had. The single biggest issue in building this: scaling and resilience were one thing, but the bigger issue was the fact that the devices and platforms we collect data from aren't consistent. For example, one device will report the memory used while another device will report the memory free, and we did not want to just hand the user that raw information; we wanted to normalize it. So the biggest thing for us was actually working across all these platforms, where sometimes on the same platform a different OS version would report metrics differently, and making sure we normalize them so that the user gets a consistent view. SNMP has a lot of faults: performance, security, complexity, a lot of issues like that, but the fact is it's ubiquitous. Also, here is the reality: we are a multi-generational network. We have devices from various generations in our network, multi-vendor, and SNMP tends to be the lowest common denominator. That's something we just have to work with. We also support using APIs for some devices, and in our operational experience the performance of APIs is far better than trying to collect data over SNMP. And then finally, we made this call to use Kafka for data distribution, and it turned out to be the right call, because we already have three different consumers which take the same data but do something slightly different with it, and in the future I expect this only to grow. One other interesting thing: though we place all the data that we collect onto a Kafka bus, we don't actually expose the raw data. There are other teams, you know, friends-and-family teams, teams in the infrastructure group within our company, which are interested in the data we produce, and it is sometimes tempting to say, okay, just hook up your own consumer to this Kafka bus and you'll get all this data. The problem with that tends to be that then we can never change the internal mechanisms. We can't change the serialization format, because now there's an external team that's dependent on us. We wouldn't even be able to change the Kafka client or the version of Kafka we run, because some team is running a client which is older than what we need to upgrade to, and we're stuck. So the approach we took instead was to expose this data through APIs. We don't expose the raw data; we wrap an API around it as a level of indirection. What that allows us to do is change the internals as long as we keep our API consistent, so it decouples that problem. We also push all our data centrally, even though we have federation. Like I said, centralized storage is very valuable, because it lets you have a global view of what's going on, and it is also very important for reporting and analytics. So we store all our data in a company-wide time series database which is built on OpenTSDB. It's a system that almost everybody in our company is familiar with, and we don't have to spin up our own time series clusters, or more hardware essentially, to store that.
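Going back to the memory-used versus memory-free point, here's a tiny illustration of the kind of normalization being described, where different platforms report the same thing differently and the user should only ever see one consistent shape. The platform names and raw field names here are invented for illustration only.

```python
# A tiny illustration of the normalization problem described above. Invented names.
def normalize_memory(platform, raw):
    """raw is whatever the device-specific plugin scraped; return a common shape."""
    if platform == "vendor_a_switch":          # reports used + total
        used, total = raw["mem_used"], raw["mem_total"]
    elif platform == "vendor_b_router":        # reports free + total
        used, total = raw["mem_total"] - raw["mem_free"], raw["mem_total"]
    else:
        raise ValueError(f"no normalization rule for {platform}")
    return {"memory_used_percent": 100.0 * used / total}

print(normalize_memory("vendor_a_switch", {"mem_used": 6, "mem_total": 8}))
print(normalize_memory("vendor_b_router", {"mem_free": 2, "mem_total": 8}))
# both print {'memory_used_percent': 75.0}
```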
So storing our data in that company-wide system gives us economies of scale. But in addition to the central UI and the central telemetry system, we've also built purpose-specific UIs. I'm going to give a couple of examples of that later, and those were immensely valuable. And this goes back to one of the points I made earlier: we did not know exactly what the different use cases on the data would be, but we knew that we would have different use cases. So this is another example of that. Now let's talk about performance. I apologize, some of this is going to be Python-specific, but that's what it is. Anybody who's worked in system administration or infrastructure or distributed systems knows that throughput is essentially a product of speed and the amount of parallelism. Now, as much as nobody would like to be in the lane on the left, the reality is that lane actually has more throughput than the lane on the right; that's the generally accepted view. However, I would make the case that we need to start looking at throughput differently. I think we need to look at throughput as a product of speed, parallelism and developer productivity. The reality is, if you can make developers more productive, you can iterate faster and you can possibly solve any performance or throughput problems that you have through faster iteration. That was actually a very big motivation for us using Python. And this is one of my favorite quotes; there's a link, and the article summarizes a lot of my thinking in terms of why Python, why not, for example, Go, which is much more performant. It says: optimize for your most expensive resource. Anybody want to take a guess at what that currently is? People, exactly. It's developer resources, and that's what Python is really good at. I think we're beyond the point where CPU and memory are the expensive resources. You have to be careful, and I'm not saying throw caution to the wind; especially if you're building a large system, you can't just say, okay, we'll solve all problems by throwing more CPU or memory at the problem. But you have to strike a balance between over-optimizing that and compromising developer productivity. Let's talk about what we actually did to scale in both directions, vertically as well as horizontally. By vertically, I mean what we did to speed up the system. Profiling. It sounds obvious, but the thing was, this is a large, complicated system, and we had hunches about where the big pain points were. The obvious thing was that collection would be the bottleneck, because it's polling-based, we have to fan out across multiple processes, we have to go talk to devices which might not respond in the right amount of time, and it might sound tempting to start attacking that problem. But when we profiled, JSON schema validation was our slowest operation. Just fixing that sped up our system a lot. For Python, I would suggest beginning with the basics: actually read the official Python documentation on performance. Just basic things like list comprehensions and built-ins will speed up your code quite a bit. I wanted to mention two specific tools, and again this is Python-specific, so my apologies, but I did want to mention two specific tools that we found very useful in finding performance bottlenecks.
Python is horrible on both CPU and memory. For CPU we have a tool called cProfile, which has been built into Python since 2.5, and which lets you profile what's going on in terms of function calls: what's getting called the most, and how much time each of those operations is taking. And there's something called objgraph; the other thing Python is pretty bad at is memory management, and objgraph lets you figure out what's going on in terms of memory consumption in Python. Quick example, this is right out of the official documentation: you just import one module and you can profile your code, and it can tell you the number of times a function is being called, the total time, the per-call time, the cumulative time. And there is an accompanying library called pstats, again part of the official distribution, which lets you do simple but very interesting slicing and dicing, so you can actually see, sure, this function is getting called less frequently, but it takes a lot of time when it's called. Very simple slicing and dicing reveals very interesting performance bottlenecks. Next is objgraph. This, again, I think is from the official documentation. objgraph lets you literally print out pictures of your class and object dependencies. In Python, one very common reason for high memory utilization is not releasing object references, and you don't realize that an underlying object reference has not been released: you think you've released the reference to the object that referred to the underlying object, but that underlying object is never going to get garbage collected. This lets you visualize what's going on. The other thing, and I mentioned this earlier, is C extension modules. The raw speed something like C gives you is fairly unbeatable, and this is the example I wanted to give: if everybody looks at the very first line, where operations are being done with nine digits, this is a comparison between cdecimal and decimal. decimal is a pure Python module for working with high-precision decimals, and since we are dealing with numbers, that's something we did need, and cdecimal is a C implementation of the same. Actually, as of Python 3.3, cdecimal is the default decimal implementation in Python. But just as an example, look at the performance difference of manipulating just nine digits of precision in the built-in decimal module, which is the third column, versus the cdecimal column. That is the kind of performance difference you can see using C extension modules versus pure Python modules, and this was one of the issues we had. The next one is something very, very Python-specific. Properties are a way in Python to write very clean, readable code, and sometimes all you're doing in a property is reading the state of an internal variable and returning it, and that's easy. However, if your property requires you to go out and talk to something external, or do some computation and processing, consider using cached properties. It's a simple decorator that you can put on top of any property method, and, by the way, only use it if your property is not expected to change on every invocation. It basically runs your property method the first time, caches the result, and the next time you call it, it returns the cached result. None of the rest of your code changes, so that's something interesting.
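One way to do what was just described is the standard library's functools.cached_property (available since Python 3.8); the talk doesn't say which cached-property decorator they actually used, so treat this as just an illustration. The expensive_lookup() helper is made up.

```python
# Illustration of the cached-property pattern described above.
from functools import cached_property
import time

def expensive_lookup(device):
    time.sleep(1)                      # pretend this is an SNMP walk or an API call
    return {"os_version": "6.4.2"}

class Device:
    def __init__(self, name):
        self.name = name

    @cached_property
    def facts(self):
        # Runs once per instance; subsequent accesses return the cached result.
        return expensive_lookup(self.name)

d = Device("switch01")
d.facts    # slow: does the lookup
d.facts    # fast: cached, no second lookup
```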
Let's talk about scaling horizontally. We used Celery. Celery is a very popular distributed task queue for Python. It lets you scale across processes, CPUs and hosts. How many of you have heard of the GIL problem with Python? So the GIL is the global interpreter lock. It is the bane of Python performance; it's the reason multi-threaded Python performance is really bad on CPU-bound applications, because essentially, literally, there's only one thread running at a time even though you have a multi-threaded application. What Celery lets you do is distribute your tasks across processes and even across hosts. There's actually a very interesting read, I'm forgetting the author's name, where he makes the case that Python doesn't even have to fix the GIL problem, because Celery has solved it for Python. So if you're looking at building a scalable Python system, consider Celery. Just a couple of observations. You could build a very fast, high-throughput system, but you also have to choose dependent systems that are fast and horizontally scalable. I would definitely encourage people to always think in terms of horizontal scalability. Don't think, if I add more CPU and memory, can my system do more; think, if I add more hosts, can my system do more? So choose dependent systems which can horizontally scale with you. The other thing is, when you're comparing two systems, compare with the full feature set. Sometimes it's not possible to know all of this beforehand; you only know the features you want later in your development cycle. But if you have a sense of what you will need, try to do an apples-to-apples comparison. One specific example I want to give is SQL versus NoSQL. We use Redis and MySQL, both. By the time we considered implementing joins, AAA, clustering, TLS and indexing on top of Redis, we were like, nah, let's just use MySQL; it's actually going to be faster. So sometimes what seems fast is fast because it's not doing a lot. That's just something to think about. Some of the areas we have not yet explored for performance are Cython, which compiles Python code down to C, asyncio, and using even more C extension modules. For example, the module we use to connect to Kafka is a pure Python module, and if that ever becomes a bottleneck, we will try out modules which use C libraries to connect to Kafka. Where we want to go next is to incorporate streaming telemetry into Panoptes. That's where the industry is moving. It has various benefits: less load on the control plane, more accurate timestamps on the telemetry, a whole bunch of benefits. Basically, our idea is that given the resource cache and the enrichment cache we have, we add a streaming telemetry collector which can receive data from devices rather than poll them.
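Going back to Celery for a moment, here's a minimal sketch of the fan-out idea: enqueue device polls as tasks and let workers on any host consume them, instead of fighting the GIL in one process. The Redis broker URL and the task body are placeholders, not the actual Panoptes task layout.

```python
# A minimal Celery sketch of distributing polling work across processes and hosts.
from celery import Celery

app = Celery("telemetry_demo", broker="redis://localhost:6379/0")

@app.task
def poll_device(hostname):
    # In the real system this would invoke a device-specific polling plugin.
    print(f"polling {hostname}")
    return {"device": hostname, "status": "ok"}

if __name__ == "__main__":
    # Enqueue work; any worker on any host consumes it, e.g.:
    #   celery -A telemetry_demo worker --concurrency=8
    for host in ("switch01", "switch02", "router01"):
        poll_device.delay(host)
```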
So, a couple of pretty pictures. I mentioned APIs; we actually have two types of APIs in the system. We have these real-time, purpose-specific APIs, and I know this can be really hard to read, but this is an example of an API which exposes data for load balancers. And on the right we have bulk slash historical APIs; this is an example of an OpenTSDB API. There are pros and cons to both. The real-time APIs are really performant, and they tend to be easier to use because they're purpose-specific: the person or the developer querying them knows exactly the question they want to ask. But we build them to return only the last known data point. Versus the bulk and historical APIs, which can fetch historical data, can even do aggregations, and can work across pretty much any type of data source; but the downside is they tend to be slow and you kind of have to learn a new query language or format to actually use them. Here are a couple of systems that we built on top of the data that we collect. On the left is a mashup UI; this actually pulls data not just from Panoptes but from eight other data sources to give a device dashboard. And on the right is a very specific UI we have which helps troubleshoot one specific network element. And guys, since I mentioned that we have open sourced this, you can try Panoptes if you're interested. You can go to this link and actually get Panoptes, and here's what you will get: a Docker container, and the moment you deploy that container on a host that has an SNMP agent running on it, you'll get interface monitoring for that host immediately, out of the box. We've only open sourced a couple of plugins so far; there are a whole lot more that we plan to open source, but it's still a very functional and very valuable system right now. We use InfluxDB as the time series database in the open source version, and we use Grafana as our dashboarding system. This is what the dashboard for the open source version looks like; so if you download the Docker image and deploy it, you'll get this dashboard out of the box. Finally, you know, we'd love your feedback and contributions, so go ahead and try it, help us find and fix bugs. And here's another interesting thought I'll leave you with: there's nothing in Panoptes that actually makes it network-telemetry-specific. There are enhancements in Panoptes which help do network telemetry at scale, but there's no reason you can't write plugins to monitor your favorite thing. That's another thing to think about. A quick shout-out to the team: for development and operations we had Ian Holmes, Wade, James, Malcolm, Sean and Jess, and we had our executive sponsors in Ian Flaynter and Ego Kishansky. And, you know, the important thing is it takes a lot to build and open source something like this; commitment from the executive team is actually a very important part of making this happen. Thank you guys for sitting through the AV problems, and I will take any questions that you have, and I finished on time. The most painful part is the diversity in the metrics. Whenever we try to go and poll a new device or type of device, we think it should be simple, and we're not even trying to do anything fancy, just simple metrics like CPU, memory, temperature, things like that, and it's always a surprise. Either the device does it differently, some are just plain buggy, and documentation is bad, so we struggle a lot thinking that we're doing something wrong and then finally conclude, no, we're right, the documentation is just wrong. It's always in anything we've not built, you know, because what we've built is perfect. Any other questions? I know I rushed through a lot of this because we lost 10 minutes, so sure, federation? Yeah, let me bring up that slide. So for federation, the first point I want to make is that federation isn't the only mechanism we have. We do send the data to a centralized system, right? And that's actually, I would say, the most common workflow we have.
People tend to look at the centralized dashboards most commonly. However, like I said, since we have to survive a network partition, what we do is store this data locally as well, in addition to sending it to a centralized system. But the problem with that approach is: how do you know how to get to a particular piece of data, data X? What we didn't want to do was build another discovery service, where you query that service and say, I want to get data for device X, who should I talk to? What we chose to do instead was, we realized that while the data is voluminous, the metadata is not. So if the servers in data center one know about all the devices in data center five, then anybody can hit data center one and say, I want metrics for device X in data center five. And what that server knows is, okay, I have an index of where this data is; I don't have this data, but here is the pointer. It literally returns a URL, and the client library then says, okay, I've got this URL, I'll go fetch the data from this link. Again, that's the normal flow. We also have something called an out-of-band network, which is very extensive. What we have in all these data centers is, if we lose the main routers or the main connectivity there, we actually have backup links which are only used for management; they're not used for normal data traffic. So the data centers are never completely partitioned off, and what network engineers can do is get into the data center over these links. They're obviously much lower-bandwidth links than our main links, but engineers can at least get to the data center through these out-of-band links, and then they can query the local data store, because they know which devices are in that data center. They can say, okay, I'm interested in device X, which I know is in data center five, so I'm going to go to data center five through this out-of-band link and query the local data store there. So that way we get both scaling and, in the normal course of things, this appears like one big, single system to the end user; but when there is a partition, we are still able to retrieve that specific data point in that data center. Now, we don't store all historical data within a data center. We have a desire to do so, but we currently don't. The only thing we store is the last value of each data point, so it's more a troubleshooting tool than an analytics and reporting tool. An engineer can get in and ask what the current state of the device is, not what the historical state of the device was. Any other questions? Just to remind you guys that you can get one of these. Yes, we have open sourced the discovery, enrichment and polling plugins for interfaces. Those plugins are already open sourced. We have 30-odd plugins in total which we basically have to clean up and open source. But we'd love you to write plugins; that's one of the best ways to contribute. And this is Python code, nothing special, no secret sauce there. Thanks, guys.

Test, test, test, test, test. So we'll give folks about one more minute. Alright, it's three o'clock, let's go ahead and get started. Got a couple more people coming in. So, welcome to Latency SLOs Done Right. I'm Fred Moyer. I'm a developer evangelist at Circonus, which means I'm basically an engineer who talks to other engineers. That's my Twitter handle down there on the left: Fred Moyer with a P-H.
And we've got the hashtag for the conference down on the right. I want to give a shout-out to my colleague, Heinrich Hartmann, who's a data scientist; he originally did a blog post on the material I'm about to present. So let's go ahead and get started. But first, does anyone here not know what an SLO is? A service level objective? Ah, good, folks know it. So who here thinks that latency is important? Fair amount of people, about what I'd expect. Most folks run a web service of some sort, so it's a fairly critical business metric. You know, Google's done all those studies: if your page doesn't load in three seconds they actually will penalize you, and people tend to leave if your site doesn't load fast. So now that we know that latency is important: for the last month, do folks here know how many of their requests to their service, any part of it, came in at under 500 milliseconds? Does anyone know that offhand? I mean, if they had their monitoring tools in front of them, could they get that? A fair amount of people. How about 250 milliseconds? Same thing, kind of. I don't expect everyone to have an exact answer on hand, and this is a pretty specific question, but I think most people are used to having some understanding of how fast their service runs. But how would you answer that question for an arbitrary metric? Do your monitoring tools give you the capability to glean that information from your system? I mean, there are lots of monitoring tools out there that folks are familiar with, so you probably think, yeah, I can answer that question, I've got the tools to do that. But I'll ask you: do you think your answer is correct? And in terms of correctness, how accurate do you think it would be? What level of precision can you get with your tools? There's a wide range of tools out there, but often you have to take a look at the answers that they're giving you and ask, are these really right? So, I'm Fred, and I like service level objectives. I'm a developer evangelist at Circonus. I've been writing code for production web systems, and also breaking them, for about 20 years. I've been a Perl programmer for about 20 years, and I duck, I don't want to get things thrown at me, but I've also written a fair amount of C, and I'm writing a fair amount of Go too. So that's a little bit of background about me. I'll go over the talk agenda really quickly. We'll do a service level objective refresher, and then I'll talk about a common mistake that people make when working with percentiles and using them to quantify service level objectives, how their service is performing. Then we'll take a look at three different approaches for computing service level objectives: first, with log data; second, by counting requests; and finally, we'll look at how we can use histograms to calculate service level objectives. And if we've got some time left over, I can hop into the Jupyter notebook and show some of that firsthand. So, service level objectives. Google came out with the Site Reliability Engineering book several years ago; it was a big hit. The concept of SLAs, service level agreements, has been around for at least a decade, but SLOs have been floating more to the surface recently, and there's been an increasing amount of online content growing around them. There are a few different definitions of them; Google has their own definitions of SLOs, which I'll get into.
But if you look at, say, Wikipedia, you're going to find a slightly different definition there, and you'll see others at some of the conferences on the circuit. But most site reliability engineers right now probably have some familiarity with all these terms: SLI, the service level indicator; SLOs; and SLAs. Two additional books: Seeking SRE has come out recently, and the Site Reliability Workbook is another good resource on these. Seeking SRE... is that whine coming through for you folks? Okay, I'm going to switch to this other mic. Test. Okay, that's a little bit better. So, Seeking SRE: there's a really good chapter by the guy who got me into this stuff, the Circonus CEO, Theo Schlossnagle. He's got some fairly in-depth definitions in there of ways that you can quantify SLOs using time quanta and a couple of other methods. And, as I mentioned before, there are a few different ways that you can define your SLOs, but you're not Google. Anyone here from Google? No. So your needs are going to be different, and really what you want to take away from this is how people are defining SLOs in relation to their business: what are the terms and conditions they can use that help them keep their business up and running and keep from violating their SLOs. There was a good video put out by Seth Vargo and Liz Fong-Jones, their Twitter handles are up there, on YouTube: SLIs, SLOs, SLAs, oh my. I may have gotten the order wrong a little bit. But they talk about these different terms and how they're defined. One of the themes there is that SLIs drive SLOs, which inform SLAs. That's not always the case; if people are starting from SLAs, those can drive SLOs. But anyway, we'll run with the context that they presented in the video. So an SLI, a service level indicator, is a measure of the service being quantified. For example, I could have a service level indicator that says the 99th percentile of latency of home page requests over the last five minutes should be less than 300 milliseconds. And if you've used a tool like Nagios or Zabbix or any of the other monitoring tools out there, you've probably seen graphs that plot 99th percentile latencies for different parts of your application. We'll talk a little bit about how the information those present to you is not always correct. Service level objectives: a service level objective is basically how we take a service level indicator and extend its scope to quantify how we expect our service to perform over a strategic time interval. And this is often a metric that is used by, say, an SRE manager to say, I need to tell folks how we expect this service to perform. So, drawing on the SLI that we talked about on the previous slide, we could say that our SLO is that we want to meet the criterion set by that service level indicator for three nines over a trailing-year window. So it's basically taking that SLI and saying we're going to meet it for this amount of time. Pretty simple. Service level agreements: probably everyone's familiar with those. That's a legal agreement, and it's generally less restrictive than the SLO which the operations team is accustomed to delivering.
And most often it's crafted by lawyers, and it's a means to avoid risk as much as possible. When drawing up contracts for delivering service, the lawyers will say, we're going to give you this level of service, except we're going to give ourselves as much leeway as possible, because when SLAs are violated, which happens, bad things result. Customers notice and they try to get money back from you; executives call meetings and folks get called onto the carpet about why the SLA couldn't be met. I've been in a few of those meetings. And here's the kicker: if you do not have an SLO, your SLA by default is your SLO. So your internal reliability targets are now your external reliability targets, which is not a good thing. There's a reason we separate SLOs from SLAs. One, the SLA is a target we don't ever want to miss; we don't want to have to give people back money if our service goes down. The other, the SLO, is a realistic measure of what we can achieve but might not always accomplish. And when we talk about things like error budgets, we want to be able to take risk with deployments, and we expect a little bit of downtime so that we can move quickly. I'm not talking about moving fast and breaking things; I'm talking about understanding what our levels of reliability are and using those metrics to determine how much risk we can take. Because we do have to ship features. We've got to ship the product, and if you're so risk-averse that you can't even do that, your business is going to fail for other reasons. So that was a quick run-through on SLOs. I encourage folks here to go out and read those books I've listed, but remember that SLOs are tools for the business and should be tailored appropriately to those use cases. So let's take a look at a common mistake that folks run into when using monitoring tools to quantify those: averaging percentiles. Who here has averaged percentiles? Okay, I've done that too. It's probably the single most common mistake when working with latency metrics. Why is that? Part of it happens because averaging percentiles is actually a reasonable approach when your systems are functioning normally and all of your nodes are exhibiting the same workload. Because of that, it's easy to get an idea of aggregate system performance in those situations by just adding up the 99th percentile from each of your nodes and dividing by the number of nodes. That gives you a pretty easy number to calculate, and the data from most monitoring systems makes that very easy to do. Where this approach becomes problematic, though, is when your individual node workloads are asymmetric, when machine A is not actually like machine B. If you're looking at an average of percentiles when that happens, you will not know it, because this approach hides those asymmetries. I've got a kind of mathematical description of it here, and what it basically says is that the P95 of data set one unioned with data set two, which is the correct way to compute a P95 across two systems, does not equal the average of the two systems' individual P95s. So let's dig into that. This is a graph of different percentiles: we've got the median down at the bottom, and the 90th and 99th up there. The axis didn't resolve too well, but let's say that top line is about 200 milliseconds, and at that spike there it goes up to, let's call it, 500 milliseconds.
So what's the P99 across this entire time range? Well, you might say, you know, that top line there is 200 milliseconds and that spike is 500 milliseconds, so maybe our P99 is somewhere around 300 milliseconds. We're not averaging, but we can do kind of a weighted average just based on that, and that might seem like a reasonable thing to do. It's pretty reasonable; I've done that in the past. But what you don't see here is how many requests are occurring in each time slice. What if I told you that 99% of the requests occurred in that section where your P99 is 500 milliseconds? And that makes some sense: your P99 tends to go up when you get a lot of requests. So when you take this sort of percentile-based graphing approach, the details can be very misleading, since you're not seeing all of the data. So let's take a look at averaging percentiles over two different nodes. This is a histogram containing the request distributions for two different web servers, along with their corresponding P95s. For a quick orientation here: the number of samples is on the y-axis and latency is on the x-axis. The farther out on the x-axis, the slower the response and the higher the P95; the farther up on the y-axis, the more samples. So we've got the red web server, which has its P95 right about there, and the blue web server has its right about there. The blue web server had more samples at lower latencies, and the red web server had a flatter distribution that was generally higher. So just based on this, we can say, hey, the red web server is probably slower than the blue web server. So what happens if we try to calculate the P95 across both web servers by averaging them, versus aggregating the samples and then calculating it? Let's say the P95 for the blue web server is 220 milliseconds, and for the red web server it's 650 milliseconds. If we take all of the samples and calculate the P95 from that, we'll end up with about 230 milliseconds. But if we average the two P95s, we'll come out with around 430 milliseconds. That's about a 200 millisecond difference, which is pretty big. If the sample distributions matched up evenly, I could average those P95s and get the same answer as if I'd just aggregated the samples and then calculated the P95. And when all your systems are behaving identically, that works, and that's part of why people do it. But when your systems skew like this, that's when you have a problem and this approach does not work. So here's kind of a visual representation of it; this is actual systems data. We've got the actual P95 there, slightly to the right of blue's P95, and the average P95 over there, and that's the magnitude of your error. If you're trying to set a service level objective, or base a service level agreement, on data that's got errors that large, it's going to come back and bite you. And as folks probably know, with most monitoring systems out there, open source ones included, you can average these pretty easily and you can get the wrong answer. So when we ask why this myth, that I can average percentiles, has propagated, it's because the tools out there have allowed folks to do it very easily, and there hasn't been a lot of training or knowledge provided on the math behind it.
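Here's a small numpy sketch of that point: the P95 of the pooled samples is not the average of the per-node P95s. The two synthetic distributions below are made up and only loosely mimic the "fast blue server, slow red server" example; they are not the talk's actual data.

```python
# Averaging percentiles vs. computing the percentile of the pooled samples.
import numpy as np

rng = np.random.default_rng(0)
blue = rng.gamma(shape=2.0, scale=50.0, size=100_000)   # many fast requests
red = rng.gamma(shape=4.0, scale=120.0, size=5_000)     # fewer, slower requests

p95_blue = np.percentile(blue, 95)
p95_red = np.percentile(red, 95)

averaged = (p95_blue + p95_red) / 2                        # the wrong way
pooled = np.percentile(np.concatenate([blue, red]), 95)    # the right way

print(f"average of P95s: {averaged:.0f} ms, P95 of pooled samples: {pooled:.0f} ms")
# The two disagree badly whenever the nodes' workloads are asymmetric.
```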
So we've been through the common mistake: don't average percentiles. Let's take a look at how we can compute SLOs the correct way, and the first way we'll look at is with log data. One way to collect latency data without instrumenting your apps is to look at, say, your Nginx logs or your Apache logs, which will tell you this request took x seconds. That's the wrong slide, pardon me... that's better. So I can pull the latency data out of the log file, gather it all up and store it, and compute my SLO from all of those raw samples. How much does that cost? It's about 100 bytes per log line with this type of approach, so 10 million requests is about a gigabyte, which is not a huge amount of information. If you're doing something like an API, that's not that much data. But consider that this might be just one web server. If you've got hundreds of them, or if you're serving web pages and you're also tracking latency for static asset loads, now your problem just got a lot bigger, because you could have dozens of assets per page. The way you can do this is to take those logs and throw them into HDFS, or put them in Elasticsearch or Splunk. Folks here use Elasticsearch or Splunk? I've done that in the past. Or you can do it the old-fashioned way: grep out your latencies, awk them out of the logs, dump it all into one file, and query through that. This is a fairly easy approach to implement, and it's pretty straightforward; you just need a lot of servers.

Once you have those latency metrics available, you can calculate your service level indicator. You take all of those latencies, sort the samples, and find the sample that is 5% of the way from the top; that latency value is your 95th percentile. As I mentioned before, Splunk and Elasticsearch have that capability built in. I've actually done this with Splunk; it's not that difficult. Now we can take that SLI and apply it across the SLO time frame, and that's pretty simple too. Say my 95th percentile SLI has to be met 99.9% of the time. I can take a year of samples and chop that into 1,000 slices. For each slice I run the calculation for my SLI, and if I met it for 999 of those slices, I'm done; I met my SLO. Computationally this is a fairly easy approach, but there are some operational details that may stymie you at large scale. Pros of this approach: it's really easy to configure your logs to dump latency out, and you're probably already doing it. It's fairly easy to roll your own processing code. You might already be running Elasticsearch or Splunk, so you've already got the tools. And it gives you accurate results. Some cons, though: it can get expensive. Who here has bought log analysis software, commercial stuff, or put together open source clusters? There are some name brands I've mentioned in the past few slides, and they're not cheap. You could take a sampling approach, but like I showed on that earlier graph where nearly all the samples were in one bump, if you sample you might miss exactly the critical data you're looking for. This approach can also be slow, because you're shipping a lot of data around, and it can be tough to scale. So the approach works, but there are some downsides.
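As a rough sketch of that calculation, assuming you've already grepped the per-request latencies out of your access logs into plain text files with one value in seconds per line (the file names, the 300 millisecond target, and the slice layout are placeholders):

    # Log-data approach: compute the P95 SLI per time slice, then check it across slices.
    import glob
    import math

    def p95(samples):
        # Sort the raw samples and take the one that's 5% of the way from the top.
        ordered = sorted(samples)
        return ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]

    slices_met = 0
    slice_files = sorted(glob.glob("latency_slice_*.txt"))   # e.g. one file per time slice
    for path in slice_files:
        with open(path) as f:
            latencies = [float(line) for line in f if line.strip()]
        if latencies and p95(latencies) <= 0.300:             # SLI: 95th percentile under 300 ms
            slices_met += 1

    if slice_files:
        pct = 100.0 * slices_met / len(slice_files)
        print(f"SLI met in {slices_met} of {len(slice_files)} slices ({pct:.2f}%)")  # SLO wants 99.9%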
So let's take a look at the second approach, which is counting requests. Calculating an SLO by counting requests is fairly simple. First you pick your SLI threshold, say 30 milliseconds. Then you instrument your application to count the total number of requests and the number of requests that violate your SLI threshold. So I'd instrument my app to count the total number of requests and to count the requests that took longer than 30 milliseconds. Plug those two numbers in and I've got my SLI, and from that my SLO. It's a very simple approach, and it's similar to the cumulative histograms used by Prometheus: they're fixed buckets, where you specify a number of predetermined bins and then count up the number of requests under those thresholds. Computationally, this is really easy to do. Here's a visualization of the approach: total requests are in gray, and requests that violate my SLO are in red. A graph like this is very easy to generate; once you've got the data, you can plug it into pretty much any monitoring system out there, open source or commercial. So let's say our SLO is that 90% of requests have to be under 30 milliseconds. I've got about 60,000 total requests there and about 2,300 bad requests. In this case, 96.2% of my requests were under 30 milliseconds, so we met our SLO. This is extremely easy to implement; you just have to instrument your application.

Let's look at the pros and cons of this approach. It's simple and really easy to implement, and you can use any number of tools. It's very performant; counting requests is extremely fast. It's very scalable: I can keep those two counter metrics in 16 bytes of RAM, one 64-bit int for total requests and one for the requests over the threshold. The results are accurate, and it's easy to script the calculations. But there are some cons. The fixed threshold means you have to reconfigure your app if you want to adjust your SLI, which makes this approach highly inflexible. It also means you can only analyze historical data against that one threshold. And unless you've got a system that is tied very tightly to a specific threshold, one you've spent a lot of time ensuring is the right threshold for your business, chances are your SLI threshold is not ideally placed. So while this might be an appealing approach on the surface, there are some hidden costs and inflexibilities associated with it. I've actually implemented this approach with commercial monitoring systems, and it is dirt simple to do.
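Here's a minimal sketch of that in Python, with simulated latencies standing in for real instrumentation; the 30 millisecond threshold matches the example above, and everything else is made up:

    # Counting approach: two counters and a fixed threshold baked in at instrumentation time.
    import random

    THRESHOLD_S = 0.030   # fixed SLI threshold: 30 milliseconds
    total_requests = 0    # one 64-bit counter
    slow_requests = 0     # one 64-bit counter for requests over the threshold

    for _ in range(60_000):
        latency = random.expovariate(1 / 0.010)   # simulated request latency, ~10 ms mean
        total_requests += 1
        if latency > THRESHOLD_S:
            slow_requests += 1

    good_pct = 100.0 * (total_requests - slow_requests) / total_requests
    print(f"{good_pct:.1f}% of requests under 30 ms; 90% SLO {'met' if good_pct >= 90 else 'missed'}")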
So let's move on and talk about histograms. Who here knows what a histogram is? A histogram looks like a bar chart; it's also known as a distribution, and it's one of the seven basic tools of quality. It's a graph with the number of data samples on the y-axis and, in our case, the latency value on the x-axis. Each bar has a high and a low latency boundary, and there's a count of the samples in between. We'll refer to those as bins, but you'll also see them called buckets in some of the monitoring solutions out there. We can use some numbers to characterize a particular histogram. We've talked about percentiles, which we can also refer to as quantiles using the Q notation. There's the median: the point at which half the samples are below that value and half are above. Then there's the mean, which is the average: the sum of the sample values divided by the count, which is different from the median. If Bill Gates walks into the room, all of a sudden our average income goes up by a lot, but our median income really doesn't change. Q(0.9) is the 90th percentile; Q is just a fancy way of saying percentile. Q(1) is the maximum. And we have this thing called a mode, which is a value where there's a local maximum in the number of samples; a histogram can have several modes. Gil Tene has an excellent site explaining histograms, hdrhistogram.org, and I encourage you to check it out.

There are several types of histograms, and I'm listing just a few of the common variations here: linear, approximate, fixed bin, cumulative, and log linear. These types really just represent attributes of histograms that you can combine to create a histogram that fits your business needs. For example, you can create a cumulative log linear histogram, which has bins in powers of ten but where each subsequent bin contains the sum of the bins with lower values. The HdrHistogram I referenced on the previous slide is a log linear type histogram. I won't go into detail about all the types, but we'll be looking at log linear histograms. I've got a presentation up on SlideShare, which I'll reference at the end, that goes into excruciating detail on the ways you can combine these to get different types of histograms.

So this is the log linear histogram we use at Circonus. There are two GitHub links down here, for the C implementation and the Go implementation. With this type of histogram, bin sizes increase by a factor of 10 at every power of 10. If you look at these bins here on the left, they're very narrow, but past the million mark the bins are ten times as large. The x-axis is in microseconds, the y-axis is the number of samples. As an example, there are 90 bins between 100,000 and 1 million, each with a size of 10,000. This sample histogram shows the latency distribution from a cluster of load balancers, in microseconds, and I've overlaid the average, median, and 90th percentile values. This is a non-normal distribution, unlike something like a population graph for people, which would be a bell curve, a normal distribution, where the average and the median line up. This histogram represents about 50 million data samples, and it's relatively cheap to store in this format. The nice thing about histograms is that the storage size is pretty much invariant to the number of data samples, because the number of samples is just a count; that line just moves up. What determines the memory footprint is the span of bins, and for operational telemetry that tends to be around 300 bins.
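To make that binning concrete, here's a toy sketch of the "bin width is a tenth of the current power of ten" idea; this is my own illustration, not the actual libcircllhist code:

    # Toy log-linear binning: within each power of ten there are 90 bins,
    # each one tenth of that power of ten wide.
    import math

    def bin_bounds(value):
        # Return the (low, high) boundaries of the bin containing a positive value.
        exponent = math.floor(math.log10(value))   # which power of ten we're in
        width = 10.0 ** (exponent - 1)             # bin width: a tenth of that power of ten
        low = math.floor(value / width) * width
        return low, low + width

    for v in (12.34, 456.0, 123_456.0):
        low, high = bin_bounds(v)
        print(f"{v} falls in bin [{low}, {high})")

Running that puts 123,456 into the bin from 120,000 to 130,000, which is the 10,000-wide bin size mentioned above.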
And we can store something like this very compactly: each of these bins fits in probably less than 8 bytes of memory, a couple of bytes for the count of each bin plus the boundaries. So it's a very memory-efficient data structure. This is another implementation of our log linear histogram, and this particular histogram shows syscall latencies captured by eBPF for both read and write calls. You'll notice each of these has a couple of different modes corresponding to different code paths, and you can clearly see where the bin size changes, right about the 10 microsecond boundary. This particular histogram has about 15 million samples in it, so this is real operational data. A couple more notes on histograms. They have a property called mergeability, which means I can take a histogram of values for A and a histogram of values for B and combine them, assuming they have identical bin boundaries: I just take each bin and sum up the counts. And I can do that not only across space, histograms from different sources, but also across time. As long as those bin boundaries are the same, I can take a histogram of yesterday's data and merge it with today's, and I can do this for 10,000 histograms fairly easily.

So SLOs: how many requests are faster than, say, 330 milliseconds? How do we calculate that with a histogram? It's fairly simple. I walk the bins from the lowest value up until I reach the bin that contains 330 milliseconds, and I add up the bin totals along the way. When I reach that bin boundary, I've got a running count of how many requests were faster than 330 milliseconds, and I'm done. It's very easy to calculate. I gave a lightning version of this talk back at NewOps Days at Splunk and then put my slides up on Twitter, and Liz Fong-Jones, who was at Google at the time, read them and brought up a good point: what happens when the value you're interested in, say 330 milliseconds, falls between bin boundaries, in the middle of one of those bars? You don't need to be limited to sample values that lie on bin boundaries. If I've chosen 330 and my boundaries are 300 and 400, I can interpolate across those boundaries and get an approximate answer. Errors in the operational data using the log linear histogram I've shown have a maximum of about 5% in the absolute worst case, and I'll talk about that in a couple of slides. But that brings up the question of what binning algorithm the log linear structure I've shown uses. In this implementation, we have bin boundaries at 320, 330, and 340. At the scale of 10, the bin boundaries occur at each integer, so I have 10, 11, 12, 13. And at the scale of 10 to the minus 6, I have boundaries at 10 times 10 to the minus 6, 11 times 10 to the minus 6, 12 times 10 to the minus 6, and so on. The log linear histogram covers a very wide range of values while simultaneously achieving high precision across those ranges, and like I mentioned, we see about 300 bins total needed. So, about the error, let's take an example. Say I've got a bin bounded at 10 and a bin bounded at 11. If I insert a value of 10.99, it gets recorded as roughly 10.5, the middle of that bin. So the value I put in was about 10.99, the value I get back out is about 10.5, a difference of about 0.5, which is about 5%. And that's for one extreme sample in one bin; we never see that operationally in practice.
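To make the bin walking and interpolation concrete, here's a small sketch in plain Python; the bins and counts are made up for illustration, and this is my own rendering of the idea rather than the Circonus implementation:

    # Count how many samples were faster than a threshold by walking bins from low to high,
    # interpolating inside the bin that the threshold lands in.
    def count_below(bins, threshold):
        total = 0.0
        for low, high, count in sorted(bins):
            if high <= threshold:
                total += count                                        # whole bin is under the threshold
            elif low < threshold:
                total += count * (threshold - low) / (high - low)     # partial credit, interpolated
            else:
                break                                                 # everything else is slower
        return total

    # (low_ms, high_ms, sample_count) tuples; 330 ms lands between the 300 and 400 boundaries.
    bins = [(0, 100, 5000), (100, 200, 3000), (200, 300, 1500), (300, 400, 400), (400, 500, 100)]
    print(count_below(bins, 330))   # 9,500 whole-bin samples plus 30% of the 300-400 bin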
So the bin boundaries provide a good base for calculating an SLO. Let's talk about the pros of computing SLOs with histograms. They're very space efficient; in practice we see about 300 bytes per histogram, which is about a tenth the size of the log data approach. We can choose our SLI thresholds as needed to calculate SLOs, and we can also calculate things like interquartile ranges, get counts of samples below a certain threshold, like the how-many-requests-were-below-330-milliseconds question I mentioned, and do standard deviations and other statistical calculations. We can aggregate them across space and time. They're computationally efficient; I'll show the C implementation at the end, and typical bin insertions are on the order of nanoseconds while percentile calculations can be done in about a microsecond. The errors are bounded to about half a bin size. And there are several open source libraries out there, and we'll look at a couple here, so you don't need to write your own. There are some downsides, though. The math is more complex compared with the other methods I've shown, but it's still relatively simple compared to something like t-digest or other quantile approximations. And there's a little bit of loss of accuracy, but as I mentioned, that 5% is the absolute worst case; in practice it's far smaller.

So let's take a look at how we can use open source libraries to calculate SLOs with that log linear histogram library I mentioned, in Python. Any Python programmers here? Okay, a few. It's pretty easy stuff. There's the libcircllhist C library, which has Python bindings; I've got the link up there. If you install the C library on your system, you can use Python's pip installer to just do pip install circllhist, and that will install the Python bindings. I had to use easy_install to install pip, but such are programming languages. So here's some code. I can make a new histogram object called h, and then I can call insert on it to put a value in. I can insert a few more values, and then I can call h.count, and that gives me the number of values in the histogram. I can sum them up and print the sum. And then I can calculate an arbitrary quantile, in this case Q(0.5), which is the median, and that gives me my quantile value. So this is a pretty easy way to go get this open source library, shove a bunch of values in, and be able to calculate my SLI.

Let's take a look at something a little more complex. If I've got a set of latency values for a web service, I can create a histogram from those as I've shown, and I can generate a visual plot using code that looks like this. I can also calculate my 95th percentile and draw a line on the graph. What I do here is bring in pyplot and my histogram library, make a new histogram object, and add a bunch of values as shown here; I can do that programmatically by reading a bunch of latency values from a file and inserting them. Then I call the plot method on my histogram object and add a plot line. And guess what? That's how I made that earlier graph: I took a bunch of latency values from two web servers, ran this code, and got this graph. There's about 10,000 samples here.
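If you want to try the snippet I just described, it looks roughly like this; I'm assuming the Python bindings expose a Circllhist class with insert, count, and quantile calls as described above, so check the project's README for the exact names:

    # Basic usage of the circllhist Python bindings (class and method names as described above).
    from circllhist import Circllhist

    h = Circllhist()              # a new histogram object
    for latency_ms in (110, 120, 130, 950):
        h.insert(latency_ms)      # put a value in

    print(h.count())              # number of values in the histogram
    print(h.quantile(0.5))        # Q(0.5), the median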
You can scale this to several million samples on commodity hardware using that library. And it looks like we'll have some time to see this in action. So let's do a quick review. Be careful of averaging percentiles: it's easy to do, but you can easily end up with incorrect results. The best approaches for calculating SLOs are counters or histograms. The approach using log data produces correct results, but it tends to be economically inefficient. Histograms give you the widest range of flexibility for choosing different latency thresholds, but I'm only aware of two open source implementations: the log linear one I mentioned, and Gil Tene's HdrHistogram library in Java. There aren't any time series databases that support sparsely encoded or HDR histograms yet, except for IRONdb. You can store histogram data serialized on disk, but you might run into some scaling challenges. And I've got a link there, well, that did not format correctly, to some of my slides where I go into histograms in detail.

Folks want to see a demo? Alright, let's see. So, Paula Goodmine originally developed this material. Let's see... there we go, that's what I'm looking for. Let me see if I can change the... this will let me use a different display scale. Let's try this one. Okay, the projector does support it. Can you folks read that in the back? Kind of. Let me see if I can make it a little bit bigger. How's that? Any thumbs down? No? Okay. So my colleague Heinrich Hartmann, a data scientist, created this Jupyter notebook that's available on his GitHub repository under Data Science for Effective Operations; I'll have a link to it in the slides. Do folks here use Jupyter notebooks? Yeah, so it's an interactive way of evaluating code. Let's look at some of this Python code. First I pull in the log linear histogram library, hit Shift-Return, and that executes the statement. Then I make a new histogram object, insert three different values, and print the mean and the count, and you can see it prints those. I can add another value, execute that, and it bumps the mean way up. And then I can throw this in a loop and get a count of each of the bins, because h basically behaves like a dictionary here. This histogram library provides a number of functions; let's take a quick look. There's bin_count, the total number of bins in the histogram, and clear, which clears it. And these functions count_above and count_below are particularly useful, because I can pass in a value, say an SLI value of 500 for 500 milliseconds, and get the count of requests that were below 500 milliseconds and the count above; just using those two functions I can calculate an SLO very easily. There are a number of other functions in here, like the quantile function, so I can dump out, say, the 90th percentile; my 90th percentile is 1060. Let's try a count below: there are three values below 999 and one value above 999. The quantile function isn't actually executed by Python itself; it calls out to the C library behind the scenes. I don't know if folks here were in the previous talk about Panoptes at Yahoo, where he said Python's slow. For computations like this it is slow, but because we have this C binding it's very easy to call out and just have C do the heavy lifting.
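And pulling together the count_below and count_above idea from the notebook, an SLO check ends up being a couple of lines; the same assumption about the bindings' method names applies, and the 500 millisecond threshold and the sample values are placeholders:

    # SLO check with the two counting functions described in the notebook.
    from circllhist import Circllhist

    h = Circllhist()
    for latency_ms in (12, 48, 230, 410, 980, 1060):
        h.insert(latency_ms)

    sli_ms = 500                        # placeholder SLI threshold: 500 milliseconds
    good = h.count_below(sli_ms)        # requests faster than the threshold
    bad = h.count_above(sli_ms)         # requests slower than the threshold
    print(100.0 * good / (good + bad))  # percentage of good requests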
So let's go down a little bit. I've got a couple of data source files in here. This one represents some time series data: there's a timestamp here, and then a dictionary of value-to-count pairs. These could be metrics that you're collecting for your application, and I can store them in just a text file like this. Then I can open up that file, loop through each of those dictionaries, and create an aggregation of them, which we'll call API latency. I can define this function here, which basically says go plot this histogram, pull in pyplot, and go ahead and actually plot that histogram like this. This is the same code I used to generate the red and blue web server graphs. I may have angered the demo gods; let me reload these. There we go. So yeah, it's pretty easy to use these open source tools to do this.

We've got about 11 minutes. Do you have questions? Should we look at some code? Okay. Do folks here write Go? A couple. How about C? A couple. Okay, I'll go through the Go version, since C folks can usually understand Go, but that's not always the case the other way around. So this is the Go implementation, circonusllhist, and it pulls in a number of libraries. I mentioned bin boundaries before; the bin boundaries are declared here, and you can see the range up there on the top left. Or actually, the powers of the different log scales are declared here, and then the bin boundaries are each of those ranges divided by 90. So here we've got changes at 1, 10, 100, and we're able to cover a range from 1 times 10 to the negative 128th to 1 times 10 to the 127th, and I may be off by one on one of those. But this gives us an extremely wide range. Here's the structure for each bin: we've got an unsigned int64 for the bin count, an int8 for the value, and an int8 for the exponent. It's a pretty cheap bin structure, and we can also apply some compression to get it down even further. Let's take a look at how we actually calculate a quantile. We've got this function here, value at quantile. We basically pass in a value on the x-axis, which could be a latency, and it returns the number of samples. That calls a proxy quantile function, which does the bulk of the work. There's a little bit of locking in there, but the bulk of the work is essentially done here. This is the bin walking code: basically, like I mentioned previously, it walks through each of the bins, finds that value, does interpolation if it needs to, and returns the count. Pretty simple stuff; this code is not complicated. We've got this implementation in JavaScript also. Any JavaScript programmers? This might look like a lot, but if you spend a few minutes reading it, it's not too bad. Have folks worked with code like this in the past to calculate histograms? One, I saw one person. Yeah, it looks difficult up on screen; the first time I read through it I didn't grok it at all, but if you sit down with it, it's pretty straightforward. You're walking that x-axis until you hit the bin you want, and then you can do interpolation: take the high bin value, take the low bin value, weight the value you have in between the bins, and you can approximate the count there. Any questions on that? I think we're about out of time. Thanks. Go calculate some SLOs.