We're in progress here, but I've got to figure out how to get this thing to flip. Yeah, you're going to help me out. Nothing, it's not even showing up. Yeah. Thank you. It's kind of funny that this is going on, because a few years ago, maybe several years ago, I went to the SCALE conference, and I think it was closer to the airport at that time, and the whole thing broke down about two minutes into my talk and I had to ad-lib. Luckily, there were a lot of concepts there. Yeah, thank you for your help; you're doing well as a mic check. Yeah, we'll get this. Although this is going straight into my mouth. Yeah, maybe I'll just switch to manual, look at my deck and see... I've got an image at slide three. So, let's see what we can do. You're all here.

So this is sort of my first slide: Observability Three Ways. Not the only three ways, but three of them. This is not a new talk, but because I don't have a very good memory and I don't give it that often, it's a different one each time. I started with this because there was a lot of confusion about the more recently hyped tracing stuff, and it seemed like there was some forgetting of all the other activities you're doing, how you're tracking those, and how things might be different. And I ran across, in a working group that I'm a part of, a pretty nice way to look at this: a triple Venn diagram of logging, tracing, and metrics. Because a lot of people don't even use metrics, or they're just starting to hear about tracing, I thought it would be a good idea to use that type of model to tell the story. Most people have done some logging, if nothing else their hello world app, right?
So we definitely know how to put things into consoles. My main goal with this talk, whether we can see the slides or not, is to relate these things to each other. Because there's a good chance this is not going to work for a few minutes, I'll let you decide whether we should proceed with no slides until there are some, or if you just want to check your email for a while. He wants to proceed anyway. All right, let's do it.

Okay, so what's a unifying theory about this? Well, one is that if you take these three things, logging, tracing, and metrics, at the end of the day everything is based on events. Now, you could argue about what the ultimate event is, like what's the behavior of your application and what's the representation of it. That's kind of neat to do, especially if you have some time. But if we want to think about these three things with regard to events, we can talk about how they manifest them. In logging, it's a little bit literal: it's event by event. There's not always any state carried between one event and the next. Whereas something interesting about metrics is that they're often enough just statistics over events. So if you have similar events, then you can tell what rates, or min, max, or any other statistical information may be derived from them. You're getting summary information. Okay, are we there? Still not, yeah. Now, tracing: what's salient about it, or what's cardinal, is that there's causality. You have events, but you also have a way to follow a chain of calls that is ideally, and usually, not linked directly to time. So in case you're working on a mobile phone whose clock is off in all kinds of weird ways, you would still be able to tell what caused what.
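To make that distinction concrete, here is a minimal sketch in plain Python (the event data is hypothetical, and no real logging or metrics library is involved) of the same events manifesting event-by-event, as logging does, versus as summary statistics, as metrics do:

```python
# Hypothetical event stream: each event is a (timestamp, duration_ms) pair.
events = [(0.0, 12.0), (0.4, 15.5), (1.1, 9.8), (1.9, 250.0), (2.5, 11.2)]

# Logging view: record every event literally, one at a time, no carried state.
for ts, dur in events:
    print(f"ts={ts:.1f} request took {dur}ms")

# Metrics view: the same events reduced to statistics over the population.
durations = [d for _, d in events]
stats = {
    "count": len(durations),
    "min": min(durations),
    "max": max(durations),
    "mean": sum(durations) / len(durations),
}
print(stats)
```

The log lines let you see each individual event; the stats dictionary tells you about the population but can no longer be unrolled back into the events it came from.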
Even if the clocks and such are different. So, these things are important to recognize; what we're looking at are, you know, the three things I've mentioned. Now the thing is, I'm going to have to memorize the next slide. One way to walk through this is to take one thing and spin it around these three different categories of tools: logging, metrics, and tracing. And since I'm pretty good at making errors happen, you can think about an error as a nice one. An error usually dumps into a log somewhere, hopefully, unless someone swallowed that error, right? Which is a good thing to mention, because if someone is swallowing... I've got a hand raised. Okay, I don't even know what slide I'm on. Let's try three, if you've got the web page. I only messed up the deck a little bit, towards the end, so eventual consistency may work on this.

On slide three, you'll see a triple Venn with metrics at the top, and tracing here, and logging here. The idea is that there's something in common between them, but things are different too. So if we talk about an error, that could manifest as a stack trace, or otherwise a shorter version of that error, in a log. That's going to be neat because it can tell you that some crappy thing happened. It could also tell you what time it happened, because one thing that's in common between all of these is usually time: time series. So even if logs are stateless, they still have time, which is helpful, and that's carried between them to the degree your clocks are good. And if we move that over to tracing, and look at what you can do with tracing with that same error, you could tell whether that error had impact or not.
So if you knew causal information, like who called whom, then you could tell by looking at that whether that particular error was a transient one that failed and then was able to proceed later, or whether it actually failed the upstream request. Or was the upstream request important at all? For example, we have all sorts of errors going on in the system; they don't matter unless they actually cause us impact. Now, tracing won't tell you what the impact is; that's analysis, and you might be the analyzer. But the idea is that this causal information can look at the same data and add some color.

And between logging and tracing, one thing is in common: they can get down to a request, or to a more granular scope than a request. For example, when you look at a log statement, because there could be many log statements inside the same web request, you're able to get down to that level. Whereas if you look at statistics, say the latency, or how many errors per second you're having on an endpoint, you can get a good idea of the population of requests, but you usually can't unroll that into the individual requests inside. The metrics, while they can be focused on the amount of errors you're getting, usually can't be unpacked into individual requests. So they can tell similar information about an error context within a system. Your metrics will usually be categorized by things that are infrastructural in nature or business in nature, and they can all talk about errors, I think. But sometimes we focus maybe too much on endpoints and things, and forget that there are other things happening in the system that each of these tools can look at too, and they're not all necessarily requests. Okay, we're just going to go. So keep dancing. So let's think about: what can you do with metrics that's not infrastructural in nature?
Well, I mean, if you're a streaming video producer, then you're probably not really all that interested in your requests per second unless they affect your streams started per second, or your daily active user statistics, and the same sort of metrics systems can tell both stories. But looking at a log statement, you're not necessarily going to know that, right? So metrics have an interesting dimension where they can cross different scopes, because the classifications are inherent to metrics and the tools around them.

Logs are an interesting one, because while you can get lots of neat things into them, like an error and its context, you have another thing, which is other situations occurring in the same space. If you're in a Java world, you could have different scopes of events, like garbage collections, going on, and other things that tell a story but can't be pinned down to one request, because they could actually impact several at the same time, like buffering, things like that, which may be harder to get into certain tools. So the idea is that among these tools there is not a golden hammer. As the slides progress, we'll talk about what some of the opportunities are when we try to figure out their strengths and also what we can afford to use. I have an idea, which is probably not going to be that good... no, it's not a good one. I was going to see how good my 15-inch screen is. But anyway, sorry. So, to spin this around, on to slide four.
I'm only on slide four, but I did have two Starbucks, so there's still hope that we can finish. Good. But let's use something instead of errors that can be represented in all of these: response time. For example, with logs, we can stuff response time in; certainly people do. Metrics are all about that, but not limited to it. And traces can tree out your response times to tell you, say, for one overall request, how much was allocated to the database hit versus your memcache and whatnot.

This one's hard to do, because on slide five I have a log statement. With log statements, usually, if you're looking at logs and scrolling through, you can figure out response time by just doing math between the lines, right? You do a grep or something, you see, okay, this happened at this millisecond, that happened at that one; you do some math and you hope everything's fine. Sometimes you're lucky and the format might actually have a duration field, which gives you better data with less math. But it's actually more important than that, too, because duration should not be computed with wall time anyway. If you have time correction, because of NTP or otherwise, you end up with something that sounds nice but is too good to be true: a negative-duration request. And metrics are great at showing anything, including response time, in a population of values. I wish I had paper, because I could draw this, but I'm not sure I can charade a heat graph.
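As a small illustration of that point about wall time: in Python, `time.time()` is subject to clock correction, while `time.monotonic()` is guaranteed never to go backwards, so it's the safe choice for durations (the request-handling bodies are elided here):

```python
import time

# Wall-clock timing: if NTP steps the clock mid-request, the subtraction
# can come out negative, which is how "negative duration" requests appear.
start_wall = time.time()
# ... handle the request ...
elapsed_wall = time.time() - start_wall  # not safe: subject to clock correction

# Monotonic timing: immune to clock adjustments, so durations stay sane.
start = time.monotonic()
# ... handle the request ...
elapsed = time.monotonic() - start
assert elapsed >= 0  # guaranteed by the monotonic clock
```

Most languages have an equivalent monotonic clock; the trap is that the wall clock is usually the more obvious API.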
I can't do that, not yet. The thing about metrics is that we usually look at histograms and normal distributions, and it all seems nice, and they do show things like response time. But one of the things about response time you'll find, and there are lots of good blogs and such that talk about this, is that, firstly, latencies are not normal. They usually skew towards a long tail, meaning that a block of your requests happened in a certain range, but then it trails out; and because I live in Malaysia, that means you'll get some really crazy latency at some points. The other thing is that when you start looking at the whole population of values, not just the ones from one box, you'll find that there are multiple peaks. So it's not just that it's not a normal distribution; there are multiple distributions, for various reasons, including the performance of one new version versus another that happened to be co-deployed at the same time, and the wires between them.

But despite how complex that can be, the good thing about metrics is that they allow you to get that context. So if you had a log line that said 95 milliseconds, then you'd know how that compares to the system at the time it occurred. The one thing I like about metrics is that they tell us what our system was during this five minutes, which is going to be different from the next five minutes. How was it during that time? Was it actually a notable outlier or otherwise? And metrics are probably the only good source of that. So, one of the things that I was talking about with a colleague of mine, John (and by the way, I'm Adrian; I think I missed my intro slide), was: okay, now we've got all this complex stuff, and people are saying latency isn't normal, and there are multiple modes, and all this other stuff, but we still have to alert, right? So what do you do with this? If 95 milliseconds was bad,
how would we know it was bad, and why? If averages suck there, what do we do? I think a lot of people struggle with that, and his advice was that you alert on the max and you tune to the 99th percentile. What that means is, you take things into what you can actually act upon: you can often control certain things, and you can't control other things. You can't control much about the worst 1%; all sorts of things can happen there, and I can totally mess that up. But maybe you can minimize pause times and things that make some of your worst requests take longer. Maybe you can use HTTP/2, or QUIC, or HTTP/3 now, I guess it's called, or all sorts of technologies to reduce certain types of latency. But at any rate, if you think about the max, you definitely can't control it, but you should probably know about it. So now, how do you actually act on that information? First off, this was just one opinion, and there are plenty of people at any conference like this who will have opinions on it. But get opinions. The thing I wanted to highlight with that, and this would be slide seven, I guess, is that there are different opinions, just like logging, metrics, and tracing are all different tools used in different ways together. You know, take another opinion.
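A rough sketch of that "alert on the max, tune to the 99th percentile" advice, with made-up latency samples and an assumed alert ceiling (both purely illustrative), might look like:

```python
# Hypothetical latency samples for one endpoint over a window, in ms:
# mostly fast, with a long tail and one terrible outlier.
samples = sorted([10] * 95 + [50, 80, 120, 300, 1400])

def percentile(sorted_vals, p):
    # Nearest-rank percentile: simple, and good enough for a sketch.
    k = max(0, int(round(p / 100 * len(sorted_vals))) - 1)
    return sorted_vals[k]

p99 = percentile(samples, 99)   # the target you tune/optimize toward
worst = samples[-1]             # the max, i.e. the p100: you can't control
                                # it, but you should know about it

MAX_ALERT_MS = 1000             # assumed SLO ceiling, purely illustrative
should_alert = worst > MAX_ALERT_MS
print(p99, worst, should_alert)
```

Here the p99 (300 ms) is the number you'd work on improving, while the page fires because the single worst request blew past the ceiling.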
It doesn't hurt. It might not be the good one, but you'll be better off for thinking about it. The way that traces represent a response time is graphing, but in a different way than metrics, because we're only looking at one request, or sometimes an aggregation of multiple similar requests, like a network flow diagram or something. So we're looking at flows, and the neat thing about that is we can tell some causal information; we can tell where things are. Because we're seeing things, we have a good chance to tease out what is actually abnormal about this particular application behavior, versus maybe the last time you saw it, or versus what the programmer told you they did.

Now, tools like traces can draw these waterfall diagrams, which are basically a top line with some lines going down, and it's supposed to look like a waterfall. I don't actually think it does, but they do represent what was recorded. They don't represent intent, though, and I'll get to that later. I don't think we've gotten a good way of recording intent yet. But if you think about it: someone says, oh, I just installed this async library, but then you still see a stair-stepping pattern, because the application is still doing things one by one, each after the other one finished. There's actually a break in intent there that they can see when they look at it. But to actually get that behavior, and the deviation from the intended behavior, highlighted to you and annotated automatically, I think that's still a work in progress in the industry. Tools that retain causality will probably be closer to it.

And, yeah, to summarize this sort of wander: we've got some tools that are kind of easier to reason with, and some with more sophisticated data models. Logs can probably be considered the easiest. It's an event.
It's got a timestamp, hopefully, and there's a multitude of ways of emitting them that people are very familiar with. Metrics have that ability to help identify trends; they don't identify the trends themselves, but analyzing them helps you identify trends. And then traces add that causal link; that's what they do. Are we there? Yes, a lot of blinking. Now, that was slide nine. So if someone has a 19-inch laptop it might work, or if we can connect straight to that thing. Okay, that has been tried in the past.

So, let me do this. I think we've got a half hour left. I'm going to look through and skip the slides that have code on them, because they don't really work this way, and I'm going to go to sort of the impact. So I'm going to skip ahead to slide 14 and just wander through this; that's about the impact of timing code. What I'm trying to do here is talk about how sometimes we put the act of recording information into the foreground in our apps. Oftentimes we have, you know, some business code, and then a log statement that someone has personally handcrafted inside, with a flourish, like: "this is a request and it happened here, oh my." And they've messed up their pattern substitution, so the "oh my" didn't actually get into that log message. Or you have some tracing code that was supposed to show the end of the response happening, but somebody forgot to put in the async callback, so it never got there. Or the metric where you accidentally put a constant number in. I'm making fun of the fact that we can make all sorts of mistakes in timing code, and that's one of the good reasons for not trying to encourage people to do routine development of timing code: there are so many types of mistakes that can happen. That's all right.
We'll just keep doing it. But there are some things that are dramatically different about the ways these types of APIs work, and they're important. The main thing is that with logging, you can definitely mess it up, but it's a fairly well-understood area, and because there's no state needed between log messages, at least not automatically propagated, it's a fairly easy API. I would actually claim that timing APIs, or at least ones that record values, are easier, because if you think about a log, even with the fancy key-value logging now, you can put almost anything in there, and people do, and that's how you get a 20-megabyte attachment in your log, right? But metrics APIs, oftentimes they've got a number-shaped hole, and you can only put a number in it. So you can totally mess up the dimensions of that data, but you're probably going to have a harder time messing up the actual recording process. One reason I kind of like metrics code is that the APIs are very constrained in what they can do, and so it's much harder to mess them up.

And tracing code: the thing that's nice about it is that it has state you can pull out of it later, sometimes. For example, you might have a request ID, which can be handy and can be integrated with your logs and all sorts of neat things. But tracing APIs are definitely the hardest, because they do have to pass that state somehow. That's a problem between one side of the process and the other, like getting it across threads or actors or anything like that, and also between processes, whether it's squirreling it into headers or magic envelopes. There's a lot more to do with that type of API.
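To caricature those three API shapes in a few lines of Python (all three "backends" here are toy stand-ins, not any real logging, metrics, or tracing library):

```python
import logging
import threading

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("app")

# Logging: stateless, anything goes, one event at a time.
log.info("request started")

# Metrics: a "number-shaped hole". The API only accepts a number, so the
# recording itself is hard to get wrong (the dimensions are another story).
counters = {}
def increment(name, value=1):
    counters[name] = counters.get(name, 0) + value

increment("http.requests")

# Tracing: carries state (trace/span IDs) that must be propagated across
# threads and processes, which is what makes it the hardest of the three.
context = threading.local()

def start_span(trace_id):
    # In real systems this also has to cross thread pools and wire headers.
    context.trace_id = trace_id

start_span("abc123")
log.info("handling request trace_id=%s", context.trace_id)
print(counters)
```

The constrained counter API is the hardest to misuse; the trace context is the one that quietly breaks the moment work hops to another thread or process without the ID going with it.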
That's why people sometimes say tracing is high overhead. It can be high overhead, actually; it has a lot more chances to be high overhead. So it's good to recognize that even if it's not guaranteed to be high overhead, by the nature of that type of recording system it has more work to do. And so my slide "should you write timing code?" will basically argue: try not to, unless you really like it.

Hi. Yep, ask a question. [Audience question.] Yeah, so I was trying to use the response time example, and three different APIs to record a response time. And since you're missing the actual image of the code: in a log statement, you'd basically say, here's "now", log "request started", remember that number, then later get "now" again, subtract, and log that. And so, sorry, that's a limitation here. Good question. So I'm not actually advocating head-in-the-sand. The question was about what the alternatives are. What I'm actually advocating is to not proliferate end users writing timing code. Someone should still do timing code, but try not to do it yourself, by using hooks and frameworks that know the significance of certain lifecycle events, and doing less rolling-your-own. That would have been my point. Sure.
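One way to sketch that "let the framework own the timing" answer is a single shared decorator, so the lifecycle timing is written once, by whoever owns the hook, rather than hand-rolled at every call site (the names here are illustrative, not any real framework's API):

```python
import functools
import time

def timed(record):
    """Wrap a function so its duration is recorded by the given callback."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                # Recorded even if the function raises: no forgotten callback.
                record(fn.__name__, time.monotonic() - start)
        return inner
    return wrap

timings = []

@timed(lambda name, secs: timings.append((name, secs)))
def handle_request():
    time.sleep(0.01)  # stand-in for real work
    return "ok"

handle_request()
print(timings)  # one (name, duration) sample
```

The point is that the easy mistakes from the earlier slide (forgotten end calls, missed async paths, wall-clock math) live in one audited place instead of being re-made in every handler.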
Thank you for bringing it up. So, it's not just the edge cases; I did also want to mention significance, and sometimes the significance is actually in the frameworks. When I work with some of the platform developers in the stuff I do, they have a much better idea of when the first byte hit the wire than somebody casually using the software would. And oftentimes, if you're timing something, you might actually be timing a cached event and then recording it as if it weren't. So there are all sorts of issues, not just edge cases in the programming API, but in the context around it. Which basically means that, yes, timing has to be done, but try to see if there's another way to get that information, ideally from someone already doing it.

And slide 18... 16, sorry, my eyes are blurry, is how to not see tracing code, and this probably goes more into my co-presenter's point. Which is: okay, let's say you want to do less of this coding. How would that happen? I use the word "buddy" because, before "mesh" became a really popular term, people would call these buddies, or proxies, or bastions, and all sorts of other words describing certain pieces of functionality around process interception. Versus an agent, which is usually focused on one host, or one process inside of it. Versus a framework, which is stuck inside your app. These three categories are not the only three, but because this was "observability three ways" and I don't have much brain, I have to use three.
So this doesn't mean these are the only three ways of doing it. The buddy tracing slide, on 17, is basically talking about the idea that sometimes you can treat pieces of your system as a black box, and that can be handy. If you're thinking about latency and other attributes like that, and you have something that's intercepting your traffic already, then usually it has some way of recording statistics around it, possibly even adding in trace headers and things like that to tell what's going on. In the slide, it was just something I lifted from Linkerd years ago, but the idea would be that if you have something which is fooling another process into thinking that it's going to a remote destination, like your favorite software-as-a-service, when it's actually just forwarding to a localhost and something else is doing the work, that's kind of handy for observing latency.

But it also introduces problems. One of the problems is that the application is unaware of what's actually happening. It just sees this really long request, and it might not know, in that log file, that this was long because it was retried a couple of times by the proxy on its behalf. And some of the network mechanics are now split into two zones of existence; it's sort of another take on virtualization's problems. I'm not trying to scare you; I'm just saying there are pros and cons to it. When you treat something as a black box and work around it, if it's purely black box, there are going to be problems like that. Now, I would guess that a lot of solutions are developing more into gray-box mentalities, where you're going to have some things that are handled for you, with some feedback.
But just be aware of that. That was the buddy tracing slide. Agent tracing, on 18: I'm focusing on a Java agent in this code, but in general, one of the reasons why application performance management companies use agents to do things like record behavior (actually, there are many reasons) is that agents usually need to be able to be turned off with the application still succeeding. They need to be able to run with several versions of software at the same time, without needing to be upgraded every single time that software package changes. But in many cases they also have to hide proprietary information about how the performance data is being collected. If you think about most of the more sophisticated performance tools, that IP is really expensive, right, the way they ship data and everything else, so they wouldn't necessarily want to throw that source code into your app. But let's bring it back to open source. If this were an open source agent, what would happen is that you have some code that, while your code is loading or assembling itself, goes ahead and finds points of interest and then times between those. [Inaudible.] If you really put that up, you can get good stuff out of it, but it requires a little more CPU.

I have a weird slide on 22 called bucketing. The gist of it is that classifications are often important: what is slow versus not slow, what's an outlier versus what's not. And we have tools for doing that, histograms and such. Some even do manual bucketing, literally laying out what the buckets are; I think there's a web spec somewhere that has a bunch of categories of performance. And it's important to know that bucketing the size
of the data matters. If you think about what that means, in terms of both what you're going to do and what you expect them to do, you can start whittling down whether these customers are actually the right customers for your service. Because you can't be everything to everyone, and there are probably customers using your service not because it's actually a good fit, but because they just knew it and it was easier, so they used it, when they could have used something else that would be a better fit. And if you're having mismatches here, maybe you can direct them to a team that actually wants to build the thing that they want, and make your problem easier.

So let's look at a more concise example of this. Let's look at ZooKeeper again. As you would expect, we have monitoring in ZooKeeper. If an ensemble gets in trouble (an ensemble is the group of machines that somebody would use), then our on-call gets a UBN task and we try to fix it, right? Because we want to run a reliable service. And this could be for all your normal reasons: there's some hardware failure, or something's going wrong, something that clearly the ZooKeeper team can and should fix. No problem. But what about that thing? Like, why did this thing happen in the first place?
I was a little bit generous earlier about their special workload. Part of what was special about their workload is that they would do regular code deployments, and when they did that, they would restart their service en masse, and they would create a thundering herd, a connection storm, reconnecting to their ZooKeeper ensemble and thrashing the ensemble for 10 minutes at a time. After 10 or 15 minutes, the ensemble would recover and it would all be fine again. But it would create a bunch of alarms, a bunch of noise. And you would go and look at this and you'd say, oh, this is a Wormhole ensemble. We didn't have any metadata, but you look at it, and you learn the numbers, and you go: oh, it's Wormhole again. They're doing a deployment, because it's that time of day. So, I could go bug them about it, but they haven't fixed their deployment yet, so I'll just let it go. And then you start ignoring it, and eventually an on-call gets tired of this, and they create something like this, so that we can programmatically ignore that scenario, because we think we know it's not real. But now we've done this, and there's no time of day here; there's nothing that actually validates it. And so we could totally miss a valid issue, because we were trying to ignore this noise, because we didn't want to go have a conversation with them.

So, in the best case: if customer load, something your customer is doing, is causing alarms for you and you're not talking to them, this is a symptom of some kind of problem. It could be a couple of them; we can kind of see them all here. And by triaging, I mean the thing where you get an alarm and you take it, and maybe you forward it on, or you choose not to forward it on, to some customer that you know is causing the problem. So it could be, as was the case with ZooKeeper, that the system couldn't defend itself, right? We didn't have blacklisting; we didn't have some pushback mechanism.
And so we were at risk of a system failure, because we couldn't push back on our customer. But if we could do that, as we saw before, we would just converge on customer failure, which is not that much better. It could be that we don't have metadata. If we don't have that Scribe metadata, where we know an on-call that we can contact, then who would we programmatically forward that task, or that alarm, to? We wouldn't be able to do it. We just do it because the on-call happens to know: oh, it's that ensemble. And that's not better. So we could fix those things.

But it could also just be fear. It could be, well, they're kind of difficult, and they didn't really want to fix their deployment process, and we didn't really want to argue with them about it, so we just didn't do it. If this is legitimate fear, and this is actually going to be a real problem for you, that is a whole other talk that we don't have time to get into. But chances are, what's really happening is that we're all engineers, and we'd much rather fix a technical problem than a people problem. And so if I can write a little bit of code so that I don't have to go talk to someone, that's what I'm going to do. That's what I want to encourage you not to do: you actually need to go find this person and have that discussion about why their deployment works that way and how they can make it better. This goes back to that expectation going both ways, because the engineering solution we used to paper over this problem was highly suboptimal, to say the least. If they'd actually fixed their deployment process, that would have been a better engineering solution, because we'd had that conversation, right? And so our SLA, and setting these expectations and writing them down, is about putting us in a place where problems get solved by the right team. But in order to facilitate that conversation, you need metrics. You need to be able to show that the problems you're having are, like, the difference between a failure
of your system, something that your team should fix, and something that's caused by customer load, so that you can demonstrate that you're actually expecting them to fix something that they caused. Because if you start sending them noise, then you're not going to get a good response.

So that means we have to talk about monitoring. This is the thing I call the p100 problem, and that refers to the difference between the 99th percentile and 100 percent. Monitoring is not just how you offer a reliable service; it's also part of the service. It's part of how you interact with your customers. And so I want you to start thinking about exposing it to your customers, and the way in which you do that. So let's talk about 99 percent availability. We can pick whatever nines you want, but I'm bad at math and 99 percent is easier, so let's use that. Let's say you have 10,000 servers running Chef and 100 of them are failing. Is that good or bad for Chef availability? I'd say it's actually pretty good, because those 100 servers probably have bad hard drives, or they're flaky for some reason and Chef just can't run; I don't really care. Most of them are web servers anyway, and they're all stateless, like, whatevs. But what if it's one out of 100 database masters? These are actually a subset of hosts that are very reliable, where we expect not to have these sorts of problems, and the impact is much higher for a single host having a problem. So this could actually be a real problem, but it's still 99 percent. But what if it's one ZooKeeper ensemble out of 100? In this case, 99 ZooKeeper customers are totally fine and not having any problem at all, and one customer is totally screwed. There's no 99 percent here. There's just 100 and zero. The same thing with Scribe categories: if 10 of your Scribe categories aren't taking writes, those customers might be totally screwed, and everybody else is fine. So how do we do this right?
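The arithmetic of that p100 problem can be sketched directly. The numbers echo the examples above, but the data structures and customer names are hypothetical:

```python
# 10,000 Chef-managed hosts with 100 failing: 99% availability globally.
total_hosts, failing_hosts = 10_000, 100
global_availability = 1 - failing_hosts / total_hosts
print(global_availability)  # 0.99

# The same "99%" can hide a customer who is at 0%: hypothetical per-ensemble
# health for 100 ZooKeeper customers, with one ensemble completely down.
ensembles = {f"customer-{i}": True for i in range(99)}
ensembles["wormhole"] = False

healthy = sum(ensembles.values())
print(healthy / len(ensembles))  # still 0.99 in aggregate...

# ...but availability as each customer experiences it is all-or-nothing:
per_customer = {name: (1.0 if up else 0.0) for name, up in ensembles.items()}
print(per_customer["wormhole"])  # 0.0: there's no 99% for this customer
```

The aggregate number and the per-customer number are both true; they just answer different questions, which is why you need monitoring at both levels.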
We need different levels of monitoring that are both global and specific to customers. So let's talk about Chef and how we did this. Chef went through a few iterations of its monitoring. As the Chef team, when we first rolled out managing the whole fleet, of course we needed to monitor Chef itself. We needed to monitor our back-end infrastructure and know when the Chef servers were failing, and we needed to have some notion of Chef runs across the fleet. If the majority of hosts across the fleet can't complete a Chef run, clearly that's a problem, and chances are it's a problem with some core cookbook that the Chef team needs to fix. So we'd better know about that and fix it for our customers. That's all pretty clear. But we went a step further and started trying to address some of these more subtle things for our customer teams, and so we automatically generated monitoring for each customer team individually. For their individual group of hosts, like database servers or whatever it is, they have a custom alarm just for those servers, so they can tell if those servers are failing. This allows us to identify problems where maybe there's a small number of hosts, and at a global scale we never actually see the problem because it's not enough hosts, but it has an outsized importance because they're database masters or whatever. So they can actually catch a real problem. And chances are, because it's just that one type of server, it's probably something unique to that type, and the customer team is in a better place to identify the cause than we are, because they know more about, say, databases than the Chef team does. That's a good thing for them; it helps them be more successful. But we need some flexibility here, right?
Otherwise we're going to generate a lot of noise. So first of all, tunable thresholds. We have a mandatory minimum: you are obliged to monitor your Chef runs if you own hosts at Facebook, and you're not allowed to ignore failures entirely. However, you can tune this a bit for noise. Maybe 99 percent is not the right percentage, and so we give you some flexibility to tune it for your use case. But you also need configurable notifications. Your on-call may not work the way the Chef team's on-call does. You want tasks to happen a certain way with certain tags, or... I don't know. I don't even want to know how your on-call responds to incidents; I just care that they respond. And if I can give you as much flexibility as possible to let you work the way you're used to working, I'm more likely to get the response from you that I want, and that's all I care about. So over time we added configurability to let people tune things like that; we didn't plan that from the beginning. And finally, dependencies. If Chef runs are failing globally and the Chef team is responding to it, we had better not generate a thousand alarms, one for every customer, for a thing that they can't fix because everything's broken everywhere. We absolutely did this at first, and it was terrible, and we very quickly added dependencies there. So if there's a global-level failure, then we suppress all the customer-level alarms automatically. Those types of things are important to keep in mind, because you're iterating on this, and you think, oh, it's close enough; no, you're going to make people mad really quickly. Some of this you want to get right from the beginning. So yeah, we're talking about monitoring for us, but also monitoring for our customers, and things that we can give them for free that help them be more successful using our service. That is my last thing about monitoring.
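The combination of per-team tunable thresholds and a global dependency that suppresses customer alarms can be sketched roughly like this; the threshold values, team names, and function names are invented for illustration.

```python
# Hypothetical sketch: per-team Chef-run alarms with tunable thresholds,
# suppressed when a global-level failure is already being handled.
GLOBAL_THRESHOLD = 0.50   # if most of the fleet fails, it's a core problem

def failure_rate(results):
    """Fraction of Chef runs that failed (False = failed run)."""
    return results.count(False) / len(results)

def alarms(teams, global_results):
    # Global failure: page the Chef team and suppress customer alarms.
    if failure_rate(global_results) >= GLOBAL_THRESHOLD:
        return ["chef-team"]
    fired = []
    for team, cfg in teams.items():
        if failure_rate(cfg["runs"]) > cfg["threshold"]:  # tunable per team
            fired.append(team)
    return fired

teams = {
    "db":  {"threshold": 0.00, "runs": [True, False]},        # masters: zero tolerance
    "web": {"threshold": 0.05, "runs": [True] * 99 + [False]},  # 1% failing is fine
}
print(alarms(teams, global_results=[True] * 95 + [False] * 5))  # ['db']
```

The point of the dependency check is exactly the suppression described above: one core-cookbook outage produces one alarm, not a thousand.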
A couple of things that I'll run through very briefly: how does this get even more complex from here? There's no end goal here; it just gets harder as you get better at this and as you grow. One of the ways that can happen is managing life cycle. Most people build things for what's happening right now, but what happens when customers go away? One thing that happened for us: if people stop using a ZooKeeper ensemble or stop using a Scribe category, chances are they're not going to come tell us. They're just going to stop using it, and now we've got this thing sitting there. If this was an external customer, we'd just keep charging their credit card, and as long as we keep getting money, that's cool, whatever; they're just one more person we don't have to talk to. But for an internal customer, this is all just wasted hardware sitting there, costing us money and not doing anything useful for the business. So it's actually important to discover this, and it can be very difficult to tell if these things are unused. If someone's writing to a Scribe category but never reading it, does that mean it's unused, or just that they have some intermittent purpose and haven't used it lately? Or what if they're not writing to it, but they're trying to read from it, and so just nothing's happened? We think there are no writes, but maybe there could be if there was an error condition. Same thing with ZooKeeper: they might be expecting that data to sit there, they just haven't written to it lately, so we can't actually go and get rid of it. So it can be a difficult challenge to figure out whether it's actually okay to get rid of customers that you think are no longer using your service. Another one is customers with customers.
This happened to us in Scribe. We had a metadata service, and you could programmatically register categories, but typically people would do this manually when they were writing a new application: they'd do it once, and then they'd go and run their application. But all of a sudden the metadata service itself started getting crushed under load from people registering categories, because somebody wrote a stream processing framework that depended on categories to glue the stream processors together. When you created a stream processing app, it would register a bunch of categories for you automatically. And then they wrote tests around it, so they were constantly spinning up and down new apps that would create and destroy categories, and they destroyed our metadata service for a while. We had to scale it up to handle the fact that we now had this framework, and that framework had its own customers, dealing with that team, that we knew nothing about. That basically sent us back through all of the things I've talked about, in the context of our customer and our customer's customers. So yes, this is where you end up with customers who have their own customers, and you kind of just have to go back through this stuff. This is just a pointer, but it's all the same sort of thing. Does your customer have the same level of maturity? Do they have metadata to know who their customers are? Do they have a plan for how they're going to send alarms to their customers? If I see a problem coming from them, do I have to figure out how to send the alarm on to their appropriate customer, or do I just leave it to them?
These are all things you have to work out with your customer, and this just gets more complex. That is the last example I have for you. I'm going to kind of skip over some of the summary stuff, because we're pretty close to the hour, and just skip to some final thoughts. As I said at the beginning, there's no one right answer to this. You just take your service as it is today and try to make it a little bit better. The one thing you do get to do is decide what you want your service to be. This goes back to all that expectation setting. You don't have to be everything to everyone, as long as you clearly define what your service is and what it isn't. And the more you understand that, and the more you make it clear to your customers, the more you can solve the problems that your team exists to solve, and not spend time trying to be something that your team is not trying to be in the first place, and that your service is probably never going to actually be able to be. That's the best I can tell you. I hope some of the examples were enlightening, but that's what I have for you, so I will say thank you. If you have questions or you want to ask about anything, I will hang out after, or find me at the Facebook booth. How do we do on time? Cool, right about the right time. All right.

Not yet. The slides will be online; I will put these slides online, but I'm going to do a demo, and this demo hasn't been written down anywhere yet, so: a lot of notes. The demo is definitely new, since I came up with it about an hour ago. Okay, we'll start in about one minute. Hands up if you've used eBPF so far. Awesome, so that's about a third of the room. What about ftrace? Hands up if you've used ftrace. That's actually more people; that should make Steven happy. Steven's giving a talk tomorrow morning on ftrace, which should be really good. I use both tools; I happen to be talking about eBPF for the next hour. Okay, thanks for choosing this talk. My name is Brendan Gregg.
I am a senior performance architect at Netflix, and as of tomorrow I'll have been at Netflix for five years. Time flies; that means I'll have worked on Linux full time for five years. When I came to Netflix I was a DTrace expert and performance analyst, and coming to work on Linux full time I knew that I'd have the challenge of lacking all of the tools that I'd created. But I knew this challenge would also motivate me to find new ways to accomplish things. I discovered early on that Linux had ftrace, and so early on at Netflix I created a suite of tools called perf-tools that were based on ftrace, and they let me solve a lot of problems. I also met Alexei Starovoitov, who was working on a new virtual machine for the kernel called eBPF, and I saw how that could be used to do more programmatic things, and so I've also created a lot of tools using eBPF. And so now, five years on at Netflix, I can finally do what I want to get done in production using all of these new tools. It's really amazing and exciting. We're also at the point with these technologies where, early on, we made sure they got into the kernel, because that takes a long time to propagate, and now we're getting them packaged. I was swapping emails with Canonical yesterday to make sure the bpftrace package appears for the Ubuntu distributions, which would be really, really awesome; they've already created a snap for it. And bcc, which is the other front end, has been packaged. So we're at the point where these superpowers, you can use them for analysis just by doing a package add on a system. But what I want to start with is a live demo. Often in talks we come up with these nice canned demos where everything works and it's all sunshiny, and myself and Martin, when we've been doing these demos, it's like: we want something more gritty. We want something realistic, something where it's not a 10-line program that we're debugging. And so we thought, why don't we just fire up Minecraft and then do
performance analysis of Minecraft? It's going to be fairly complicated. And so I started this an hour ago to see how far I would get. So my demo is using eBPF for performance analysis of Minecraft. Is anyone in the room who has worked on the Minecraft source code, a developer from Mojang? Yes? Open source? Yes, well, yeah. All right, so you might be able to help me out. No, I'm serious, I am actually going to use eBPF on Minecraft, right? So here's Minecraft. Hands up if you've played Minecraft. Okay, so about as many people as ftrace, so that's pretty good. I tell you what, between Minecraft and ftrace, I don't know which is more fun. I'm a performance engineer and I love ftrace; I'm certainly better at ftrace than I am at Minecraft. So I've created this world called scale 17x, and... the arrow keys don't work, because I have to press W and things. Okay. Like I said, Minecraft is working. Now, what I'll start by doing is this. There's a couple of front ends for eBPF tools. One of them is bcc, the BPF Compiler Collection; we've been working on that for many years now. And the other is bpftrace, which is newer. They're complementary. bcc is really good for having all these canned tools that do one thing and do it well. So I'm going to use profile, which is nice and simple; in fact, it's in my path anyway. profile is a CPU profiler; it can sample stacks at a timed interval. And Minecraft is running. How many Javas do we have? Just one?
Okay. Minecraft happens to be a Java application. Hands up if you've done performance analysis of Java. Okay. It's actually pretty hard, because Java is a JIT'd runtime, and the compiler is constantly moving things around and you don't have a symbol table. This is the exact type of gritty, horrible thing that Martin and I were interested in doing a demonstration of. This is where things actually don't work out well when we start doing performance analysis, and that can teach us more about how things work. So I'm going to do profile at, say, 99 hertz. In fact, even though I wrote the tool, I do forget all the switches. I'll just do the process ID, whatever. Okay, so back to playing some Minecraft. And here's my stacks. Some stacks it can translate; this is the C2 compiler thread. But a lot of stacks it can't, so there's all these unknowns, unknowns, unknowns, and they look a bit broken. Anyone know what's going wrong? Multiple things are actually going wrong. What's one of the things? I'm missing symbols, yeah, that's one thing. What else? Stacks should not be one frame deep, usually. So, BPF runs in kernel context with interrupts disabled. It's a very dangerous context to run in; you don't want to do anything really CPU-expensive, and the BPF verifier will check and reject programs that do anything too dangerous. The stack walker that we have in BPF right now is a frame-pointer-based stack walker. The convention is that the frame pointer register, which is RBP on x86_64, saves the location of the previous frame, and so it's very simple: it's just a linked list. I can read the frame pointer register, RBP, and it tells me where the next one is on the stack.
I can read that, and at a known offset, plus eight, I can fetch the return instruction pointer. So by walking a linked list, and then referencing the plus-eight address, I have a stack trace, and the stack trace shows me the flow of code. Stack traces are great. So, to recap what's going on here: I'm doing CPU profiling using profile, a bcc tool, using BPF. It's sampling at 99 hertz on all CPUs. It's walking the stack trace using the frame pointer, and, for efficiency, it's frequency-counting the stack traces in kernel context. The output that we see here says: I saw this entire stack trace eight times. Most profilers prior to BPF will emit these samples to a log or a file. So if you've used the Linux perf utility, it has the binary perf.data file. Over the years they've done a lot of optimizations to perf.data, and Steven can explain them better than I can, for how the data gets written out, but you still have to write out an entry for every sample. With BPF, the kernel will do a hash of the stack trace, and if it's seen it before, it can just increment a counter. I don't need to keep writing these samples out. It's all summarized in kernel memory, where it's very efficient and cheap, and at the end of my profile it dumps out the frequency counts. So this is a new generation of CPU profilers. CPU profiling at Netflix is what we use all the time to understand our code; it's a really important performance analysis tool, and with BPF we're making it much more efficient, which is exciting. CPU profilers, by the way, have been around for decades, so to think that we've just had a major innovation in CPU profiling in the last few years is pretty exciting. So that's great, and this stack trace works: we start at start_thread, and then start_thread calls thread_native_entry, and so on; we get into the C2 compiler, and we're doing PhaseIdealLoop. But a bunch of these are broken. To answer the question: yes, the kernel does have to do the hash comparison of stack traces.
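The in-kernel frequency counting just described can be sketched in user-space Python; the sampled stacks below are fabricated, and a dict stands in for the BPF hash map.

```python
# Instead of emitting every sample, hash each stack trace and bump a
# counter, then dump the counts at the end of the profile.
from collections import Counter

samples = [
    ("start_thread", "thread_native_entry", "PhaseIdealLoop"),
    ("start_thread", "thread_native_entry", "PhaseIdealLoop"),
    ("start_thread", "thread_native_entry", "compile"),
]

counts = Counter(samples)   # in BPF this is a kernel hash map keyed by stack

# Print in "folded" form, one semicolon-joined stack per line with a count,
# which is the format flame graph tooling consumes.
for stack, n in counts.items():
    print(";".join(stack), n)
```

The saving is that identical stacks cost one counter increment instead of one record written out per sample.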
I don't have performance numbers; it's something that Alexei, I'm sure, has put a lot of work into. Alexei is an ex-compiler engineer, so he's very sensitive about cycles and where they've gone. It's something we could test, but I trust that Alexei has optimized this pretty much as far as it can be optimized. Yeah, it seems like a quick hash algorithm. Now, what I want to talk about is why these stack traces are broken. Like I said, this is a great example of where things go wrong, and you can learn more when things go wrong than if I just showed you a beautiful sunny example of things going right. The stacks are broken because compilers will reuse the frame pointer register as a general-purpose register. This came from 2004, and it was a single change to gcc by a contributor who thought it was a good idea for i386. On i386 you had four general-purpose registers, so if you were able to free up the frame pointer register as a general-purpose register, the compiler would go from four to five registers, and you'd have a fairly significant performance boost for all the software that you compiled. So that makes a lot of sense. But x86_64 is a different architecture: we have 16 general-purpose registers, and freeing up RBP just gives us a 17th. It's not really worth it, and the problem is that it breaks stack traces, like this. So why did the gcc developers accept this as the default for compiling? By the way, this is Java, so it's actually the C2 compiler that has done the same thing, right?
They accepted it because, as part of the patch suggestion, the developer said: oh, stack walking is a solved problem, because the gdb folk have done all this amazing work with DWARF and the unwind tables, and so I'm not breaking stack walking by making the frame pointer a general-purpose register, because everyone's using these fancy DWARF unwind tables for stack walking. What they didn't take into account is tracers, tracers like BPF, which didn't exist back then. If you're a debugger like gdb, you can do expensive things; it doesn't matter, you're a user-level process. When you're a tracer, you're running in interrupts-disabled context, so we have to be very careful about the code that runs, and doing the DWARF-style libunwind of stack traces is actually very expensive, and we're not sure we will ever do it in BPF. We may never code this in BPF; someone may say, oh, here's a solution, and then the security folk will say, well, there's no way we can accept this, because it breaks various security policies in BPF. Linux did come up with a new stack walker recently called ORC, by the way. There's a theme here: you've got ELF, which is where this comes from, the executable and linkable format; you've got DWARF, which gives us a lot of debug info; and ORC is a new stack walker. ORC is smaller than DWARF, although that doesn't make sense now that I say it out loud. Anyway, the ORC format is smaller than DWARF, so we've talked about maybe doing an ORC stack walker in BPF. But at the moment we've got frame-pointer-based stack walking, because the interrupts-disabled context is expensive and we need this to be fast. So what can we do? I can fix that. So let's go back to Minecraft.
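The frame-pointer walk described earlier is simple enough to sketch; here memory is simulated as a Python dict of addresses, and every address is fabricated.

```python
# Frame-pointer stack walking: each frame stores the saved RBP, with the
# return instruction pointer at a known offset (+8), forming a linked list.
WORD = 8
memory = {
    0x7000: 0x7100, 0x7000 + WORD: 0x400A10,  # frame 0: saved RBP, return addr
    0x7100: 0x7200, 0x7100 + WORD: 0x400B20,  # frame 1
    0x7200: 0x0,    0x7200 + WORD: 0x400C30,  # frame 2: RBP 0 ends the list
}

def walk(rbp):
    stack = []
    while rbp:                             # a zero saved RBP terminates the walk
        stack.append(memory[rbp + WORD])   # fetch the return instruction pointer
        rbp = memory[rbp]                  # follow the linked list to the next frame
    return stack

print([hex(ip) for ip in walk(0x7000)])
# ['0x400a10', '0x400b20', '0x400c30']
```

Compiling with -fomit-frame-pointer breaks exactly this walk: the saved-RBP chain no longer exists, so the loop stops after one frame.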
Let's get out of this. Now, if it sounds like I'm going off on this weird tangent, I'm not: this is exactly what we have to do at Netflix when we're working on production applications. These are the exact issues we run into. It's like: stack traces don't work, we need to turn on frame pointers, we need to figure this out. Minecraft, amazingly, has ways you can do custom launches. So I can say: let's do a frame pointer version of Minecraft, and I'll put in the JVM arguments -XX:+PreserveFramePointer. Oh, there it is; got that right. Now let's run the frame pointer version. Oh, don't go full screen. Now, just to be sure... okay, so I did match something, so I am running with the frame pointer. And it still seems to be going just fine, even though I just turned on the frame pointer for Java. Now I'll do my profile. I should mine things, but I haven't figured out what key I press to mine things. Who knows? Who knows how to play this game? It's the mouse press. Oh yeah, here we go; now I'm doing it, I'm mining things. I should have brought my son up to help me do this presentation; my son's nine years old. Eight years old. So, okay, now things look different. When I look at these stack traces, once I get past the compiler, which is busy, we have these unknowns. Okay, this looks much better: unknown, unknown, unknown, rather than just having one line and then it stops. The reason it says unknown is that the location of those frames is in the heap, or in some mmapped segment. This is where the C2 compiler has compiled methods on the fly, and then put them in memory and run them. This is JIT compilation, just-in-time compilation. Fortunately, there is a way to do symbol translation for these. The Linux perf tool came up with it: supplemental symbol files. They live in /tmp; they're perf-PID.map.
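The perf map convention is simple enough to sketch: each line of the file is a start address, a size, and a symbol name, and resolving an address is a range lookup. The entries below are fabricated.

```python
# Sketch of resolving a JIT'd address via a /tmp/perf-<pid>.map style file.
perf_map = """\
7f5e40001000 40 Lnet/minecraft/server/tick
7f5e40001040 80 Lbnq/render
"""

symbols = []
for line in perf_map.splitlines():
    start, size, name = line.split(None, 2)   # hex start, hex size, symbol
    symbols.append((int(start, 16), int(size, 16), name))

def resolve(addr):
    for start, size, name in symbols:
        if start <= addr < start + size:      # address falls inside this range
            return name
    return "[unknown]"

print(resolve(0x7F5E40001050))
```

An address with no covering entry stays "[unknown]", which is exactly what a stale map produces after the JIT moves a method.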
They live in slash temp their perf dash pit dot map And for run times like java and node.js and various other ones that have jit compilation There's usually a way you can dump a symbol table So the perf can read it There is a open source java agent called perf map agent which can do this for us And I've written a utility called j maps, which is on my flame graph repo If I run j maps It snapshots the symbol table So there it is That's what it looks like So it's got addresses and then what the whatever the java thing was So by running j maps I can then profile it again and then Okay, now it looks a bit different So now we're starting to see the Minecraft source code BNQ BNV BNV If you've seen this sort of thing before it looks like they've got like some sort of uh Obfuscation utility, but sometimes it's not obfuscation. Sometimes it's just for compression They just want to compress down the symbol names So there's some sort of compressor or obfuscator, but at least we're actually doing some of the translations now There's a lot of these interpreter frames as well And that's because with the jvm it Initially runs methods in the interpreter and after it hits a compile threshold or then compile it and then Run it natively and we want it to run natively because then it has a symbol address and and location in memory In fact, we want it to run natively so much. I'm just going to modify modify it yet again launch options And say pile threshold equals 10 Did I get that right? 
It's been a while since I've used it... compile... yeah, CompileThreshold, that's it. So now what I'm saying is: after only 10 invocations of a method, you should compile it, and once it's compiled we'll be able to see it. Okay, now if I do my profile... actually, I'm going to do a different type of profile, because there's a problem with this jmaps, in that jmaps takes a snapshot of the symbol table at one instant. So if I run jmaps, and then I talk to you for 15 seconds, and then I run profile, there's a window between running jmaps and running profile where the symbols may have changed, and so sometimes the symbol translation gets a bit off. I'm going to demo a different tool, bpftrace, and I'm going to write a profiler as a one-liner: profile at 99 hertz, with a predicate that the PID matches the positional parameter, frequency-counting the user stack trace, ustack. And in the END probe I can do this: run jmaps, then print out our stack traces, and then clear them. This is great because I can do some custom things. I'm now switching from bcc, which is profile, to bpftrace, which is the newer BPF tracer. And it's nice because I'm saying: run jmaps immediately before you use that symbol translation, to minimize the possibility of churn. Oh, there's a pig. Can I give the pig the grass that I'm holding? No. Okay, let's have a look. So it did the symbol translation, you might have seen it really quickly, and then immediately we printed things out. There should be fewer of these interpreter lines now, and things are looking, should be looking, better. What I'm going to do next is turn this into a flame graph, so we can see a flame graph of Minecraft. I'll go back to profile; make sure there's only just one of them. profile has a -f option to output in folded format, and I'll just do it. And then I can take that, turn it into a flame graph, and open that up in my browser. Go here.
Oh, I was in the bpftrace directory. All right. So now I can see it. Anyone used flame graphs before? Awesome, this is like almost everyone. So I can mouse over, and I can see that while I was playing, 30 percent of our time was in the C2 compiler. You wouldn't have thought it while I was walking around looking at that pig and throwing grass at it, but the CPU time was actually mostly the compiler. And then over here is the stack trace for Java. I need to merge these; the symbols should be joined together. Anyway, let me fix that. For some reason there's these extra semicolons that perf-map-agent is sticking into things, and I'm just going to use sed to get rid of them. I need a /g in there. There we go, much better. So now I can browse this, and I can see all the time in Java, and I can quantify it and look for optimizations. It would be more exciting, but they seem to have run it through an obfuscator, so I've just got these characters. Now, so far that's a neat demonstration of CPU profiling, but BPF gets much more exciting than that, because we can do off-CPU profiling. We can instrument reads and writes.
We can instrument mutex locks and all the other exciting things. So here's an off-CPU profile of Minecraft. As part of bcc there's a tool called offcputime; it's another tool that I wrote. What it does is measure the time from when a thread blocks and goes off-CPU to when it comes back on again, and it does that with the stack trace. It's the counterpart to CPU profiling. With CPU profiling I can browse around and quantify how much time I'm spending on-CPU and which frames are doing it, so I can see that 13% of the time was in this lax-y function, whatever that is, and there's all the child functions. And so if I'm the developer of the code, this is awesome. If I'm not the developer of the code, and I'm an operator, a sysadmin, it's still fairly useful, because I can still get some information on the target software, even though I probably don't understand most of the functions. But that's CPU time. I want to know this for off-CPU time. If you look at a flame graph for CPU time and one for off-CPU time, that's all of time; that's a hundred percent of time. So it means you can basically take on any performance issue. So, to do my off-CPU profiling, I want Java... yeah, let's just do it; back to the game; make some more holes. All right, so, off-CPU profiling, and here we go, there's lots of stacks. So here there's some lambdas happening, in getColor, happening in Minecraft. Okay, so that's interesting. I think the time is in microseconds, so 13 milliseconds in total was spent here, for whatever reason. But if you look at a lot of these stacks, something looks wrong. We're back to these unknowns, and if you look closely, a lot of them are over pthread_cond_wait, or read, or epoll_wait, or things like that. Why is the stack broken now? I thought I fixed the stack traces. Anyone tell me why they're broken now? Again, we've hit this in production at Netflix. This is why I wanted to demonstrate this.
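The off-CPU measurement just described can be sketched as an event fold: timestamp the switch off-CPU, and on the switch back in, charge the elapsed time to the blocked thread's stack. The events, thread names, and times (microseconds) below are fabricated.

```python
# Sketch of offcputime-style aggregation over scheduler events.
from collections import defaultdict

events = [  # (time_us, thread, "off"/"on", stack captured at block time)
    (1000,  "t1", "off", "java;epoll_wait"),
    (1500,  "t2", "off", "java;pthread_cond_wait"),
    (14000, "t1", "on",  None),
    (14500, "t2", "on",  None),
]

blocked_at = {}               # thread -> (block time, blocked stack)
offcpu = defaultdict(int)     # stack -> total microseconds blocked
for t, thread, kind, stack in events:
    if kind == "off":
        blocked_at[thread] = (t, stack)
    else:
        t0, stack0 = blocked_at.pop(thread)
        offcpu[stack0] += t - t0   # charge the wait to the blocking stack

print(dict(offcpu))
# {'java;epoll_wait': 13000, 'java;pthread_cond_wait': 13000}
```

Doing this summation in kernel memory, rather than emitting every context switch, is what makes the tool practical at millions of switches per second.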
This is an excellent simulation of the grittiness of real-world analysis. It's broken because epoll_wait and pthread_mutex_lock and all of those come from libc, well, libpthread, and they're compiled with -fomit-frame-pointer, the gcc default. So even though it's only one frame deep, you're walking down from kernel context, you get to libpthread, and it's like a black hole: it sucks up the frame pointer and you can't go any further, and so I'm missing the rest of the stacks. How am I going to fix this? Well, let's do this again; go back to my launch options. So here's something I made earlier: it's /usr/bin/java, dash whatever, and it does an LD_PRELOAD of versions of libc and libpthread that I compiled earlier with the frame pointer, and then it goes and runs Java. So I've done my own compilation of those libraries, and I called this wrapper java-libc-framepointer. And Minecraft is awesome, in that you can just tune this stuff. So the Java executable is my version, java-libc-framepointer; and then save that. Okay, so now let's play Minecraft. I'll get my profile ready. Okay. Look around. Okay, it's better. Actually, I need to run jmaps; I forgot to run jmaps. So now we're able to go. So here's an example: this VM periodic task stack would normally stop here, at pthread_cond_wait, because libc would throw out the frame pointer and you couldn't go any further. Now I can keep going; I can see that it came from os::PlatformEvent, monitor wait, and so on; WatcherThread sleep. And so now we've fixed those stacks as well. And now for the final Minecraft demo, I can do... this is a flame graph. Back to the game; dig some holes, I guess that's what you do. Now I can do... okay, that looks good. Remember that sed thing I did earlier? I probably need to do that as well; I should fix perf-map-agent. Now I can feed it into a flame graph. I'm going to change the background color, because this is an off-CPU flame graph, not a CPU flame graph.
I want that as a visual reminder. All right, it translates nicely. Make that font size a bit bigger so you can read it. Okay, so now if I browse across, this is off-CPU time. I need to set the samples. So I can see that seven seconds was in this stack, seven seconds was in this stack, 22 seconds was here, and if I zoom in I see that these are a lot of threads that are waiting for work. I probably was tracing for seven seconds, I'm guessing, and then I hit Ctrl-C, and the ones that are larger are a multiple of seven seconds, because it's probably aggregated three threads together on the Netty server IO, so it's getting to 21. So here, again, this is a gritty example. This is what really happens when you do off-CPU analysis: you look at it and you go, wait a minute, this isn't interesting at all, this is just all the threads sleeping. What I was interested in is: when am I blocking during something interesting, like when you're actually playing the game? I need a way to exclude all this stuff. I need to exclude all these sleeping threads and only look at the application threads. The CPU flame graph will give us a big clue as to where that is. So if this is the game over here, I'm in the game when I'm in this thing, MinecraftServer; that's the name of that function, it's really small on the screen. Okay, so let's profile it. Oh, I need to do my jmaps as well: jmaps, offcputime, and I'm going to be grepping this out of the profile. Now when I go to generate it... okay, go back to the other one. That's what it looks like: this is all our sleeping threads, it's not very interesting. I can just say grep MinecraftServer, because the folded format is one stack trace per line, which is great, because I can use grep, sed, awk, and I can just manipulate it however I want. Bam. Now this is the off-CPU time just playing the game, and I can see I've still got some sleep in here, but there's another stack that's pretty thin. There we go.
I should be grepping for whatever this is, MinecraftServer B. So now here's the real blocking time while I'm actually playing the game, in between the frames. And I can see there's some blocking time here; whatever that is, some abstract wrapping iterator, and we're in ObjectMonitor, complete monitor locking. So I've got some blocking there. And this is really the first time I've browsed through this stuff, so it's pretty cool. So we've got a lot of locks, you know, monitor lock without safepoint check, and it's telling me how many microseconds are in each of these code paths. This is really awesome. So if I give the developer the CPU flame graph and the off-CPU flame graph, they can browse them and find where the CPU time is consumed and where the blocking time is consumed during the application. BPF makes this possible, because off-CPU tracing is actually really expensive: I'm tracing the scheduler, context switches, and they can happen millions of times a second. So you really want the ability to do in-kernel aggregation and in-kernel frequency counts, which is what BPF can do. Dumping scheduler events out to a perf.data file, which I've done before, can be a gigabyte per second, so it's not really practical. This makes it all much more practical. Right, I'll stop playing Minecraft now. Let's look through the slides, and I'll go back and do some more demos. Enhanced BPF, also known as just BPF, can do lots of things; it's kind of crazy to explain. It began life at a company called PLUMgrid; they were doing it for software-defined networking, and Alexei Starovoitov came up with it. That was so it could handle packets: you could define some sort of proxy or router or whatever you want in software, and it would handle it. This is really good.
It's really good for us because packets can be very frequent — millions of packets a second. Since this was designed from the beginning to handle very high packet rates — very high event rates — it's been optimized, and so we can use it for things like off-CPU analysis and scheduler analysis. It's been used for lots of other things as well: denial-of-service mitigation, intrusion detection, container security, observability (what I do), bpfilter, even device drivers — there's an infrared device driver written in BPF.

There have been a number of discussions about BPF taking over everything, to the point where it seems that the ability to write a user-defined program and have the kernel run it — with that user-defined program gaining more and more capabilities — sounds a lot like a microkernel. Microkernels are where you write stuff in user space and have a minimal kernel that runs it, and so it's a little bit scary to think about microkernels and Linux. But anyway, I think it was Jonathan Corbet who first came up with that realization that we're heading towards a microkernel.

User-defined BPF programs go through a verifier that checks they're safe, and then BPF can attach to socket events, kprobes, uprobes, tracepoints, perf events, and more. If you're an end user, the BCC repository has lots of canned tools, so you can just use them straight away. I've only demoed two so far — profile and offcputime — but there are lots more. The tools have man pages, example files, and documentation, and usually you can get BCC on a system nowadays just by doing apt-get install or yum install. You need at least Linux 4.4.
It's it's kind of a sliding window because BPF was added in parts And there's another front end called BPF trace Which I was working on a lot last year BPF trace is good for powerful one-liners and custom scripts And so with BPF trace I can say BPF trace t is for trace point Cisco sys enter open print out the file name or do a histogram of the read size distribution by process things like that so Very nice very very simple Syntax and you can write lengthy programs in BPF trace I thought I pressed f5 Shift f5 right I can save timestamps. I can do custom latency calculations and so on EPF is solving new things off-CPU analysis was this Cool thing we really wanted to do but it was so prohibitive to to trace schedule or events Now EPF is letting us do it. Another thing it's letting us do is I can save stack traces when a thread blocks Now I can save stack traces when a wake-up event happens and I can associate wake-up threads with blocked wake-up stack traces of block stack traces So I can get take that off-CPU time example further with Minecraft Because sometimes when you're looking at a lock and you're saying unblocked on this lock your next question is Well, why why was I blocked for 10 milliseconds on this lock? Who was holding the lock for 10 milliseconds? I need to know them Well, you can know them if you trace the wake-up events So the kernel has to manage this the kernel has to so a thread releases a lock The kernel then does a wake-up from that thread to the blocked thread So you grab the stack trace there and you associate it here Doing that association was something we've never done before in kernel contacts. Dtrace could not do it But it's something that BPF can do There's lots of frontends for BPF Raw BPF is where you write instructions like this BPF instruction level programming There's some samples inside the kernel source code. I think very few people are ever going to need to do this. 
It's really hard. Also, if you look in the kernel source code, there are C examples as well, which are a little bit higher level. But we've spent years trying to make it even easier. We came up with the BCC frontend, which has a Python interface, a C++ interface, and a Lua interface so far, and I imagine people are going to add other interfaces — other languages — to BCC as well. That lets you define your kernel program in C, with the frontend that manages it in Python or whatever. With BCC, it's the first time I can put an entire program on a slide — the other ones are way too long. And then there's bpftrace, where I can do a one-liner like this; it's very simple.

I've been updating this slide for a while as things get developed. bpftrace is now approaching maturity: we're about to release version 0.9, and we've been working on API stability. Sysdig just announced eBPF support, which is great — you can run Sysdig without a kernel module, and it will allow Sysdig to do more. I keep having to bump perf along, because perf gets more features in every version. I keep having to bump ftrace along because it gets more features with hist triggers and synthetic events — but I've been moving it downwards as well, because it's getting harder to use as hist triggers get more powerful. trace-cmd — well, no, actually trace-cmd is a frontend, so I should add trace-cmd, and also KernelShark; I know, I should add those, you're right. If you're interested in what trace-cmd and KernelShark are about, I'm sure Steven will tell you tomorrow. And then on the right I've got the BPF tool set: raw BPF, which is brutal to use, and then it gets easier and easier — BCC, which has a lot of those canned tools, and then bpftrace, the easiest of them all. As an example, just to have one in the slides, this is a BCC tool called biolatency.
It's showing the disk I/O latency as a histogram. It's efficient: the count column is what's maintained in kernel context, and only that count column is copied to user space, where it gets printed out. In prior implementations I had to dump out every disk event and post-process in user space, and that's too expensive. That's the BCC program for biolatency — kind of long. Here's the bpftrace version of biolatency; it fits on a slide — well, it's on a slide in a way you can read it. So: kprobe on blk_account_io_start, save a timestamp. Here I'm using arg0 — the block I/O request pointer — as a unique identifier. And on completion, if I've got the timestamp, do the histogram, delete the saved timestamp, and we're done. So much easier to read than the BCC version.

However, BCC is more powerful, because you're in Python for a bunch of it, and in Python I can do argument parsing, I can use socket libraries, I can do whatever I want. So that's the difference between these tools. bpftrace is great for hacking up a quick custom tool that does something simple; but once you're saying no, this needs to be an agent, and it needs to support Google protobufs and this and that — then you want it to be a BCC tool, because then you get to use all the libraries. So: BCC for canned complex tools and agents, bpftrace for one-liners and custom scripts. That summarizes it for an end user pretty well.

You've got all this new visibility into all sorts of different subsystems. Just to give you a couple of examples: execsnoop tells me what software is running. So if I run the Minecraft launcher — let me get out of here, okay — execsnoop tells me what's happening when Minecraft starts up. I can say: okay, so that ran, and then it ran this thing, and then it ran these other things. Okay, thank goodness.
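The biolatency pattern described above — timestamp on issue keyed by the request pointer, log2 histogram bucket increment on completion — can be sketched in plain Python. In the real tool both dicts are BPF maps updated in kernel context, and only the histogram summary crosses to user space; the request IDs and timestamps below are made up:

```python
start = {}   # request id -> issue timestamp (us); a BPF hash map in the real tool
hist = {}    # log2(latency) bucket -> count; only this summary reaches user space

def on_issue(req, ts_us):
    start[req] = ts_us

def on_complete(req, ts_us):
    t0 = start.pop(req, None)
    if t0 is None:
        return                               # completion without a traced start
    bucket = (ts_us - t0).bit_length()       # log2 bucketing, like hist()
    hist[bucket] = hist.get(bucket, 0) + 1

on_issue(0xA, 0); on_complete(0xA, 300)      # 300 us  -> bucket 9 (256-511 us)
on_issue(0xB, 0); on_complete(0xB, 400)      # 400 us  -> bucket 9
on_issue(0xC, 0); on_complete(0xC, 5000)     # 5000 us -> bucket 13 (4096-8191 us)
print(sorted(hist.items()))
```

Because the per-event work is just two map operations, the cost stays low even at high I/O rates — that's the whole point of doing the aggregation in kernel context.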
It didn't have my Minecraft password in there, come to think of it. Good stuff. opensnoop shows all the files that are opened — in fact, there it goes. It's great for locating config files and log files, things that can help you troubleshoot or debug software. bashreadline is a different kind of example: it instruments the /bin/bash program every time it does a readline, and just prints out what was entered on bash shells. I use it because it's an example of user-space tracing.

But for performance tools, there are things like ext4slower. I wonder if Minecraft does anything — so ext4slower will show slow ext4... oh, I should be able to do sub-second. It shows slow file system I/O, and you can give it a threshold: I'm saying show me I/O slower than one millisecond at the file system level, or I could say zero milliseconds for everything. Okay, lots of stuff happening. That's great for proving that there's a file-system-level issue before you go down to the disks, because at the file system interface you've still got process context, file names, and all sorts of things.

Oh, I should have run tcplife on Minecraft when I started it up. Or tcpconnect, hmm. So tcpconnect — oh, but there wasn't anything from here yet. When does it check my authentication? Let's try tcplife. Maybe it's using UDP. I thought when you started up Minecraft it went and checked that you were authenticated. This is all Chrome, Chrome, Chrome. Maybe it's not TCP — maybe I have to run a UDP tool. Okay, I haven't written them yet. I haven't written UDP versions of these yet; I should write them. So anyway, lots of tools. There should be a gap over here — yeah, there's a gap down here.
There are no UDP tools. If you'd like, write some UDP tools so that we can analyze those as well. So anyway: lots of new tools for performance analysis, software troubleshooting, and debugging. For bpftrace there are fewer published tools so far, because you're more likely to write custom tools for various things.

This is where bpftrace development is right now. It started in December 2016; in October 2018 we were major-feature complete, at least for version one of bpftrace. We released version 0.8 so that it could be packaged for Debian, and we're working on version 0.9 right now — there's only one thing left to fix — and then we'll get to version 1.0, where at least the known bugs will be fixed. It'll be nice to get to API stability; we're not quite there yet. At the moment I haven't done many blog posts about bpftrace, because for every blog post I do, I know I'll have to go back and change the syntax — we are still tweaking things — but we're getting close to the point where it will be pretty much set in stone, and that will make it much easier to begin writing documentation. Packaging work has been going on; as I said, I was in emails with Canonical yesterday about getting it packaged for Ubuntu.

The syntax of bpftrace is inspired by many things from the past, like awk and DTrace and SystemTap and C. So I've got my probe — kprobe:do_nanosleep — an optional filter, and then an action. Here I'm frequency-counting comm, which is the process name: @[comm]++. So the question is: when you use bpftrace, how do you know what variables are available to you? Because I just used comm. Well, I wrote the reference guide — I spent a long time writing it — and in the reference guide there is a list of built-in variables. So pid — it's all in there.
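The @[comm]++ action described above is just a frequency count in a map, bumped per event and printed when tracing ends. A plain-Python equivalent, with an invented event stream standing in for the probe firings:

```python
from collections import Counter

# bpftrace's `kprobe:do_nanosleep { @[comm]++; }` boils down to incrementing
# a per-key counter in a map; the map is dumped when tracing ends.
freq = Counter()
for comm in ["java", "sshd", "java", "chrome", "java"]:
    freq[comm] += 1          # @[comm]++

print(freq.most_common())    # java called do_nanosleep most often here
```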
It's all in the reference guide. Yeah, if you search for pid, that's this. So there is a reference guide for bpftrace, and one for BCC, and I wrote them both. I also wrote tutorials and things to get you started. The reference guide is really long — you could print the thing out. It covers the different probe types that are available: kernel function tracing, user function tracing, tracepoints, profiling, and hardware events as well — we can access PMCs. We added shortcuts so you don't have to type "tracepoint" all the time; you can just do t. Filters use boolean operators — straightforward. For actions there's per-event output, things like printf() and system(); there's also time(), so you can do custom time strings. And then there are BPF map summaries: the @ represents a map, so @ = count(), or just @++, or @ = hist(). As I say in the slides, they're all in the reference guide. There are lots of functions: log2 histograms, linear histograms, counting, sum, min, max, average, symbol resolution, and working with maps. The variable types — yep.

[Audience] Does the symbol resolution have problems with the Spectre/Meltdown mitigations? Yeah — more often nowadays we're booting kernels with the address space layout randomization stuff and the PIE stuff, and we've had to do work in BCC to work around it.
And so bpftrace calls into the BCC symbol resolution functions, and there was a while where that was broken, because of this — because we were more often doing address space layout randomization. So yep.

"If you want to do any security, turn off eBPF" — some people are saying that BPF isn't secure. But the thing you've got to remember about security with BPF is that you have to be root to use it, so far. I can't make the BPF syscall unless I'm root — unless I have CAP_SYS_ADMIN. So the ship's already sailed: you're the root user. I've heard of people turning off BPF and ftrace because they're security-minded, but what's the vector? You have to be root anyway to use this stuff. The ship has sailed, right? We do eventually want to make the BPF syscall usable by non-root users — and when that happens, yes, then people will want to turn off BPF. There's already a sysctl for disabling unprivileged access. perf has been through this: you can look up the CVEs that perf had to deal with, because you can run perf as non-root, of course. But we're already getting prepared for that — unprivileged_bpf_disabled, you can already turn that on, and some people would say it should be on by default; whatever, fine, because it basically is the default behavior now.

So in the future-future, which may take like five years, there may be non-root BPF access. A big user of non-root BPF access is going to be containers, because people want to run the BPF tools within a container. Right now at Netflix I log into the host and analyze containers from the host, but I can't do it from inside the container because it doesn't have CAP_SYS_ADMIN.

So: there are lots of functions in the reference guide; variable types — basic variables, associative arrays; the built-ins are in the reference guide too (I should put that on the slide as well). And now when you look at a program like biolatency, after you've gone through the syntax, it should make more sense.
So there's my kprobe on a kernel function: save something, then on completion print it out. If you're not a kernel engineer, you may find it intimidating to go through kernel internals — but you can get a lot of value out of BPF without ever writing code, just by using the canned tools. And as for writing tools: try to use tracepoints as much as possible, because they're stable and maybe better documented than kprobes in the first place. I drew a picture of how bpftrace works internally; that's online if you're really interested. We do go and use BCC, so it leverages that, and the program gets sent through a lex and yacc parser, and then there are Clang parsers for tracepoints and for coming up with the structs, plus semantic analyzers and so on. There are still things to fix — maybe you want to help out.

I want to mention a couple of other tools. At Netflix we've been working on Vector, our per-instance analysis tool. It's open source, and it's already got flame graphs — I've been adding things like the off-CPU flame graphs — and recently my colleagues and volunteers have added latency heat maps to it, which is really great. So, like the histograms we saw — or maybe we didn't see, because I didn't demo them — here's my ext4 file system. What boring histograms. Okay, this is a little bit better. These are histograms, right, so I can look at the latency distribution in microseconds. Oh look, it's bimodal, and there's a latency outlier that's four to eight milliseconds. I want to see that over time.
I don't want to see that just for the duration of tracing — I want to see it every single second, and that's what latency heat maps allow you to do. I first came up with these like 10 years ago, and it's good to see them start to get used again. We've been open-sourcing that.

And I did want to stress something about GUIs. I think most people who use BPF in the future-future will be using it via a GUI, and they probably won't even know that BPF is there. I can go to the command line and demo all this cool stuff, but the CLI end users are going to be far fewer than the GUI end users. At Netflix, SSHing onto an instance is a last resort — we have something like 150,000 instances running; which one do you SSH onto to work on a problem? We have to make things work through GUIs to operate at that scale. So most of the BPF users at Netflix are going to be using it from Vector; they're not going to be SSHing on and doing stuff. Vector also has flame graphs, like the off-CPU flame graph I just generated from Minecraft — we've automated that as a push-button inside Vector. Maybe I could have demoed this with Vector and I wouldn't have had to run all the commands. All right. That's the tool users — yeah, a lot of people use it from GUIs, and that's fine.

If you're going to develop BPF tools, bpftrace makes it a lot easier, so I expect there to be a few thousand people writing custom things for their application environments with bpftrace. There are probably going to be a couple hundred people doing BCC tools — custom agents — because BCC is quite powerful. Remember, with BCC you can use all sorts of extra libraries and interface it to things. So you might look at BPF and say, well, it's great, but at work we already have this dashboard monitoring tool and we need to plug BPF into our existing monitoring thing — then you probably want to look into BCC, because it's Python.
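A latency heat map, as described above, is just a histogram column per time interval: the x-axis is seconds, the y-axis is latency buckets, and the color is the count. A minimal sketch with synthetic events (the numbers are invented; real heat maps come from traced I/O):

```python
from collections import defaultdict

# heat[(second, log2_latency_bucket)] -> count: one histogram column per
# second, so bimodal distributions and outliers stay visible over time
# instead of being averaged away into a single summary.
heat = defaultdict(int)

def record(ts_us, latency_us):
    second = ts_us // 1_000_000
    bucket = latency_us.bit_length()     # same log2 bucketing as the histograms
    heat[(second, bucket)] += 1

record(0, 100); record(0, 120)           # second 0: the fast mode (64-127 us)
record(1_500_000, 6000)                  # second 1: a 4-8 ms outlier
for (sec, bucket), n in sorted(heat.items()):
    print(f"t={sec}s {2**(bucket-1)}-{2**bucket - 1}us: {n}")
```

A GUI like Vector then draws each (second, bucket) cell as a colored pixel rather than printing rows.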
There's probably a library so you can connect the dots. I don't think many people are going to write raw BPF or use the C interface, because we just don't need to — we've got higher-level interfaces. Some people will, because they'll have some particular reason for doing that. And to mention a couple of other tools: there's kubectl-trace for running bpftrace on Kubernetes, and I mentioned earlier that Sysdig has added BPF support, and we'll see eBPF added to other tools as well. (Bless you.)

So, takeaways: you can easily explore systems in powerful, new, and custom ways. You can also contribute. And please, if you can, share war stories and blog posts and give talks about these tools — it's a lot of new stuff, and if you come up with a cool usage, sharing it is going to help the BPF community. I'll publish these slides online, and that's my talk. So I'll take some questions while I'm here. What questions do you have about BPF, BCC, bpftrace? Yes.

So the question was: last year we wanted to change the name of BPF, because BPF is kind of a terrible name. Maybe I should have said it stands for Berkeley Packet Filter — but nowadays what we're doing has nothing to do with Berkeley, little to do with packets, and little to do with filtering. A better name would be something like virtual kernel instruction set — VKIS — or virtual machine, anyway. Alexei and I are engineers and we're terrible at naming, and we couldn't come up with a better name. Alexei has settled on BPF — and he wants it to be BPF, not eBPF — because he sees it as equivalent to x86 and ARM. BPF is an instruction set — a virtual machine instruction set — so he wants it to be three letters like everything else: x86, ARM, BPF. That's why we're stuck with BPF. Other questions?
Yes. So, every two weeks we have an IO Visor concall. The Linux Foundation project that hosts the BCC and bpftrace repositories is called IO Visor. The name IO Visor is also not a great name, because we're doing more than just I/O — anyway, IO Visor, and we have a concall every two weeks that's open to the public, and there's an IO Visor mailing list. I'm usually on that concall; we spend an hour discussing what's going on and who's developing what. So that's a good place to find everyone, and there's also the IO Visor mailing list for email. When I say IO Visor, I should type it in — iovisor, like that; you can see the IO Visor project. If you do an internet search for "iovisor mailing list" you'll find it. Brenden Blanco sends out the meeting invite, and then you can join the meeting.

[Audience] Has there been any progress on uprobes in Go? I did some work on this a while ago, but I haven't touched it for a while, so I don't know — I'd have to look up the current state. Yes.

Okay, the state of the tooling in CentOS and RHEL. CentOS I'm not actually sure about. RHEL 7.6 beta, and I think 7.6 official, have put eBPF for observability into the kernel — not eBPF for networking yet, things like XDP — and they've packaged up the BCC tools. So if you're on RHEL you can add BCC in whatever the RHEL way is, and it works. I cannot get my head around it still being a 3.10-plus-backports kernel — surely it's no longer based on 3.10, surely it's based on 4.x for all of this to work. It is stable. Anyway, I don't know — I haven't actually used the Red Hat one yet, so I haven't dug into it, but I'm glad Red Hat have done the support. Other questions? Yes.
So USDT probes — that's user-level statically defined tracing — I couldn't quite hear, but I think you're saying that when I use USDT I have to specify a process. Okay. USDT probes are handy because they're a stable API to use; however, often you have to specify -p for a process. Now, what's great about BPF is that I can do bpftrace with a uprobe on /bin/bash readline — "uprobes"; my screen word-wrapped — and trace all invocations of bash. I didn't have to do a -p, right? That's how a lot of the probes work; we get used to this, and it's really fun. But when you start to use USDT — in fact, it's in here as well — I have to specify a -p. And I don't want to specify -p; I just want to do it system-wide, like I'm used to doing. That's really nice.

The problem with USDT probes — if I've got an example... what might have it... do you have some? So here are some probes: ld.so, the dynamic linker, has USDT probes, and I can go and instrument these. The catch is these semaphores, if they exist. Okay, so they're not in use for these ones. Since they're not in use, I should be able to do usdt on this, and then I can trace init_start. This is pretty cool — I'm actually using USDT right now. So ls just called init_start, which is the USDT probe, and I just traced it out of ld.so. This worked — I didn't have to do -p — because if I look at that probe, the semaphore is zero.

So why does this thing exist? It's this concept of is-enabled probes. These probes have arguments, and coming up with a stable probe is not just about a name — it's about arguments. If I trace a database query, what do you want the arguments to be? Well, I want them to be the name of the database, the user who's logged in, and the query string. Okay, sounds good.
What if one of those arguments is computationally expensive to instrument? Say it's just not available as a C string at that point in the code, and you have to go and run a routine to reassemble the string. The thing about USDT probe points is that when you put them in the code, they get compiled into nops, and the nop gets replaced with the jump to go and do the instrumentation. So it's hard to come up with that more complex argument. So they wrap them in basically an if statement: if this probe is being instrumented, only then pay the tax of the expensive computation of coming up with that string.

And then there's the problem of: if I'm in my code, how do I know that there is somebody out there using BPF to trace me? Imagine I'm a C program running instructions — loads and stores. How do I know that somewhere out in the universe someone is running BCC on me? I'm an instruction pointer, right? They do it by having a known address: the tracer actually writes to memory, and I check that known address to see whether it's been flipped to one or zero. If it's been flipped to one, it's like — uh-huh, someone's watching, so now I'll do the expensive stuff and calculate those arguments. That thing is a semaphore; that's what's listed here, and some of the USDT probes use it. And if a probe is semaphore-based, you have to use -p, and then BCC and bpftrace will actually go and flip bits of memory inside the address space of the process. Which sounds a bit scary — hopefully we didn't make a bug and we don't corrupt things — but that's how this works; that's how it has to work. I mean, all this stuff's pretty scary: we're dynamically patching instructions live. That's how this all works. So yeah, that's a long answer, but that's why that is.
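The is-enabled semaphore mechanism described above can be sketched in Python: a counter at a known location that the tracer bumps and the application checks before paying for expensive argument computation. All the names here are invented for illustration; in reality the "semaphore" is an integer in the target process's memory that BCC/bpftrace increments by writing into its address space (hence the need for -p):

```python
# Sketch of a USDT "is-enabled" semaphore guard.

query_probe_semaphore = 0   # the "known address"; nonzero means someone's tracing

def attach_tracer():
    """What the tracer's write into target memory amounts to."""
    global query_probe_semaphore
    query_probe_semaphore += 1

def run_query(raw_parts):
    query_string = None
    if query_probe_semaphore:                # the wrapping if statement
        query_string = " ".join(raw_parts)   # expensive: reassemble the string
        print("probe fired:", query_string)  # stand-in for emitting the probe
    # ... the query runs either way ...
    return query_string

r1 = run_query(["SELECT", "*", "FROM", "events"])   # semaphore 0: no probe cost
attach_tracer()
r2 = run_query(["SELECT", "1"])                     # now the argument is computed
```

So untraced runs never pay for building the argument, which is exactly why semaphore-guarded probes can't simply be enabled system-wide.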
I'll take a couple more questions and then we should end. Yes. For my Minecraft example, it was really a pain to get this to work, because I had to recompile everything with the frame pointer. Now you might say: well, what if it's too expensive to run software with the frame pointer? Please check this out and prove that it's actually that expensive, because every time we've measured it, it's been negligible — like less than one percent. There's one case where we think it was much more than one percent, and I'd love to analyze it and see what went wrong, but I didn't have PMCs and PEBS, so I couldn't get a good profile. So it should be negligible; it should not be a big problem. We shouldn't have to go through all this pain of recompiling things for the frame pointer. I would like Debian and Ubuntu to give me libc packages with the frame pointer, and libpthread with the frame pointer. It's time to bring the frame pointer back for x86, because it's such a pain to live without it. I want to see it turned on everywhere so that we don't have to recompile everything. Too many times I've spoken to other companies about how they deal with it: "Oh, we just recompile the whole software stack with the frame pointer." It's like — thanks. Everyone's doing the same thing; we need to fix it upstream.

Last question. Yeah, we turn on the frame pointer for most microservices — yeah, all the time, yep. There's only one microservice that had a major performance problem, and they said it was 10 percent. Now, there's a lot of weirdness about that microservice, because their stack depth was over a thousand frames — and that's after inlining. This broke Linux perf, so Arnaldo Carvalho de Melo went and fixed the kernel.
That's actually why we have the perf_event_max_stack setting — this here. It was added because of Netflix's microservice that had thousand-frame-deep stacks, so that at least we could start recording the stack traces. And if a thousand sounds bad: that's after Java has inlined — the uninlined stack traces were 3,000 frames deep. So this is an insane application. But they thought the frame pointer cost was 10 percent, and I'm not sure I should believe it. What I want is a profile where I can find out why it's 10 percent. Is it 10 percent because of the extra function prologue instructions? Is it 10 percent because we're busting the i-cache for some reason? To profile all of that accurately I need not just PMCs but PEBS — precise event-based sampling — because you can't profile the function prologue accurately without PEBS, and PEBS is currently not enabled in the cloud. So I can't do PEBS profiling, which annoys me — until I deploy a bare-metal server on EC2 and run the Java stuff there. Then I can do PEBS, then I can look at the function prologues, then I can explain why they had a 10 percent regression.

And at the end of all this, even though they had a 10 percent regression from enabling the frame pointer, they did it anyway, because the value of CPU flame graphs was more important. They initially didn't, for a year — they had this canary, the flame graph API canary — but then they got sick of it, and they turned it on everywhere. We're all going to have to fight this battle. I know, I know — I hate unsolved performance issues. I want my PEBS back so I can put my finger on this: this is where all those cycles are, this is why it was 10 percent. The kernel? Come on, the kernel doesn't have stacks that deep — well, perhaps it's still pretty big. RHEL has frame pointers — don't turn frame pointers off. I want frame pointers. How are we going to stack walk?
You're going to write ORC in BPF? Yeah, yeah — you mean use ORC, the unwind data from the debug information. But how are we going to code ORC in BPF? I know that, I know that, but I need to implement it in a BPF program. Okay — yeah, it can be done. We can have a helper — a BPF helper function — and then call it. All right, fine. I'm around at the conference, so feel free to ask me more questions. Thank you very much. And yes, if you're in this room, please stay — Steven's talk is going to be awesome as well.

Testing. Check, check. Can everybody hear me? Anyone? Kind of? Not really — how about now, any better? Yeah, I'm a loud talker anyway. Is that better? Go figure — whispering into the microphone isn't effective. All right: the big hand is on the four, the little hand is on the six, so we can kick things off. Hello everyone, and welcome to "Curse of Cardinality: History and Evolution of Monitoring at Scale." This is basically going to be a little history of how we have — yeah, down, okay, closer, all right — how we have traditionally monitored our services at Ticketmaster, how we do it now, and how we intend, or hope, to do it in the future. My name — next slide — is Michael Goodness. I'm a systems architect and tech lead on the Kubernetes and cloud native team at Ticketmaster. I just celebrated my 105th week at Ticketmaster — that's two years and one week as of today, actually, so congratulations to me; self-congratulations are the best congratulations. I have three years of production experience with Kubernetes and two and a half years of production experience with Prometheus, both of which are relevant, because my part of the talk today is going to focus on Prometheus and Kubernetes. I'll let Abe introduce himself. — I'm Michael's former co-worker
I'm now a solutions engineer at gravitational. We make enterprise software We're kind of best known for Privilege access management gateway called teleport. It's open source. It does shortly of certificate Management very similar to brendan greg's bless But written in go lang and it's also a full recording proxy man. I was there for about two and a half years And it's been a while since I left So i'm going to cover the parts of the story that lead up to the journey with kubernetes and prometheus And these days because I Do customer facing engineering? I actually don't get to play with technology that often Maybe in my spare time I get to compare products or or dive into something Interesting for the most part I go around and speak to teams. I talk to them about their problems and one of the defining features of that is that When you're in a large corporate environment you think that you're unique or you think that The rest of the world must be doing something better But the the ball truth of it is we kind of all have the same problems and we all have very very similar solutions So this is going to be a path The journey of monitoring over the last 20 years or so a lot of it's going to cover ticket master I'm we have a couple hecklers in the crowd. We probably have no more details about some of the stuff that I do Because I just inherited what was there at the point in time when I joined So with that let's just dive in and go way back To the early evolution of history of monitoring How many of folks here would have used MRTG way back in the late 90s multi router traffic traffic. 
Yeah, this thing was awesome. MRTG, like all good projects, started as an individual engineer scratching his own itch. This was a guy named Tobi, in England, who needed to prove to the university administrators that they needed more internet bandwidth. So he basically wrote a Perl script that would scrape the gauges and the counters from the networking equipment and store the results in ASCII files, and then he used this to prove that, hey, we're downloading games, or whatever they were actually doing. This was the early era of Linux and of open source, so he gave it back to the community; he published it.

One of the fascinating things is that it spawned another component, RRDtool. MRTG used ASCII files; it was something quick and dirty to get the job done. Once it picked up steam and all these community contributions came in and it became a popular thing, he went back and wrote what ended up being the canonical implementation of the round-robin database: how do you store metrics, archive them, and build graphs off of them? Everything we do today with Prometheus, Grafana, and all the other great stuff probably originated with this. If you talk to someone who's been doing this for a very long time, they all remember RRDtool. They remember that reverse-Polish-notation, kind of goofy way to specify how things should be compacted, or which consolidations should be applied to the data. It had a few interesting and notable qualities: these are fixed-size files.
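That fixed-size, round-robin idea is easy to sketch. Here is a toy illustration in Python (this is just the concept, not the real librrd API): allocate a fixed number of slots up front and overwrite the oldest sample on wrap-around, so the footprint on disk never grows.

```python
# Toy sketch of RRDtool's round-robin archive idea (hypothetical code):
# a fixed-size buffer whose size is decided up front, with the oldest
# samples overwritten in place as new ones arrive.
class RoundRobinArchive:
    def __init__(self, slots):
        self.slots = slots
        self.values = [None] * slots   # fixed size, allocated once
        self.head = 0                  # next slot to overwrite

    def update(self, value):
        self.values[self.head] = value
        self.head = (self.head + 1) % self.slots  # wrap around

    def series(self):
        # oldest-to-newest view of whatever is currently retained
        ordered = self.values[self.head:] + self.values[:self.head]
        return [v for v in ordered if v is not None]

archive = RoundRobinArchive(slots=4)
for sample in [1, 2, 3, 4, 5, 6]:
    archive.update(sample)
print(archive.series())  # the four newest samples: [3, 4, 5, 6]
```

The real tool layers consolidation (averaging older samples into coarser archives) on top of this, which is what gives it the predictable sizing and performance described next.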
It was this very clean C library; even today it's at version 1.7 on GitHub, which is kind of a testament to, you know, a university network administrator who just had to solve his own problem but took a lot of care and craft in doing it. One of the other interesting things, because it's a round-robin database, is that it had really predictable performance. When you sized it, you specified exactly how big the database was going to be and exactly what the compaction factors were going to be. It was maybe a little IOPS-constrained, or you needed the right kind of disk underneath, but for the most part it was this self-contained, very simple thing.

From there we go to Nagios, and Nagios is another thing that probably everyone has had experience with. Again, it's something that came from the university world; the guy's name is Ethan, out of Minnesota. Yahoo is famous for being a huge proponent of it.

After that came OpenTSDB. This actually came from an engineer at StumbleUpon, one of the early, kind of Foursquare-type social sharing things. They already had a large Hadoop cluster, and so he was like, wow, metrics is basically column-oriented data; I can magically scale out, and I can overcome a lot of the deficiencies in prior time-series databases by using HBase as my basis. As part of this he created this thing called the Time Series Daemon, with a very simple wire protocol where you would basically specify a dotted metric name, the actual counter or gauge value, and then some tags that go along with it. And then he did a really simple shell-out to gnuplot to display a graph. If you've ever dealt with graphing packages, or you've ever dealt with web performance data, gnuplot is pretty awesome, because if you try to do a scatter plot with tens of thousands of points in some kind of JavaScript graphing package, it'll kill your laptop, but in gnuplot you can see things that you otherwise can't see.

So, being this clustered thing that always kept data and was infinitely scaling, if you were in a large enough organization, this was the godsend. This is like, oh wow, we're going to do this, this is awesome. It also fit right in with the StatsD notion of emitting metrics. This slide kind of shows you the topology. In an environment like Ticketmaster it became part of the base build of systems, and before you knew it we just had thousands and thousands of metrics per second streaming, at the system layer and eventually at the application layer. It became kind of like the watering hole that people would look to, because it was relatively real time, it didn't break that often, and it was just awesome: Splunk would get slammed, and saved searches would take a while or be fifteen minutes behind, but this was nearly real time and almost always worked.

There were some problems, like with anything. One of the events that happened: one of my co-workers (I probably shouldn't use his name, but his name is John) instrumented a caching layer. It basically had a pipe into something like Redis or memcached, and he was trying to find out why it was hot; there was a shared mutex that was locking and causing the entire backlog for all the systems. So he added OpenTSDB metric emission right at that point.
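For reference, a line sent to the TSD over that simple wire protocol looks roughly like this. A minimal sketch with hypothetical metric and tag names (the real daemon accepts these lines over a plain TCP socket):

```python
# Sketch of the simple OpenTSDB "telnet-style" wire format the talk
# describes: a dotted metric name, a timestamp, the counter/gauge value,
# and then key=value tags. Metric and tag names here are made up.
def tsdb_put_line(metric, timestamp, value, **tags):
    tag_str = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {timestamp} {value} {tag_str}"

line = tsdb_put_line("proc.loadavg.1min", 1288946927, 0.36,
                     host="web01", dc="lax")
print(line)  # put proc.loadavg.1min 1288946927 0.36 dc=lax host=web01
```

The simplicity is the point: any shell script or app could open a socket and emit these, which is exactly how it spread through the base build.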
And the first time he did, there was a big load event. Another time, because of the way the TSD daemon would queue metrics, if there was another outage this could pile on after the fact and basically flood the data center with connections; it didn't do connection reuse, and the internal topology didn't have elements to do connection reuse either. The other fascinating thing about how HBase works is that if you did a basically data-warehouse-style query for, say, historical metrics (because this thing kept metrics forever, like what happened two years ago in some performance event) while a lot of other people were trying to read fresh data at the same time, that could very easily thrash the cache, which would then cause the entire thing to just kind of wobble and not go very well.

The other weird thing is the curse of cardinality. Any time you have a highly reliable data store that gives you real-time metrics off production, people want to stick funny things into it. But the nature of how the OpenTSDB tables work is that you have a limited namespace, maybe 16 million keys. Once you use those up, you're done. And people always wanted to, hey, measure the page load time off of the venue, or off of the artist, or something that was a linearly expanding set of things. So that was something that just kind of limited the usefulness of OpenTSDB.

Now this brings us into the Kubernetes era, and this was something where I was showing the slides to Michael and he said, wow, I didn't realize that you fixed the OpenTSDB thundering herd problem with Kubernetes. What we actually did, around Kubernetes 1.2 or 1.4: when HBase got backed up, when the region servers were doing some massive scan because someone was asking for terabytes of data, they would kind of fall over, and the normal operating procedure before Kubernetes was that someone would make a phone call.
They'd call the internal services group: hey, can you restart OpenTSDB? And they would go and restart it. So our first live production Kubernetes deployment within Ticketmaster was actually to run the Time Series Daemon as a pod on Kubernetes, because as soon as one of these massive queries came in, it would fail its health check and get restarted; you had this natural back pressure provided by the mechanism of how Kubernetes liveness probes and health checks work.

No, I didn't know that. So at this point we get to where Prometheus comes in. In, excuse me, late 2016, Ticketmaster made the decision to move into the cloud and into Kubernetes, and at the same time, or really probably shortly before that decision, Prometheus reached 1.0 and had been a member project of the CNCF for some time. At that point Ticketmaster joined the CNCF as end-user members, and we decided that since we were going into Kubernetes, and Prometheus had a very good native Kubernetes story, that was going to be our next generation of monitoring platform.

So, who's familiar with Prometheus already? Okay, quite a few, so for some of you this section may be remedial; I apologize for that, but I'll try to make it interesting. We have some dashboards that I can show off. It's non-prod traffic, sorry, but dashboards nonetheless.

A quick little background on Prometheus: it originated at SoundCloud, by a couple of engineers, former SREs at Google. It was based on Borgmon.
I don't think anybody would say it was a ripoff of Borgmon, but, you know, you could call it something like an open-source Borgmon. What I found interesting about Prometheus from day one is that it uses a pull model. You instrument your applications, then you make those applications discoverable, and then you point your Prometheus instance at those applications and define a schedule, and on that schedule Prometheus will scrape the metrics endpoint and get the metrics.

The data is dimensional: each time series comes in with labels that you can define, and you can do things like rewrite those labels. And then, obviously, you use PromQL, the query language, which is a very powerful language for running queries, using those labels to select the correct data at the correct values.

Up until relatively recently, Prometheus was simply local storage; they didn't really let you export the data elsewhere. The amount of memory your Prometheus had available to it, plus the size of the persistent volume, was what you had. But then they introduced remote storage, so that you can, typically, downsample and then send those metrics to some other system for long-term storage and long-term reporting. A couple of examples of remote storage backends: Graphite (there's actually a Graphite back end), and then there's Thanos, which came out of... sorry, I shouldn't have mentioned that, because I'm not sure where it came out of. But Thanos is one option, and there are several others out there.

One of the things that really appealed to us is that there are dozens of exporters and integrations available for Prometheus. Kubernetes in particular exposes all of its metrics using the Prometheus format. We at Ticketmaster have
written tm-aws, an AWS limits exporter. It's a way of scraping CloudWatch to find out how close we are to reaching our account limits on certain resources, and that's actually one of the dashboards I can show. And then there are several other integrations for shipping metrics to different places, and even bindings for different languages.

So at this point, what I'd like to do is show a quick demonstration of those dashboards I've been mentioning. Let's see if I can tempt the demo gods here. Yes, all right. So, this is the Prometheus console. You can see at the top: alerts, graphs, status. The way the service discovery works is it will show you, based on the configuration that you passed into Prometheus, the services that are available for scraping. So we've defined a number of jobs here: container metrics, etcd, Kubernetes pods, service endpoints, and so on. kube-state-metrics is a particular call-out. As I mentioned before, Kubernetes exposes Prometheus metrics in the kubelet, that is, the node agent, so you can get things like container CPU usage and container memory usage. But kube-state-metrics exposes metrics about the Kubernetes API itself: number of deployments, number of stateful sets, number of pods, etc. So that ends up being a very handy exporter for Kubernetes monitoring, and specifically monitoring of workloads.

You define your alerts here, and actually, surprisingly, I'm surprised to see that they're all green; that's not something I expected when I fired this up earlier today, for better or for worse.

So, the way that PromQL works: this is just the quick-and-dirty query editor. If you're going to create a dashboard you'd probably do that in Grafana, but for quick evaluation of a query you could do something like `node_cpu`.
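As an aside, what a scrape like this actually pulls from each target's metrics endpoint is plain text: one line per time series, labels in braces. A minimal, stdlib-only sketch of rendering that exposition format, with made-up metric and label names (real applications would use a Prometheus client library rather than hand-rolling this):

```python
# Render metrics in the Prometheus text exposition format: HELP and TYPE
# comment lines, then one "name{labels} value" line per time series.
# Metric names and label values below are hypothetical.
def render_metric(name, samples, help_text, metric_type="gauge"):
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

text = render_metric(
    "http_requests_total",
    [({"method": "get", "code": "200"}, 1027),
     ({"method": "post", "code": "500"}, 3)],
    "Total HTTP requests.",
    "counter",
)
print(text)
```

Each of those lines is one time series with its label dimensions, which is exactly what the queries below select and aggregate over.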
That query will show the CPU counters on each node, and this is a fairly large cluster, so it's going to take a little bit. But then you can see the dimensions that are available on each of those time series: availability zone, cpu (for multi-processor instances, of course), the EC2 instance name, and some of the labels that we assign to those nodes.

If we want to drill down a little bit more, let's say we're only interested in the idle time of these CPUs. So we're going to add a dimension to our query, and, why am I not finding it, `mode="idle"`. So now it's only going to return those series that have mode equal to idle: `node_cpu{mode="idle"}`. That's limited our series. However, we don't necessarily care about the differences between the idle time on cpu0 or cpu1 on each node, so let's drill down even more with a sum by instance: `sum by (instance) (node_cpu{mode="idle"})`. Now we're summing the time series that have mode set to idle, by instance, so we're getting kind of the total idle time per instance.

We can get even fancier now by showing the actual non-idle time, so this is going to be actual CPU utilization; bear with me, I don't have this one quite memorized, so I'm not going to bother, I'm just going to copy and paste. So here we get basically the average utilization percentage. It's doing an irate over the last five minutes, but this is really the instantaneous CPU utilization percentage. You can see by these numbers that we have some severely underutilized worker nodes in our clusters. We are very aware of that, and it's something we're trying to address with our users, but our workloads end up being much more memory-intensive than CPU-intensive.
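For anyone curious what that irate() is doing under the hood: roughly, it takes the last two samples of a counter within the lookback window and computes a per-second rate, treating a value that goes backwards as a counter reset. A simplified sketch (the real implementation lives inside the Prometheus query engine):

```python
# Roughly what PromQL's irate() computes: the per-second rate between the
# two most recent samples of a counter, with a counter reset (value going
# backwards, e.g. after a process restart) treated as a restart from zero.
def irate(samples):
    """samples: time-ordered list of (timestamp_seconds, counter_value)."""
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    if v2 < v1:          # counter reset detected
        v1 = 0
    return (v2 - v1) / (t2 - t1)

# 600 requests over the last 15 seconds -> 40 requests/second
print(irate([(100, 1000), (115, 1600)]))  # 40.0
```

Because only the last two samples matter, irate() reacts fast but is spiky; rate() averages over the whole window instead, which is why dashboards often prefer it.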
So in order to spin up the right amount of memory, we end up wasting quite a bit of CPU. And then there is a built-in visualization tool in the Prometheus console, so you can get a pretty good idea, visually, of what your query evaluates to.

This is all well and good, except you don't want to have to write those queries every time you want to look at some numbers, which is where, of course, Grafana comes into the mix. Grafana is an open source visualization tool, and it's undergoing just some ridiculous growth, really. They're building in lots of new functionality. With Prometheus, typically you would use its bundled Alertmanager to handle defining your alerts and then triggering those alerts; Grafana is also adding that, and there's some integration coming so that you can visualize your Alertmanager alerts through Grafana and handle them appropriately.

But what we're really interested in today, or what I hope you're interested in today, is seeing some of the things that we measure. So this, as the dashboard is helpfully named, is our AWS capacity dashboard, and this is showing network transmit and receive. Again, instantaneous, although I do think we scrape these metrics every five minutes, so it's instantaneous but averaged over the last five minutes. You can drill into each one of these and see which instances are transmitting more data than the others.
Same with reception. You can create some pretty interesting gauges. I mentioned before the AWS limits exporter that we have. One thing we've run into is availability of application load balancers in our accounts. We primarily use ALBs for our ingresses, and so we've had to put special monitoring on those so that we don't exhaust our pool while users are trying to actually create ingresses into their applications. We're sitting in a pretty good place right now. Once we hit 75 percent, we do have an alert set up so that it'll let us know, and then it will actually... it won't create a ticket, because as far as I know the AWS API for creating tickets hasn't been exposed to us; however, it will fire off a notice to us that it's time to create a ticket, and it actually preformats the ticket so it's an easy copy and paste into the AWS console. Clearly this has bitten us enough times that we thought it was justified going through all of those steps and having some automation around it.

Target groups, again, are something we've been bitten by; it looks like we're in a pretty good place there. ENI usage: all of these, again, kind of restating, are things that we have either been bitten by or think will be interesting, so we've created dashboards around them, and in some of these cases alerts.

And then, of course, we want to see who's doing what. As ops engineers, we're a little nosy, so it's good to see who's using up all of our ALB capacity, and EBS capacity as well. We don't run a lot of stateful applications, but we do have showback mechanisms in place, and so we need to be able to see who's using what, because our Kubernetes clusters run in a central account.
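The limit-alerting idea described above, notifying at 75 percent of an AWS account limit, reduces to a very small check once an exporter has the numbers. A sketch with hypothetical resource names, usage figures, and limits; the real exporter pulls these values from CloudWatch and the AWS APIs:

```python
# Flag any AWS resource whose usage has crossed an alert threshold.
# Resource names and the (used, limit) numbers below are made up.
ALERT_THRESHOLD = 0.75  # the talk mentions alerting at 75% of the limit

def over_threshold(usage):
    """usage: dict of resource -> (used, limit); returns resources to alert on."""
    return sorted(
        name for name, (used, limit) in usage.items()
        if used / limit >= ALERT_THRESHOLD
    )

usage = {
    "application-load-balancers": (41, 50),    # 82% used -> alert
    "target-groups":              (120, 3000),
    "elastic-network-interfaces": (300, 5000),
}
print(over_threshold(usage))  # ['application-load-balancers']
```

In practice the check runs as an alerting rule over the exporter's metrics, and the notification carries the preformatted limit-increase request described above.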
So these dashboards are one way in which we do accounting and kind of attribution of assets. EC2 capacity, yeah, so by instance type, showing us what types of instances we have. This is total capacity. Okay, disclaimer: I'm not sure what this one is actually showing us.

So I'm going to switch over here; this is another dashboard we have. We've been running NGINX as a shared cluster service, so that anybody who needs ingress and doesn't want to use an ALB can opt into our centrally managed NGINX ingress controller, and this is a dashboard that monitors that controller. We can see request volume, controller success rate, how many controllers we're actually running (this is an example of where kube-state-metrics comes into play), request volume, success rate at 100 percent, like to see that, and so on.

So that's kind of a quick example. I'm not going to go through the process of actually creating a dashboard, but I thought it would be useful to see just a couple of examples of what we do with Prometheus and Kubernetes. When it comes to deployment, and this is actually something I intended to mention earlier, let me see if I can now switch back. Good enough, right?
When we started using Prometheus, we were using the community Helm chart: just one-off Prometheus installations. We run one Prometheus instance for the cluster as a whole, and then each application team also deploys their own Prometheus. That way they instrument their own applications, they're free to label their metrics, they determine their metrics, they name them, they label them however they like, and then they set up their own Prometheus and create rules and dashboards around it.

That got to be quite a bit of work. The Prometheus Helm chart got to be kind of a handful; I get to say that because I actually created it. So, CoreOS came up with a project designed to make Prometheus more operable in Kubernetes. It's called the Prometheus Operator, the first example of their operator pattern. It's designed so that you define the characteristics of the Prometheus instance that you want, you feed that into your Kubernetes cluster, and this operator brings it up and makes it happen; it abstracts some of the operational details of running a Prometheus. And so that's what we went to. We moved away from the community Helm chart and created our own that actually instantiates these Prometheus resource definitions, and then teams can also plug in their own rules and their own service definitions, and what comes out the other end is their Prometheus instance with their rules defined. And then we provide a centrally managed Grafana that they can use for dashboarding. That's worked very well.
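One recurring theme in this story, from OpenTSDB's roughly 16-million-entry UID table to Prometheus labels, is that every distinct tag or label combination is its own time series, so a single unbounded dimension multiplies everything else. A toy illustration with made-up label counts:

```python
# Why "curse of cardinality": the number of time series is the product of
# the distinct values of every label. One unbounded business dimension
# (here a hypothetical "artist" label) blows up the total.
from math import prod

label_values = {
    "endpoint": 50,       # URL paths
    "status":   5,        # HTTP status classes
    "host":     2_000,    # servers
    "artist":   100_000,  # unbounded business dimension: the mistake
}

series = prod(label_values.values())
print(f"{series:,} series")  # 50,000,000,000 series
```

Against OpenTSDB's keyspace that exhausts the UIDs outright; against Prometheus it translates into memory and query cost, which is the challenge with team-owned instances discussed next.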
There are some other things we want to embrace from the Prometheus Operator, at least within the Kubernetes cluster, like ServiceMonitor, which, rather than having you add annotations to your services, lets you define a ServiceMonitor that does the heavy lifting of making sure services are tracked correctly. PrometheusRule is a similar way of declaratively defining the rules you want to create for Prometheus, rather than having to maintain a long list; well, if you're trying to define a lot of rules it's still going to be a long list, but it's a uniform way of doing it.

So Prometheus has worked quite well for us. There are, of course, challenges. Because we allow teams, or force teams, to run their own Prometheuses, it can be challenging sometimes to keep them behaving correctly. Cardinality, or limiting unnecessary cardinality, is not a very well understood concept for some teams, so they end up measuring
So you they end up measuring um You know instead of doing things like averaging by a specific uh application they'll average by Uh, you know every instance of an application and when you've got a highly scalable workload You're going to end up with a ton of metrics and That's you know, that's not really what you're looking for You're looking for you know an average across all of the instances of that application um So that that's been a challenge that comes along with For uh, so in kubernetes in recent versions of kubernetes You can do custom auto scaling or auto scaling based on custom metrics and The way that we want to implement that is through the use of prometheus The problem is that uh that metrics api is is a singleton it runs centrally So you have to have one prometheus instance that provides all of your metrics that are intended for custom auto scaling as i said We don't have any kind of limitations on teams right now in terms of what their metrics are named So if somebody wants team a wants to scale based on requests per second But team b wants to uh Uh Wants to scale based on Uh, you know http requests per second that that's small difference in names means that you know We would have to have we would have to provide both metrics from that central prometheus Uh, it's not insurmountable, of course Uh, but it is a it is a challenge The other uh, is that we tried providing long-term storage as a service. Uh, i've mentioned before thanos Uh, but it turned out that that just and that uh added quite a bit of operational overhead Some additional overhead to um to my team the kubernetes team and our devx Uh, toolers Uh, so we've had to deprecate. Uh, we've had to shut that down or actually just make it unsupported If a team wants to store long-term metrics, they're kind of on the hook to to handle that themselves So what next i've been paused on this slide for a while what next? 
In a single word, it's, you know, this track: observability. We've come a long way in terms of actually capturing metrics and being able to see what our applications are doing, but we're certainly not there; we're not at the end. I think we will get better. We are going to be introducing more metrics, better metrics, and educating teams on what makes a good metric and what makes a good alerting rule. We recently introduced distributed tracing to our platform portfolio, using OpenTracing and Jaeger. That tracing project was actually a runner-up for our 2018 excellence awards, which means that observability, even in these early days of distributed tracing at Ticketmaster, is already showing quite a bit of potential.

And then better logging, which we have not been very good at. As Abe said, when you give teams the ability to output whatever logs they want, there tends to be some sloppiness. So we're actually in the process of spinning up an observability team to create some of these processes, to work with the developers on them, and to develop or adopt some tooling to make our observability better. At Ticketmaster, every time a big show goes on sale, we are essentially asking people to come in and DDoS us. We need visibility, and that's really what we're trying to get at, and have been, with all of these tools; as the ecosystem and the technology mature, that's what we're trying to get at.

So that's really all we have today. We'll open it up for questions if anybody's curious about the history of monitoring at Ticketmaster, where we're going, where we're at. Sir? Sure.
So the question is: I mentioned that other tools are available, Honeycomb, Sysdig, Datadog; observability is a huge area right now. What would be the advantage of adopting one or several of those technologies over kind of the DIY approach? And that is absolutely something we're looking at as well. Because, while using the Prometheus Operator has taken a lot of the operational burden of running Prometheus off of our shoulders, there's still somebody who has to figure out what observability means and what the correct metrics are. The only vendor I know we've engaged with is around OpenTracing, and I'm drawing a blank on the name at the moment. However, we've looked, kind of cursorily, at Honeycomb. Honeycomb has a very interesting story that I think we want to look at even more closely, because we're not in the business of developing our own monitoring solutions, especially when you look at our history and how many of these systems we've gone through, each time trying to get a little bit better. I think there's certainly something to be said for letting the pros do it. So I can't give specific examples, but it's certainly something that we're looking at, as, yeah, I think we should. Others? Yeah.

Right, so the question is: how do we deploy, how do we operate OpenTracing? Specifically, are we using a service mesh? No, we're not doing service mesh yet. We've certainly had requests for service mesh.
There's a lot of interest among the teams, for Istio in particular. The reason we've kind of avoided it so far is the operational overhead, the manageability that it introduces, and that's just not something my team has been eager to bring on board yet. And there's also the question of the cost-benefit ratio. We ask teams why they're interested; tracing is certainly at the top of the list, but in relation to the rest of the overhead, that equation just doesn't work out for us. So, in terms of what we are doing: yes, we inject a container into each pod, and then we're using Jaeger to gather those traces.

Right, right. Yep. And, again, we do acknowledge that we're making teams do more work, and we really would love to provide something that would just take care of at least the 80/20. There's still value in actually instrumenting your own application with traces, but it would be really nice, and it would really push our Kubernetes adoption and our observability, if we could provide something like that on day one when they deploy. We're just not there yet.

Sure. Any other questions? I think everybody gets some time back, then. So thanks for coming.

The AWS limits, so there's one, it's, there's