Testing? Test, test, test. Is that OK? Cool. So yeah, thanks everyone for joining us today. I am very pleased to introduce Timo Tijhof, who is a senior performance engineer at the Wikimedia Foundation, a member of the architecture committee, and a long-time contributor to both our codebase and our projects, with too many accomplishments to enumerate. So yeah, with that said, I'm going to pass it over to Timo. — Thank you, Ori. So my name is Timo Tijhof. I'm on the performance team at Wikimedia, and I'm here to talk to you about creating useful dashboards. Over the past five years, I've worked on various different products and built dashboards to visualize the data for those products. It isn't my main job to present data, but I became very interested in exploring different ways of representing it. More recently, I've been doing this with the aim of improving web performance: making web pages load faster and making the back end more efficient. But even before that, we had many different kinds of dashboards to visualize data. We have dashboards that, for example, track the usage of deprecated functions in JavaScript, and dashboards that track the rate of exceptions in the Wikipedia environment. So we have many different dashboards. My aim here today is to help you create better dashboards and to give you an understanding of the problems you might run into doing that. This is going to be most useful if you've already worked a little bit with Grafana. But this doesn't apply to just Grafana — it applies to any data visualization — so I'm hoping this can be useful even if you're not specifically interested in Grafana. I also have a separate talk that introduces the specifics of Grafana itself; this will be general advice. For questions, please wait until the end. You can either raise your hand if you're in San Francisco, or if you're following the live stream, you can direct them at Ori in the IRC channel.
So let's get started. I'm framing this talk in terms of various lessons that we've learned — lessons that I've learned on the performance team and before that — using real-world examples, all of which are public on grafana.wikimedia.org, if you're interested. And we'll start with managing fluctuation. This is the rough overview of what I'm going to talk about: I have five subjects to cover, we'll talk for about 30 minutes, and then we'll leave time for questions. So let's start with fluctuation. Fluctuation is basically when data is always changing, even when you didn't do anything that could cause that change. So you have no deployments, no software was changed, and yet your data is still constantly changing. And that can be quite frustrating, because it means that that line on the graph doesn't really tell you anything about how your software is behaving — it tells you everything about how your users are behaving. And maybe that's okay. Maybe you look at a graph to get an estimate of your current average. Maybe that's your use case: you don't need to follow the line, you just want to see what the current average is. What is our traffic right now? What is our performance right now? Those kinds of questions. And there are a few ways you can handle that. For example, you might not actually need a graph; maybe what you actually want is a single stat. Here's an example with several single-stat panels. You just pull out that one number that you want to focus on and leave the graph for what it is. You might not even need a graph at all — sometimes the best graph is no graph at all. You just want to pull out that number and focus on that instead. And here's an example from one of our own dashboards, with error statistics on 500 server errors. Again, you just pull out that percentage and present it for what it is.
Here's another one, navigation timing again: what is the current first-paint time? Just answer that question immediately, without making people parse a graph. And then you have more space on the dashboard for the things that you actually need a graph for, or for more advanced information. But say you have fluctuation and you actually do want that graph. What do you do? You need to deal with that fluctuation in some way — you need to massage the data through some functions. So let's cover what we can do there. For the purposes of this talk, I'll split fluctuation into two broad categories: one is seasonality, and the other is oscillation. I'll explain what those terms mean in this context. Let's start with seasonality. Seasonality is basically when your data changes over a large period of time, not a small period of time. For example, you have a particular pattern that repeats itself every day, or every week, or even every year. And these kinds of data tend to have very smooth swings. This is just an example of what that might look like: a relatively low frequency and a very predictable pattern. These kinds of patterns are very, very common. For example, if you look at our page view statistics, there's a daily pattern to them. Even though we have a very international audience, there are still very clear daily spikes and daily drops, and there's nothing bad about them — it's just the daily pattern that we have. But it can make it harder to spot anomalies in your data, because the data is changing by default; you need to account for that in some way. The way I like to describe seasonality is as small changes on a small scale and large changes on a large scale. What that means is that if you take a slice of, say, 10 minutes of your data, you would see almost no change at all.
But if you zoom out to, say, the whole week, you see a very large change that happened slowly over time. And what you can do about that is actually very little. You want to embrace this seasonality — that's the first lesson I want to start with today: embrace seasonality. You don't want to cancel it out. By default, one tends to try to cancel out this noise and make a flat line: one line that doesn't move at all unless something important happens. That's a good instinct to have, but not when it comes to seasonality. (We can switch mics, potentially. All right, I'm hoping this is better. Okay.) So, we were at: embrace seasonality. When you have seasonal change, you want to try and keep it for what it is. You don't want to remove it, but that means you need a different way to detect change. So let's get back to this graph. This is a graph of Wikimedia traffic on a daily basis — a 24-hour capture. And you can see that it varies from 4 million to 8 million per minute, which is a very wide range to cover. If you were to average that out, you would lose a lot of detail and a lot of the value of the graph. (In that case, if you just want the single number, you might be more interested in a single stat.) And a danger with averages is that they can be a very difficult thing to explain; it took me a while to realize what the problem is with an average in your graph. I'm going to explain it with a quote, often attributed to Bill Gates. It goes like this: "Then there is the man who drowned crossing a stream with an average depth of six inches." Six inches is about 15 centimeters. So imagine a stream that's 15 centimeters deep — how can you drown in that? Well, the key word here is "average," right? There could be a huge crater in the middle of that stream, or a huge mountain forming an island in the middle of it, and you wouldn't know from the average.
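The point of the quote can be made concrete with a tiny sketch — the depth numbers here are made up for illustration — showing that two series with the identical mean can have completely different shapes, so the mean alone can't distinguish them:

```python
import statistics

# Two "streams" with the same average depth but very different shapes.
# The mean alone can't tell a flat stream from one with a deep crater.
flat = [6, 6, 6, 6, 6, 6, 6]       # uniformly 6 inches deep
crater = [1, 1, 1, 36, 1, 1, 1]    # mostly shallow, one deep spot

print(statistics.mean(flat), statistics.mean(crater))  # identical averages

# Extremes (or percentiles) expose what the mean hides:
print(max(flat), max(crater))      # 6 vs 36
```

This is why the talk keeps steering away from plain averages toward medians, percentiles, and shape-preserving views of the data.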
So let's get back to our graph. If you want to keep this shape somehow and still detect change, how can you spot a regression from the normal pattern? One common way to do this in Grafana — and in most statistics systems — is by adding a time shift. A time shift simply adds a second series to your graph that shows the same data from, for example, the week before. Then you can follow the two lines and see where they diverge — and if they don't, that means the pattern is normal, or at least it's occurring at the exact same time as last week. That's one way; there are multiple ways to do this. You can also use forecast algorithms like Holt-Winters, which are also built into Graphite. So: embrace seasonality. The second type of fluctuation we sometimes deal with is oscillating data. Oscillating data is very different in its behavior: it tends to have a very high frequency and behaves very erratically. It's kind of the opposite of seasonality, in that it has very large changes on a small scale and very small changes on a large scale. A typical example of this is page load times on Wikipedia. A page load time could be a quarter of a second, or it could be up to a minute, depending on your proximity to the data center, the kind of device you have, your internet speed, and the page you're looking at — it could be a short page or a very large page. And so this means that your data changes constantly within the span of a minute or even a few seconds. It tends to look something like this: a very high frequency. The way you deal with that is actually very different. In that case, you do want to squeeze it into something more compressed — you want to make a stable baseline that only changes when actual values have changed in some significant way. So: soften the erratic metrics and create a stable baseline. And a typical way to do this is by using a median.
But there's more than one median out there — there are very different ways of producing a median. Here's an example of page load times. As you can see, it's very erratic — but, counterintuitive as it might be, this is actually already a median. So why is it still so erratic? The reason it's still so wobbly is that it's a median per minute. Even one minute of data is actually very tiny — tiny enough that you still have massive changes from one minute to the next, because there isn't an even distribution of different devices within any given minute, right? It changes throughout the whole day. So what you can apply here is a moving average, where instead of just one single minute you take a larger span of data and take the median over that. If you apply that to this graph, you get something like this. Again, this is the exact same data over the exact same time span, but suddenly we can actually spot a pattern here: it dropped about 70 milliseconds over the course of that day. So have we now answered the question — is this a regression, is this an improvement? Actually, not quite yet. Because in this case the data is both oscillating and seasonal, which is also very common. It actually always drops to that point at that time of day, and it comes back up at the same point the next day; it always follows that pattern. So in this case, you want to account for both types of fluctuation. Fortunately, that's very easy to handle in Graphite, because you can apply multiple functions to a single metric: you can nest them, piping the output of one function into the next, through many different alterations. In doing this you can also mutate your data too much, so always be careful with that. But in this case, you want to first push it through a moving average, and then you can add a time shift as well.
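As a rough sketch of what that smoothing does — not Graphite's actual implementation, and with invented per-minute numbers — here's a moving median over a noisy series:

```python
import statistics

def moving_median(values, window):
    """Median over a sliding window of the last `window` points,
    similar in spirit to Graphite's movingMedian() function."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - window + 1)
        out.append(statistics.median(values[lo:i + 1]))
    return out

# Hypothetical per-minute page load medians (ms): noisy minute to minute.
per_minute = [310, 295, 340, 280, 330, 300, 325, 290]
print(moving_median(per_minute, 5))
```

Each output point summarizes several minutes at once, so a one-minute blip barely moves the line, while a sustained shift still shows up.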
And that shows you that it actually wasn't a regression or an improvement after all — it's just the normal daily pattern. Now, these have all been graphs of situations where everything was still normal; there was no anomaly to speak of. So what does it look like when there is an anomaly? Well, in that case the lines diverge, as you can see here. During the first two days, everything was still normal: the two lines fit like gloves, they just follow each other and show exactly the same pattern. But then around the middle, around January 1st, the green line starts to go down. And this was actually a performance improvement that we got to claim that year. So, getting back to averages: sometimes an average can be useful, but only if you look at a very large period of time. For example, this is that same metric spanned out over a period of four months. You can use a function like summarize to take an average per day or per week and graph that, to see very large-scale changes — where you're headed right now and where you've been in the past — and get a larger perspective on things. Again, these functions are documented well in Graphite and Grafana, and you can discover them within the interface, so don't worry too much about memorizing their names. So again: soften the erratic metrics, but account for seasonality when you have to. Now, say you've compressed all the data into a nice stable line — what about too much compression? Things can go bad there as well. So the opposite of fluctuation is compression. Where fluctuation is change made invisible by too much change, compression is change made invisible by not enough change. And this tends to be caused by a very different class of problems: not a problem in the data, but a problem in the presentation.
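The summarize idea can be sketched like this — a simplified stand-in for Graphite's summarize(series, "1d", "avg"), with toy numbers:

```python
import statistics

def summarize(values, bucket_size):
    """Downsample a series into per-bucket averages, a simplified
    stand-in for Graphite's summarize() with an "avg" aggregator."""
    return [
        statistics.mean(values[i:i + bucket_size])
        for i in range(0, len(values), bucket_size)
    ]

# Toy per-minute values condensed into buckets of 4 minutes; with real
# data the bucket would be 1440 minutes for a per-day average.
print(summarize([1, 2, 3, 4, 5, 6, 7, 8], 4))   # [2.5, 6.5]
```

Used over months of data, each bucket becomes one point per day or per week, which is exactly when an average stops being misleading and starts showing the long-term trend.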
If you have a graph with many different data points, your lines get squeezed down into a flat line. So the typical characteristic of compression is a flat line that doesn't appear to move even though the values did move — the change was just too small to notice. You want to avoid compression in general. You don't want to squash your metrics so much that you can no longer see the difference between what is a change and what is not a change. Here's an example of that, again using navigation timing. We've got three metrics here: first paint, page load time, and MediaWiki loading time. Don't worry too much about what these three are, but they each have different behaviors of their own. They're all related, so it's interesting to track them together — but because they're in a single graph, you can't really tell which one is changing when, because they all become almost flat. The top one is actually going up a little bit, the yellow one is going down a little bit, and the green one is obviously going up a fair bit. But if these were in different graphs, you'd be able to detect these changes much more easily. So you want to try and expose the changes as much as you can. Then you can get a graph like this, where suddenly the change is much more obvious. This may seem like an extravagant use of space if you're looking at a single graph, but remember, you're usually building dashboards which present 10 or 20 different panels. By optimizing the usability of each individual graph, you make it easier to discover problems and to claim improvements. The second main problem with compression is zero binding. Zero binding is an interesting problem. Imagine you have a graph: typically you have your zero point in the bottom left corner, you go from zero to two seconds on the left side, you have your timeline along the bottom, and you get a graph like this. So what is the problem here?
Well, that zero binding is very traditional, but it can also make it very difficult to actually find change. With the zero binding it looks like this, but without the zero binding it can look like this — quite a big difference. Suddenly you can see change where the line previously looked mostly static: it followed a particular pattern but didn't seem to go down much, because it had to take up all that space to get from zero up to the data points — and it's a linear scale, right, so it doesn't get there any quicker. You can drop the zero binding in Grafana and expose that change much more precisely; you effectively zoom in on that range and see the change exactly. So you want to avoid compression where you can. It doesn't just help you focus on individual metrics; it can also help you find patterns that you couldn't previously see. For example, here is ResourceLoader minification efficiency, in terms of a percentage of cache hits and cache misses. These two lines appear to follow each other a little bit, but one looks clearly more significant than the other, as you can see. But that's actually not the case: the blue one is following the exact same pattern as the green one — they even follow the same relative change — but it's impossible to see, because one is compressing the other, since they both have to fit in one graph. You can fix this by applying what is called a second Y axis. And here you can see that while the blue line looked smaller in change, relative to its own history it's actually making the exact same change — which you can now see when you fit them together with a second Y axis, and suddenly they're exactly the same. That can be very helpful, and again, you can use this to determine whether they diverge at any point, to find a regression in one of your minification algorithms, or something like that.
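The effect of the second Y axis can be approximated numerically: rescale each series relative to its own starting value, and the hidden similarity shows up. The hit and miss counts here are invented for illustration:

```python
def relative_change(series):
    """Rescale a series relative to its first value, so series of very
    different magnitudes can be compared on one scale — roughly the
    effect a second Y axis gives you in Grafana."""
    base = series[0]
    return [v / base for v in series]

# Invented cache hit/miss counts: very different magnitudes...
hits   = [8000, 8400, 8800, 8000]
misses = [20, 21, 22, 20]

# ...yet relative to themselves they move identically.
print(relative_change(hits))     # [1.0, 1.05, 1.1, 1.0]
print(relative_change(misses))   # [1.0, 1.05, 1.1, 1.0]
```

When the two rescaled series stop matching, that is the divergence worth investigating.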
So now that you've learned about fluctuation and compression, and split everything out into separate graphs, you have a bazillion graphs — and that's too much as well. You need to be careful about overload. You don't want to have too much on your dashboard; you have to be picky and find the graphs that really matter most. If you find yourself going over your dashboard and routinely ignoring certain graphs — because you don't even know what they mean, or they follow a pattern that can go up or down and you don't know whether that's good or bad, and it just changes by itself while nothing actually happened — you might want to drop that graph. Or maybe you want to figure out what it means and mutate the data in some way, through a few functions, to get a better representation out of it. You want to ask yourself: is this graph useful? Is it helping me answer questions like: can I recognize what is normal? Can I recognize what is a regression? Some of the software that we use has built-in statistics endpoints that expose a whole bunch of properties at once. And it can be very attractive to just graph all of them on your dashboard, because you can use wildcard queries and automatic expansion, and you get a whole dashboard for free. But then you don't know what to look at anymore, or what the individual properties mean, or what kind of values they have. It tends to be more worthwhile to take a few of them that you know the meaning of, give them the right units, give them a good title and all of that, and be able to actually find relationships between the different metrics. If you have a large number of sub-properties, you might actually want to go in a very different direction. Remember that you don't have to print these graphs, right? They're interactive web pages. There's a lot you can do with a graph to make it more interactive, so you can explore the data without having it all on one large page.
For example, Grafana has template variables, which we'll get to in just a second. But in general, you want to avoid creating a graph for every single property, especially if that property comes in hundreds of different variations, or if it's something user-supplied. There are a few things you can do when you have a lot of properties. You typically want to find common patterns — you want to find the few of them that stick out in some way. For that you can use something like a percentage graph or a stacked breakdown, as we do for EventLogging here. We have hundreds of different schemas in production at different points in time, and giving each of them their own graph would make it take a long time to figure out which one is rising or which one is dropping. But you can create a single graph which tracks them all at once. Most of the small ones just follow along at the bottom; they don't take up that much space. You obviously want to disable the legend in this case — it would get very, very big. But you can see here that there are a few things going on; there are patterns. This is a lot more worthwhile than a top 10, where you pick out the top 10 properties and graph just those, because the top 10 is always computed on the current value or the average value over the whole timeline — which would probably make you miss the red spike over there, if its average was below the top 10 at that particular point. So you tend to lose individual spikes that you would otherwise catch if you graph them all. In this case, we're graphing a single property. We also have a lot of data where you have multiple properties that are nested, for example the job queue here. What you can do in that case is aggregate those properties into a single one and take only the sub-property. Graphite has functions like sumSeries and aggregation methods where you can take a whole class of properties and aggregate them by a common sub-property.
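A rough sketch of that kind of aggregation — the metric names and values are made up, in the spirit of Graphite's grouping functions such as groupByNode:

```python
from collections import defaultdict

# Hypothetical nested metrics: jobqueue.<job>.<event> -> value.
metrics = {
    "jobqueue.refreshLinks.inserted": 40,
    "jobqueue.refreshLinks.duplicate": 5,
    "jobqueue.cirrusSearch.inserted": 30,
    "jobqueue.cirrusSearch.duplicate": 25,
}

# Aggregate every job by the common sub-property (the third dot segment),
# collapsing hundreds of jobs into one series per event type.
by_event = defaultdict(int)
for name, value in metrics.items():
    event = name.split(".")[2]
    by_event[event] += value

print(dict(by_event))   # {'inserted': 70, 'duplicate': 30}
```

One series per event type answers "how much insertion or duplication overall?" without a graph per job class.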
And then you can track, for example, all the job queue events where something was inserted, or duplicated, or failed in some way, without tracking each job individually. Of course, you may then still want to answer a question like: which job has the most common duplication, or the most common failure? For that, you can create a separate graph, where you focus on just that particular aspect of your data. There are hundreds of different job queue classes, but by creating a percentage breakdown you can find very quickly which is the most common one. For example, in this case, duplicate inserts are dominated by CirrusSearch at the moment. So again: be picky. Don't create too many graphs on a single dashboard; you don't want to cram them all on top of each other. Find the few data points that really matter to your application, and try to expose those for what they are, without much compression and without functions that misrepresent the data. And ask yourself: can I see in this graph whether the situation is currently normal, or whether there's a regression going on? This can be especially valuable after an outage. Let's say something just went wrong and you fixed something in your application, and you're thinking: wouldn't it be nice if we could automate this, or detect this? Try going back into those graphs and see if you can spot it retroactively. It's okay that you didn't see it at the time — but can you at least retroactively prove that it happened? Does your data actually show it, or does it not? If not, you can try to optimize your graphs. Or maybe you missed it because there were too many graphs — that's also something to keep in mind. And getting back to interactivity: leverage interactivity as much as you can. There are many different features for this in Grafana.
I won't cover them all today, but they allow you to create a very refined subset of graphs that you can dedicate to a particular feature, without duplicating that for every feature in your application. For example, here is EventLogging again. There are hundreds of different schemas running in production, but you can still have a dedicated dashboard for one of those schemas. As a user of one particular aspect of EventLogging, you can bookmark just the view for your particular schema. In the top left here, there's a schema selector — this one — where you can type ahead and find the one schema that actually applies to you. That way you can have one dashboard for all schemas, and yet focus on an individual one, without repeating panels or having to scroll to find the right one. And individual teams can bookmark the ones that are relevant to them. Now, in building these dashboards, it can be very easy to fall into the pitfall of the wrong metric type. There are many different metric types exposed to Graphite via statsd, which is the software we use to aggregate data points in production. I would encourage you to get to know your metric types. There are many different ones, and they might look very similar while meaning completely different things. There's a page that I wrote on Wikitech, which I use as a reference a fair bit, where I've documented most of this, but I'll summarize just the two most important ones here: counters and meters, which are the most common metric types. So you might be inclined to, for example, graph the mean of your counter, right? What is the average rate per second of what is happening in my application — the average number of page views in production, for example. And then you see this line, and you wonder: well, how did that happen?
In this case, it's actually ResourceLoader requests, of which the mean is fixed at 10,001 into infinity. So you want to know: how did that happen? Well, I won't get into the details too much, but basically there are many different dimensions to your data. There is the dimension of your application: what is the value that I'm adding? At some point you've detected 10 page views and you're adding 10 to your counter — but adding 10 to your counter is a single increment that represents the value 10. From statsd's perspective, it's just one message. So there are different perspectives on that same counter, and the main one you want to look at is the rate one, which reflects the actual value that you submitted. But it has many sibling values that can be very confusing if you're not aware they exist. In this case, the ResourceLoader daemon buffers requests in groups of 10,000 and then flushes them out to statsv, which is why the average increment of that statistic is 10,000 — actually 10,001 in this case. But that's not your actual request rate, so this is not what you want. So what do you want instead? Maybe you want the count. This is starting to look a little more realistic, because it follows somewhat of a reasonable shape. But then if you look at the numbers, it says 100, 110. We definitely have more ResourceLoader requests than that in production on a permanent basis, so that can't be right. A count, unfortunately, is also an internal metric — it's not the real thing that you're looking for. In theory, if ResourceLoader were to send a message to statsd for every single request, then the number of messages received would be the same as the number of requests that you counted. But in most cases they're very different things. So that's not what you want either. What you want instead is the rate property. And the rate property, as we can see, represents exactly what you want.
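The arithmetic behind those confusing sibling values can be sketched with made-up numbers matching the story above — a daemon that buffers 10,000 requests per statsd increment:

```python
# Hypothetical buffered counter: the application serves 1,100,000 requests
# per minute, but the daemon sends only one increment per 10,000 requests.
requests_per_minute = 1_100_000
buffer_size = 10_000

increments = requests_per_minute // buffer_size     # messages statsd sees
mean_increment = requests_per_minute / increments   # the misleading "mean"
count = increments                                  # the misleading "count"
rate_per_second = requests_per_minute / 60          # the trustworthy "rate"

print(count)            # 110 -- messages, far below the real request count
print(mean_increment)   # 10000.0 -- the buffer size, not a rate
print(rate_per_second)  # ~18333 requests/second, the actual traffic
```

The "mean" pins itself to the buffer size and the "count" counts messages, while only the rate reflects the values the application actually submitted.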
In this case, it's 1.1 million requests per minute on average, which is the actual rate of ResourceLoader requests right now. The rate property is standardized between all the layers we have — MediaWiki, JavaScript, statsd, Graphite. They all aggregate it correctly, so you can rely on this property always being accurate throughout all the different layers of our metrics pipeline. It's standardized on per-second, so if you want something per minute, you can apply a scaling function and multiply it by 60 to get the number of requests per minute. Again, these functions are all auto-completed whenever you type something into Grafana; you don't have to memorize them, just look them up. There is one more property I want to cover, just in case you run into it — I've seen this too many times to leave it out today — which is the sum property. The name is incredibly deceiving, but it's also an internal property; it's not the one you're looking for. If you graph it out, it appears to represent the exact same number as we had with our rate property earlier. So rate times 60 is the actual requests per minute — the one I told you you can trust — and the sum shows the same value: it is also 1.1 million. And yet this value is wrong. You might be wondering: why is the sum wrong? (Sorry, just had the wrong slide again.) So why is this one wrong? Well, right now we're looking at the last seven days, and everything is still fine. Now let's look at the last 30 days. The shape is still fine, but the number says 16 million. And even the right side of the graph, which covers the last seven days, now has a different number — it's no longer 1.1 million, it's suddenly 15 million. How did that happen? It's the exact same property, the same counter. And this is because sum is not what you think it is.
It is, again, an internal property, which in this case happens to add up all the values that fall within a particular resolution window. It's an internal buffer that Graphite has; you don't want to know about it, but it changes depending on how far back your query goes. If your query covers a short window, the resolution happens to be per minute, which is why it happened to match what you wanted. But when you go back 30 days, it's actually on a per-hour or per-15-minute basis, and then it's a completely different number. So again: if you want to track the rate of your counter — the actual value that your application is sending to Graphite — use the rate property. And if you want it per minute or per hour, you can multiply it as you need. All of these sub-properties are documented, with what they mean and how they interact with each other, and the same applies to meters as well. So for metric types: use the rate for counters, and avoid anything like count or sum — they don't mean what you think they mean. And try to think about suitable metric names. When you use things like template variables, where you can autocomplete and have a dedicated dashboard view for individual metrics, you want to make sure that a wildcard query for all the properties in a list returns things that really are all the same kind of property — not something else that happened to be in that namespace. So when you structure the metrics your application sends to Graphite, make sure that what sits between two dots is always the same kind of thing. In the case of the job queue, we have something called .all — just all the events — but it lives in the same namespace as the individual jobs. That can cause your metrics to come out at twice what they should be, because you're summing all the sub-jobs plus the one that is magically called "all", so the total of that bucket is twice what it should be.
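The double-counting pitfall looks like this in miniature — job names and numbers invented:

```python
# Individual job metrics, e.g. jobqueue.<job>.inserted -> events.
jobs = {"refreshLinks": 40, "htmlCacheUpdate": 25, "cirrusSearch": 35}

# A magic "all" bucket stored in the same namespace as the jobs:
jobs_with_all = dict(jobs, all=sum(jobs.values()))

# A wildcard sum over jobqueue.* now counts every event twice:
wildcard_total = sum(jobs_with_all.values())
print(wildcard_total)   # 200, although only 100 events actually happened
```

Anything a wildcard can match gets included, so a pre-aggregated sibling silently doubles the total.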
So in that case, you can either leave it out and compute the total with something like sumSeries, or you can create a separate bucket for it under a different name. In summary, there are a couple of questions I often ask when building dashboards. From the beginning: single stats — sometimes you might not even need a graph; it might be better to just have a single stat. Can you easily spot a regression when you're looking at your dashboard? That's an important one. You want to avoid things like compression: embrace seasonal change, and expose change as much as you can by removing zero binding and using aggregation functions as necessary. Can you look at a graph and instantly know whether everything is okay, or whether something has changed since the last time you looked at it? You want to avoid cognitive load where you have two graphs that you mentally try to compare with each other; instead, you can use something like a time shift or a second Y axis. And the last one is retrospectives: after something has happened, go back in your graphs, see if you can find that problem retroactively, and improve your graphs that way. I've got two documentation links here, for learning more about Grafana and, specifically, for our Graphite install and how it interacts with statsd — those properties that I mentioned are all documented there. And with that, I'll take questions. Thank you very much. — Do we have any questions on IRC? Anyone in this room? Yes, you over there — use the mic, please. — Hello. Yeah, the other function I find useful is the divideSeries function in Graphite. — Sorry, what was that again? — divideSeries is also one I find useful in some cases. I don't know if you mentioned it. — Yeah, there are lots of functions in Graphite to aggregate data or to expose change in a better way. For example, you can use sumSeries to combine all the sub-properties in a particular series, or you can use division to find a particular one that stands out.
You can use forecast algorithms like Holt-Winters: if your pattern is seasonal but not strictly repeating by the week or by the day, a forecast algorithm gives you a more sophisticated prediction line that you can follow and then see if your data diverges from it. And yeah, there's a whole ton of functions in Graphite, so I would definitely recommend checking out the Graphite documentation and seeing what kind of functions you can use to mutate your data. All right, thank you very much.