Hi everyone, thanks for coming so early. It's been incredibly early for me too, so I'll try to wake you up if I can, but we'll see how it goes. I'm Dimitri, and I'm here to talk to you about benchmarking. I'm actually curious: how many of you have something to do with databases? And how many of you are curious about database performance, benchmarking, profiling and so on? Okay, I see I've got my audience. That's really cool.

One note before we start. Pretty much everybody has a strong opinion about benchmarking and performance, which is why in this talk we'll reduce the scope a little. Here is what usually happens, and what I used to do personally. When you want to evaluate the performance of your database, maybe in conjunction with an application, or separately within some particular environment, you pick some workload generation tool, something that runs TPC-C or, I don't know, TPC-H or TPC-E, whatever exists out there; or maybe you even replicate a real business workload from a copy of the production database. You start the tool, you run it, you get some numbers, you're happy, you report those numbers, maybe in some report up the line, to let other people know what's going on, and then you forget about it. Then maybe a week later you try to reproduce the benchmark and you think: oh crap, the numbers are quite different. That is a very unpleasant situation to be in, and I have to admit that even I personally run into it every single time I try to evaluate something. It's very embarrassing, but that's how I have learned this stuff.

So usually you figure out that maybe there was some misconfiguration, or something was forgotten, or the cloud infrastructure was behaving strangely. You fix it, repeat the benchmark, correct the numbers in the report, and so on. But from this point on we have a problem: you no longer trust the numbers you get from off-the-shelf tooling. You start to ask yourself: how can I get more confidence in the results and reports I'm getting from these tools? What kind of methods could I use to boost my confidence? And that is essentially the scope of today's talk: we will investigate what kind of methods and tooling we have, especially from statistics, to get more confidence about what we're doing when we evaluate performance, especially for databases.

Before we start, we have to establish a mental model: how are we going to think about this problem? Everybody thinks about it differently, of course, and this is just one example, but I find this particular one convenient. I personally prefer to think about benchmarking and performance evaluation as a problem of evaluating a complex system.
The database, in this case, is essentially something that looks like a complex system to us, and we lay this system out in a space that in mathematics is usually called a phase space, or in statistics a design space. It is a sort of graphical representation of your system, where every single point represents some state of the system. On the slide you can see an example of a phase trajectory for the Lorenz attractor, taken from an interesting scientific paper. It produces a very interesting shape; you can project it onto one of the dimensions and get something that looks like a performance metric you might get from your database, but pulled together the trajectories produce this very nice picture in 3D space. That's what I'm talking about.

Now, the problem is that it's not so easy: databases are extremely complicated, so there is a huge number of dimensions. Obviously database parameters: every single parameter you change changes your system's behavior. We are also constrained to a certain degree by hardware resources, which means that the amount of available memory, the number of CPU cores and so on are also part of our phase space. Interestingly enough, the workload itself is part of the phase space as well, and intuitively you can see why: if you run one query per minute, you will get a quite different result than if you run millions of queries per second. And finally, the performance results are part of the model too; they are what we actually get out of the system, for example query latency or throughput.

Since there are a lot of dimensions, what usually happens when we benchmark something is that we fix almost all of them. We declare them not interesting for whatever we're testing, and we evaluate how the database performs within some small subset of those dimensions, literally a couple of them. In this case, we evaluate how Postgres performs in terms of query latency against the shared buffers we allow the database to use and the query rate we apply. This is not real data, just something you could theoretically expect: if you give Postgres a reasonable amount of shared buffers, not too much and not too little, you will probably get the best performance; and as you increase the query rate, everything gets slower and takes more resources, so query latency goes up.

But that's still not the full story. Some of those dimensions are regular, deterministic dimensions, like shared buffers: you just set the value and that's how it is. Unfortunately, some dimensions are non-deterministic. Query latency, for example, is not going to be the same over and over, no matter how you run it; there is going to be variation. This is the point where we have to use statistics: we have to model these dimensions as random variables, with a corresponding distribution behind them, corresponding statistics, and so on. A toy sketch of this mental model follows.
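To make the model a bit more tangible, here is a toy sketch in Python. The `Config` class, the knob names and every number in it are made up purely for illustration; the only real point is that the deterministic dimensions are values you set, while latency is something you can only sample from a distribution.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    """One point in the deterministic part of the design space."""
    shared_buffers_mb: int   # a database knob
    query_rate: float        # queries per second applied by the workload

def observe_latency_ms(cfg: Config) -> float:
    """One observation of the stochastic dimension: query latency.

    The effect sizes and the lognormal shape are assumptions for
    illustration; the same Config yields a *distribution*, not a value.
    """
    base = 1.0 + 500.0 / cfg.shared_buffers_mb    # toy caching effect
    load = 1.0 + cfg.query_rate / 1000.0          # toy load effect
    # Latencies are skewed to the right, hence a lognormal sample.
    return base * load * random.lognormvariate(mu=0.0, sigma=0.5)

cfg = Config(shared_buffers_mb=2048, query_rate=500.0)
samples = sorted(observe_latency_ms(cfg) for _ in range(10_000))
print(samples[0], samples[len(samples) // 2], samples[-1])
```

Run it twice and you will get different summaries from the very same configuration point, which is exactly the property we have to deal with statistically.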
So that's pretty much it for the model. To summarize: when we do benchmarking, we are dealing with two factors. When we evaluate the performance of a database, there is a part we already know about, some known details; for example, we know how shared_buffers influences overall performance. And we try to relate those factors in the presence of some unknown factors: for example, we have no idea how work_mem affects the database, or there is some noise in the system we don't know about, or the cloud infrastructure is doing something strange. This duality is very important for benchmarking, and throughout the whole presentation I will try to emphasize it. It comes together with another duality: when we benchmark something and want to understand the results, we always have to combine two skills. We combine general skills, like statistics or experimental design, to address the unknown part; and at the same time we use particular domain knowledge, like knowledge of Postgres internals, to extend the known part and reduce the unknown part. One cannot go without the other.

To emphasize this, the rest of the talk is divided into two categories: first we'll talk about some Postgres specifics, and then about the common part, the unknown things.

So, let's talk about how not to screw up when you benchmark Postgres. How many parameters do you know of that you usually have to configure for a Postgres database when you create and run it? Any ideas, any clues? I think the latest version contains somewhere between 300 and 400 of these global, unified configuration options. Obviously not all of them are important for performance; it doesn't really matter, for example, how you configure authentication. But on the slide I've put, just off the top of my head, some relatively important ones you have to look at when you want to evaluate performance.

Interestingly enough, there is a paper from the OtterTune folks; you've probably heard about them, they apply machine learning algorithms to tune databases, MySQL, Postgres and so on. In their recent paper they mention that of the performance win they can get from machine learning algorithms, the biggest chunk can actually be acquired by configuring only two knobs, and I have put them at the top of the list: shared_buffers and max_wal_size. A small sketch of how you might record such knob values before a run follows.
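As a small practical habit, it helps to record the knob values next to your numbers, so the configuration actually lands in the report. A minimal sketch, assuming a locally reachable Postgres and the psycopg2 driver; the knob list is just the examples from this talk, not an exhaustive set.

```python
import psycopg2

KNOBS = ["shared_buffers", "max_wal_size", "work_mem",
         "checkpoint_timeout", "checkpoint_completion_target"]

with psycopg2.connect("dbname=postgres") as conn:
    with conn.cursor() as cur:
        for knob in KNOBS:
            # pg_settings exposes the live value and its unit, if any.
            cur.execute(
                "SELECT setting, unit FROM pg_settings WHERE name = %s",
                (knob,))
            setting, unit = cur.fetchone()
            print(f"{knob} = {setting} {unit or ''}")
```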
Configuring those two things gives you essentially the biggest chunk of the performance win. But the rest are obviously still important: work_mem, the checkpoint parameters, all the *_flush_after parameters; they matter for fine-tuning your database. And what usually happens if you misconfigure this stuff is that some parameter ends up too low or too high; unfortunately there is always a sweet spot you have to hit, and it's hard to find.

Now, the tricky part is that Postgres relies quite a lot on the operating system, Linux or whatever you run it on. If you really want the best performance, you cannot get away without configuring the operating system as well. If we're talking about Linux, everything that has to do with the file system cache or with IO is something you have to look at. You cannot forget about huge pages anymore on big servers; you probably have to configure the vm.dirty_* settings that control how file system cache flushing works; and the IO scheduler is always interesting to experiment with.

Another important part: quite frequently people simply ignore the noise in the system. They say, it's my real system, the noise is part of the system, and so on. I strongly disagree with this, and I always suggest that when you benchmark something, you try to reduce the noise as much as possible, because even in a real system you would not like your performance to vary that much. Where does the noise come from? The answer depends quite significantly on the workload you're using. If you're running a CPU-bound load, you most likely have to think about pinning to CPUs or NUMA nodes, to minimize context switches and so on. If you're really doing this on bare metal, you have to think about P-states (or the equivalent on AMD) and frequency scaling, that is, the performance-versus-power-consumption policies.

If you go one level higher and you're creating an IO-bound benchmark, a different scale of things starts to play a role. You have to think about the file system you're using: for example, you most likely want to run your benchmark with pre-created WAL segments, because creating files, including WAL segments, is an annoyingly slow operation, thanks to journaling in the file system and so on. And obviously you have to configure your NVMe device so that it is not trimming while you're benchmarking, and so on.

The last part, probably the nastiest one, is cloud infrastructure. Whether we like it or not, most benchmarking nowadays happens in a cloud, which means we always have to keep noisy neighbors in mind and try to isolate our environment as much as possible. At the same time, we also need to understand the quirks of virtualized infrastructure: for example, when the hypervisor pauses your VM. Unfortunately it's not always possible to configure these things away, but at least you need to be aware of them. A small pre-flight sanity check along these lines is sketched below.
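Here is a minimal pre-flight sketch for a Linux box that reads a few of the noise-related settings mentioned above. The paths are standard sysfs and procfs locations; what counts as a good value depends on your machine and workload, so this only surfaces the settings, it does not judge them.

```python
from pathlib import Path

CHECKS = {
    "CPU governor (cpu0)":
        Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"),
    "Transparent huge pages":
        Path("/sys/kernel/mm/transparent_hugepage/enabled"),
    "Dirty page ratio":
        Path("/proc/sys/vm/dirty_ratio"),
}

for label, path in CHECKS.items():
    try:
        # Print the current value so it can go into the benchmark report.
        print(f"{label}: {path.read_text().strip()}")
    except OSError:
        print(f"{label}: not available on this system")
```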
One interesting thing that people also tend to forget, especially those who used to do performance evaluation for stateless applications: how long should you actually run the benchmark? The answer depends quite a lot on what kind of effects you would like to capture. For example, there are important things such as autovacuum, which I put on the slide: if you want to capture its behavior as on a real database, you have to configure it and you have to wait long enough. Sometimes there is even advice to run your benchmark for a whole day. Of course it doesn't always have to be that extreme, but at least you need to understand which factors in your Postgres database you are capturing, whether it's a checkpoint, autovacuum, and such.

Now we're getting to the fun stuff. One sometimes forgotten thing is that the workload generator plays quite an important role as well. People don't always think about this, but the workload generator can be a bottleneck on its own, and pgbench, for example, has its own history of being a little slower than necessary. The very first example I showed, with two different results, was actually obtained simply by running pgbench twice, where the second run was also recording the latency of every query. Even that relatively fast operation was enough to pull down the real throughput numbers from the database.

But that's only part of the story; there is something more. Usually, in a real workload, the queries arriving at the database are relatively independent of the database's responses: they arrive at some rate and they depart at some rate. In queueing theory there is this notion of open and closed systems, maybe you've heard about it already, and the two are quite different in how they behave. A real system is usually an open system, or at least an open-ish one. The thing is that when we run a workload generator like pgbench or something similar, we are usually simulating a closed system, because the generator fires a query, waits for the response, and then fires the next one. So we have a fixed number of queries in flight, and that is by definition a closed system. The difference matters: you will get quite different latencies depending on whether you test an open or a closed system. The tricky part is that almost all of the workload generators I know always simulate a closed system and do not even allow you to configure this, except BenchBase: I have seen that it lets you use a Poisson distribution for query arrivals, which essentially gives you an open system. The little simulation below shows how large the difference can be, even in a toy setting.
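Here is a toy single-server simulation of the gap, assuming exponential service times with a made-up 10 ms mean and open arrivals at 80% of the server's capacity. The closed loop keeps exactly one query in flight; the open variant lets Poisson arrivals pile up independently of responses.

```python
import random

random.seed(42)
SERVICE_MEAN = 0.010          # 10 ms mean service time (assumed)
N_QUERIES = 100_000

def service() -> float:
    return random.expovariate(1.0 / SERVICE_MEAN)

# Closed system, one outstanding query: fire, wait, fire again.
# Latency is just the service time; a queue never forms.
closed = [service() for _ in range(N_QUERIES)]

# Open system: queries arrive in a Poisson stream at 80% of capacity,
# independently of responses, and may queue behind one another.
rate = 0.8 / SERVICE_MEAN
t, free_at, open_lat = 0.0, 0.0, []
for _ in range(N_QUERIES):
    t += random.expovariate(rate)        # next arrival time
    start = max(t, free_at)              # wait if the server is busy
    free_at = start + service()
    open_lat.append(free_at - t)         # queueing delay + service

for name, lat in (("closed", closed), ("open", open_lat)):
    lat = sorted(lat)
    print(f"{name:6s} mean={sum(lat) / len(lat) * 1e3:6.1f} ms  "
          f"p99={lat[int(0.99 * len(lat))] * 1e3:6.1f} ms")
```

With these made-up numbers the closed loop reports roughly the bare service time, while the open system's mean and tail latencies come out several times higher, purely because of queueing.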
Now we're done with the first part, the particulars, so let's go to the unknown details: how to actually use statistics to find factors you have no idea about. The first thing I want to mention is that the problem is not new; industry and science have faced it in many different ways over the last century. Here, for example, I reference an article by "Student" (the pen name of William Sealy Gosset) from more than a hundred years ago, where he essentially wrote that experimentation only gives you value if you understand the statistical distributions behind it. This is the very paper in which the famous Student's t-test was introduced, so you can see the problem has been tackled for quite a long time.

Now, what does that mean for us? This slide is essentially all of the statistics I'm going to show you, or at least most of it, and it's what you usually see when you read performance reports; people often operate with these terms without thinking about them too much. What happens is the following: we say that our values are random variables with certain populations behind them. Those populations have parameters, like a mean value or a deviation. When we benchmark, we sample those populations and acquire some number of observations. The observations in turn have statistics: averages, standard deviations and so on. And based on these numbers, we can get some assurance about our data. For example, if we compare two different data sets, we can use a t-test to verify how different they are. Or we can build a confidence interval for our data: how confident are we that the real population value lies within what we have observed?

You don't have to remember the formulas, obviously; they exist in pretty much any data processing software you might use. But that is actually part of the problem, because you can easily misuse them if you don't really know them, or if you apply them in a context where they are not suitable. A short sketch of both tools, as you would call them in practice, follows; then let me explain what can go wrong.
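A minimal sketch of the two tools with SciPy; the throughput numbers are synthetic stand-ins, one summary value per benchmark run.

```python
import numpy as np
from scipy import stats

baseline = np.array([1052, 1049, 1061, 1040, 1055, 1046, 1058, 1051])
patched  = np.array([1069, 1075, 1062, 1080, 1071, 1066, 1077, 1073])

# Student's t-test: how plausible is it that both samples come from
# populations with the same mean?
t_stat, p_value = stats.ttest_ind(baseline, patched)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# 95% confidence interval for the mean of the baseline sample.
ci = stats.t.interval(0.95, df=len(baseline) - 1,
                      loc=baseline.mean(),
                      scale=stats.sem(baseline))
print(f"baseline mean = {baseline.mean():.1f}, 95% CI = {ci}")
```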
Unfortunately, all those methods were developed for situations where people were doing experiments in the natural sciences: chemistry, biology, physics, whatever. And there, things are usually distributed normally. So there is a significant assumption behind the tooling I just showed; two assumptions, actually. One is normality of the distribution, and the other is what is called independent, identically distributed variables: essentially, that there are a lot of small factors contributing to the noise, and they all contribute in an equal way. On this graph I'm showing a probability distribution, the blue bell curve; that's how a normal distribution looks.

In computer science we have a big problem: almost everything we're going to test is not normally distributed. You see this red line; that's how our data usually looks. It is heavily skewed, with quite a long tail, a significant skew to the right, and intuitively you can even understand why this happens. In a computer system, everything is essentially a queue. There is a baseline you can hit, the best case, and whenever the queue gets congested, you get slowed down. So things can usually only get slower; they cannot suddenly, magically get faster. That's why we have this significant skew. And it means that, strictly speaking, we are not theoretically allowed to use all those methods and tools I was showing before, which is quite unfortunate.

Now, a slightly controversial topic. Although you may have heard plenty of advice saying do not use these things for non-normal distributions, there is a trick. I have put a couple of references on the slide, so that I'm not pulling this controversial topic out of thin air; there is some research behind it. The point is that you can say: I'm not going to assume normality, but I'm going to approximate my data with a normal distribution. You shift your question toward how good the approximation is, but you can at least try to use those methods anyway, and you can still get reasonable results. In the first reference, for example, the authors show some interesting data about how robust various statistics are, or are not, to non-normality. And to show you that this is not just a theoretical thing, the last two references are something you can find in the field. The second one is about the Hunter project, where the authors do performance analysis for their CI and search for change points, and they explicitly modified the algorithm to use Student's t-test, because it was more robust in their case, and they were fine with it; it worked for them. The same goes for ClickHouse: in their benchmarking tools they also have the possibility to apply a t-test-based evaluation when needed. So it's not that you should never, ever do this; it's just something you have to keep in mind and can sometimes try out, while being aware that it does not always work. As a first step, you can at least measure how far from normal your data is, as in the sketch below.
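A minimal sketch of that first step, on synthetic lognormal "latencies" standing in for real measurements: SciPy's normality test plus the sample skewness give you a quick read on how bad the normal approximation would be.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
latencies = rng.lognormal(mean=0.0, sigma=0.8, size=5_000)

# D'Agostino-Pearson test: the null hypothesis is that the data is normal.
stat, p = stats.normaltest(latencies)
print(f"normaltest p = {p:.2e}")   # tiny p => clearly not normal

skew = stats.skew(latencies)
print(f"skewness = {skew:.2f}")    # > 0 means the long tail is on the right
```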
To show you that this is a very practical matter: you can extremely easily end up with a probability distribution that is not normal. On the slide I'm showing the latency distribution from one benchmark I ran literally on my laptop. The amount of memory for Postgres was quite limited there, both for the file system cache and for Postgres itself, and you can see two spikes in the latencies. One is from queries satisfied from the file system cache, or from shared buffers; the other is from when memory was not sufficient and Postgres had to go and literally read something from disk. That's why we see quite a significant difference between those two spikes. So this distribution is not just non-normal; it's even worse than that: it's bimodal. If you compute averages or even medians on it, you will not get reasonable results whatsoever. And the usual advice, repeat your runs and look for a converging sequence, is not going to work here at all.

It's important to verify your benchmarks, so let me show in one slide that this effect really does come from IO. There is a difference of about 150 microseconds between those two spikes, and you can verify it with bpftrace; there is a tool for IO latencies (biolatency, for example). And sure enough, when we capture the IO latency distributions, most of them lie within exactly the window we're looking for. So it is indeed an IO effect, a situation where we have a slow path versus a fast path, which is actually quite frequent in computing; it's very easy to end up in this situation.

Now, I've told you that you're not allowed to use the normal-distribution tools, so what should you do in these situations? Fortunately, to a certain degree things are still straightforward: you just have to use different tools, though not necessarily as a one-to-one replacement. These problems have been investigated for quite a while; it's not the first time people have faced them. And although the replacements are more complicated than the usual tools, they exist in pretty much every data processing framework, SciPy or whatever you use. For example, you can replace the average with the median, which is a bit more robust to non-normality; it's still not going to work for a bimodal distribution, but it's a little better. You probably have to stop thinking about averages and standard deviations and start thinking in quantiles; you've seen that in industry we always talk about percentiles and quantiles, and that's part of the story: we're interested in, say, the 99th percentile because we care about the variance of our data, not only its center. If you would like to identify outliers, you should think about using the interquartile range, because that is essentially the best tool you can get for non-normal distributions. And Student's t-test you can replace with something like the Mann-Whitney U test, which is a little more complicated; that's why I'm showing the SciPy functions for it. The idea is the same: you take two data series and compare them, but now you're doing a non-parametric test, so it's fine that the data is not normally distributed. All of these calls fit in a few lines, as sketched below.
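A minimal sketch of this replacement toolbox, again on synthetic skewed samples: quantiles instead of means, the common 1.5×IQR outlier rule, and the Mann-Whitney U test for comparing two runs.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
run_a = rng.lognormal(0.0, 0.5, 2_000)
run_b = rng.lognormal(0.1, 0.5, 2_000)   # ~10% slower in the median

# Median and quantiles instead of mean and standard deviation.
print("median:", np.median(run_a), " p99:", np.quantile(run_a, 0.99))

# Interquartile-range rule for flagging outliers.
q1, q3 = np.quantile(run_a, [0.25, 0.75])
iqr = q3 - q1
outliers = run_a[(run_a < q1 - 1.5 * iqr) | (run_a > q3 + 1.5 * iqr)]
print("outliers:", len(outliers))

# Mann-Whitney U test instead of Student's t-test: non-parametric,
# so no normality assumption on the two samples.
u_stat, p = stats.mannwhitneyu(run_a, run_b)
print(f"Mann-Whitney p = {p:.2e}")
```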
Now, a couple more details. One interesting question that pops up every now and then about statistics is: how much data do I actually need to collect? How many runs do I need to do? Unfortunately it's hard to answer in the general case, but here I'm quoting an interesting article on this topic, where folks were investigating how much noise they get from cloud infrastructure. They were not benchmarking databases, it was something different, but nevertheless it is essentially pure hardware and infrastructure noise. This variable E with all the numbers around it essentially says how many runs you need to perform so that, with 95% confidence, the value lies within 1% of the average value; in plain English, so that your confidence is good enough. For a coefficient of variation of 0.3%, meaning the variance of the data was very low, a CPU-bound benchmark, they said ten runs is enough. Ten runs is nothing; it means that for stable enough data you're fine if you just put in a little effort and run the thing ten times. But when they ran something with a coefficient of variation of 9%, they already had to run 240 benchmarks. And I think there was a case where they were testing some IO workloads with fio against the devices they were using, and there they even had to run 640 or so. So those are the ballpark numbers you can expect when answering this question. As I mentioned before, it depends very much on your particular situation, but at least this sets your expectations. A practical stopping rule along these lines is sketched below.
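As one practical reading of this, here is a sketch of a generic stopping rule, not the paper's exact procedure: keep adding runs until the 95% confidence interval for the mean is within 1% of the mean. The stand-in "benchmark" is just a random number with roughly a 1% coefficient of variation.

```python
import numpy as np
from scipy import stats

def runs_needed(run_benchmark, rel_width=0.01, confidence=0.95,
                min_runs=5, max_runs=1000):
    """Run the benchmark until the CI half-width is small enough."""
    results = []
    for i in range(max_runs):
        results.append(run_benchmark())
        if i + 1 < min_runs:
            continue
        sample = np.array(results)
        half = (stats.t.ppf((1 + confidence) / 2, df=len(sample) - 1)
                * stats.sem(sample))
        if half <= rel_width * sample.mean():
            return len(sample)
    return max_runs

# Stand-in for a real benchmark run: ~1% coefficient of variation.
rng = np.random.default_rng(2)
fake_run = lambda: rng.normal(1000.0, 10.0)
print("runs needed:", runs_needed(fake_run))
```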
Another very interesting question, which was actually part of the inspiration for this whole talk, because Mark Callaghan once asked it: from the statistical perspective, does it make sense to run several shorter benchmarks or one longer benchmark? Is there any difference? I'm still convinced that most of the time, especially when we're talking about databases, those are two different cases you cannot really compare. But there is a very interesting result from queueing theory you can look at: one longer benchmark and several shorter benchmarks are equivalent if we assume that our system is ergodic. Ergodic here essentially means that history repeats itself at some point: when we have a queue, at some point the sessions in that queue start to repeat themselves. Intuitively this makes sense, because at some point you can just start discarding history, and then it doesn't really matter whether you run a shorter or a longer benchmark. So it's an interesting result, you may essentially replace one with the other, but again, it depends very much on your situation.

And last but not least: I've mentioned a lot of details about how to reduce noise and how to fight it with various statistical approaches, but sometimes there is just noise you have to live with, or autocorrelation you cannot get rid of, something that is simply not under your control. In this case, and it doesn't happen that frequently, but some people do this, you can try randomized testing. The idea is that you have, for example, two database setups you would like to compare, and instead of testing one and then the other, you send queries to one or the other randomly, at the same time. The ClickHouse folks even go a degree further: they run both databases on the same virtual machine while doing this testing. But the essential randomized part is that you just keep switching queries between the setups every now and then. By doing this, you spread the noise evenly across the two tests you're performing, so that overall the noise contributes the same amount of influence to both sides of the comparison.

Now, final thoughts, a couple of takeaways for you to remember. Probably the most important one: benchmarking is not so dry, it's not just about numbers and statistics; benchmarking and exploring are extremely fun, and every time you evaluate something, you learn something. That's really cool. And this idea of known versus unknown, common versus particular, is very important: you cannot get away with doing an evaluation knowing all the statistics but none of the Postgres details, and the other way around, if you know everything about Postgres internals but have no clue about statistics, it's still of no use to you. The last thing, which I haven't mentioned explicitly but you could see throughout the slides, is that statistics is very important as a language: if you use these primitives and these tools, you can convey what you have actually found in a more concise manner, so that everybody can understand you.

So that's pretty much it. I hope you have some questions; I guess we have a couple of minutes. Yeah, please. [Audience question.] I can actually recommend something in this regard; I have a reference somewhere here, where was it? Yeah, this one. There is a talk, well, a paper and a talk as well, from HPC folks, and they're doing even more fancy stuff exactly for networking, because networking is always important in HPC, to distribute the workload and so on. Check this out, it's definitely interesting. Okay, I get it, it's just too early in the morning, so probably you're still asleep. Thanks nevertheless, and have a nice day.