Hi. My name is Victor Stinner. I work for Red Hat, and I have been a Python core developer for seven years now, blah, blah, blah, but I'm not here to talk about me. I'm here to talk about a very serious issue, which is benchmarking.

To explain my issue: my story starts with a small optimization proposed two or three years ago, to optimize a very simple bytecode instruction, just adding int to int, like 1 + 1, 1 + 2, etc. Different people proposed patches, because there are different ways to implement this optimization, but we were unable to check whether the optimization was faster or slower, and whether it was really worth it, which means: is the optimization fast enough to justify the change? Because we were unable to check if it was faster or not, the next step for this optimization was to run the Grand Unified Python Benchmark Suite, which is a suite of 50 or 60 benchmarks. The issue is that in some cases it was faster and in some cases it was slower. And because there were different authors for the different patches, different people ran the benchmarks, and we got different results depending on the computer. So the question was no longer really whether the patch makes Python faster or not; the question became whether the benchmark suite is stable at all.

To explain the issue: if you have unstable benchmarks, it's likely that you will take a bad decision on an optimization, because as I said, you must make sure that it's really faster and really worth it. So you need a very good tool to check the performance. The goal is very simple: check whether a specific patch makes Python faster, slower, or makes no significant difference. Another point is that the benchmark must be reproducible between multiple runs: even if you reboot, even if you change something, you must get the same results, at least on the same computer. Don't try to get the same result on two different computers, because computers are too complex today to get something really stable across machines.

My talk will use a very famous methodology to analyze benchmarks. It's called the what-the-fuck meter. You can use it in code reviews, for example: for good code, you only hear "what the fuck, what the fuck", only twice; for bad code, you can listen to the review and it's "what the fuck is this shit, dude, what the fuck?"

Okay, so the first issue is the system and noisy applications. To give you a very obvious example, let's say that you have a very tiny microbenchmark measuring the time spent just computing the sum of a very long range. In Python 3, range() doesn't use a lot of memory, it's an iterator, so this benchmark is really CPU bound: the only bottleneck is the CPU. On an idle system, you get one number. But if you simulate a very busy system by spawning a lot of Python processes which run an infinite loop, you can see that the number is much worse. There is a huge difference: it's 1.6 times slower, which is quite significant. The issue is that the system and the application share the same resources, because you only have a few CPUs in your computer, the memory is shared by everyone, and the storage is also a shared resource. So if you have a noisy application using the CPU, it will have an impact on your benchmark.
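To make this concrete, here is a minimal sketch of that experiment in plain Python: it times the CPU-bound microbenchmark once on an idle system, then again while busy-looping child processes are fighting for the CPUs. The exact numbers will of course depend on your machine.

    import os
    import subprocess
    import sys
    import timeit

    def bench():
        # The CPU-bound micro-benchmark: sum over a large range.
        return timeit.timeit("sum(range(10**7))", number=10)

    print("idle system:", bench())

    # Simulate a busy system: one busy-looping Python process per CPU.
    noise = [subprocess.Popen([sys.executable, "-c", "while True: pass"])
             for _ in range(os.cpu_count() or 2)]
    try:
        print("busy system:", bench())
    finally:
        for proc in noise:
            proc.kill()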
But the Linux kernel has a very useful feature called CPU isolation. You add isolcpus=<cpu list> to the kernel command line, and Linux will not schedule processes on those CPUs. Later, to run the benchmark on one of those specific CPUs, you can pin the process using the taskset command: you pass the CPU list, or in this case a single CPU number, and then your script. Coming back to my benchmark, you can see that you now get exactly the same timing, which is very impressive to me, because as you saw before, the difference was quite huge. With CPU isolation, even if the system is very, very, very busy, you are able to run benchmarks and it's still very stable.

You can use even more advanced features of the Linux kernel. Another feature is called nohz_full. It's something coming from the Linux real-time team. The idea is to be able to disable interrupts. An interrupt is something at the hardware level that interrupts the CPU: it can occur at any time and it steals a little bit of time from your program, so it has an impact on benchmarks. Using this feature, if there is only zero or one process running on the CPU, the timer interrupts are disabled. Another feature is called rcu_nocbs. I don't know the details, but the idea is that the kernel will not run RCU callbacks on that CPU. So you are sure that you are alone on the CPU and you can do whatever you want: you will get very stable results.
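If you want to check whether this tuning is actually active, recent Linux kernels expose it through sysfs. Here is a small sketch, assuming a kernel built with these features; the files simply show an empty list if isolcpus= and nohz_full= were not passed on the kernel command line.

    # Show which CPUs are isolated and which have the timer tick disabled.
    for name in ("isolated", "nohz_full"):
        path = "/sys/devices/system/cpu/" + name
        try:
            with open(path) as f:
                print(name, "=", f.read().strip() or "(none)")
        except FileNotFoundError:
            print(name, "= not supported by this kernel")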
So with this tuning, I was able to run benchmarks on an experimental change that I wrote in April of last year. I changed how Python calls functions, because in Python 3.4 and 3.5 you have to create a temporary tuple to call a function, and creating and destroying that tuple has a cost. In my experiment, I saw that calling built-in functions was much faster, between 20 and 50% faster, which is quite significant in my opinion. But I did not understand why, on some benchmarks, it was slower. Okay, maybe not everything gets 50% faster, but I didn't understand how an optimization like that could make Python slower. So that was my first issue.

My patch was quite big, something like 20,000 lines of code. So what I did was remove changes, to reduce the patch to something very simple. After a few days, I had simplified it down to a patch adding just two functions which are not used: they are never called, they are just added to the C code. And it was slower. What the fuck? In fact, it was even stranger: with the patch adding the two new functions it was slower, but a colleague asked me to simplify my patch even more, to reduce the functions to an empty body, functions that don't do anything, and it became faster. I see some of you laughing: you must be new here.

The issue, in fact, is known by a few engineers, especially engineers working at a very low level on CPUs. It's called code placement. It's a little bit tricky because it's very, very low level, but the idea is that the memory layout, the function addresses, have an impact on the CPU caches and how you use them. And for very tiny benchmarks, like the one I showed previously, the CPU caches have a huge impact on performance, so you must be careful about how the caches are used. The problem is that it's very hard to control the code placement from your C code, because it depends on the order of the functions, it depends on the compiler options, it depends on the order in the build process, it depends on too many things. So it's very difficult to take care of that. To give you an idea of the impact on performance: in the worst case, I ran a benchmark that just calls a method in Python. It was stable for one year, and then just once, one day, it became 60 percent slower, which is quite huge. It's a huge spike, and for me it was very impressive to see it, especially because the next run of the benchmark came back to the reference performance.

The best solution, in my opinion, to solve this issue is called Profile Guided Optimization, or PGO. You just compile Python using ./configure --enable-optimizations. The option itself is new, but you get it on all Python versions. The idea is that Python is compiled in three steps. First, you compile Python, and the compiler adds new code to instrument the binary and collect statistics. Second, you run the code, in our case the Python test suite, and it collects statistics on branches, like if/else, loops and conditional jumps. It also collects statistics on the code paths, to check how many times a function is called and how much time you spend in a specific function. Third, using all these statistics, the compiler is able to generate much more efficient code. For example, hot code is moved to a specific section, which is much faster, and the compiler can reorder if/else blocks depending on which branch is most likely. Using that, the benchmark becomes much faster. Much more stable, sorry.

Okay, at this point I expected it to be really, really stable and to not have any issue anymore. But I found something new. In Python 3, it was decided to randomize the hash function by default. It was to fix a security issue: you were able to inject specially crafted keys through HTTP headers, and with those keys you were able to bring a server down, so, a DoS. Because of the randomization, if you run Python multiple times, you get different timings. To see the effect of the randomization, you can set an environment variable called PYTHONHASHSEED, and you can see that depending on the value of the hash seed, you get different timings: in some cases it's slower, in some cases it's faster. The reason is that in Python we use a dictionary for variables, and the dictionary is implemented with a hash table. So, depending on the number of collisions, you may need only a single iteration of the lookup loop to find the variable, or in some cases two or three, and depending on the number of iterations, the performance is not the same. To fix this issue, it's quite simple: instead of using randomization, you can just specify one value, or you can even use the value zero, which disables the randomization.

But that was not enough, in fact, because after more and more tests I realized that other things which were not expected to have an impact on performance do have one. For example, if you add environment variables which are not used by your application, variables which have no value and are not used by anyone, they still change the performance. Even the current working directory has an impact on performance. Also, if you add new command line options, again ones which are not used, all these things have an impact on performance. What the fuck? So the first idea for all these issues is to disable the randomization in Linux, which is called address space layout randomization (ASLR); you can disable the randomization of the Python hash function; you can try to always use the same working directory; you can try to always use the same environment variables. I tried to do that, but in my opinion it's a lost cause, because there are too many things which have an impact on the performance. It's just not possible to get a 100% reproducible environment.
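To see for yourself how much the hash seed alone moves a tiny benchmark, here is a small sketch that runs the same timeit command in fresh child processes with different PYTHONHASHSEED values; the statement being timed is just an arbitrary example, since any global name lookup goes through a dictionary.

    import os
    import subprocess
    import sys

    # Run the same micro-benchmark in a fresh process for each hash seed.
    for seed in ("0", "1", "2", "3"):
        env = dict(os.environ, PYTHONHASHSEED=seed)
        cmd = [sys.executable, "-m", "timeit", "str(2 ** 31)"]
        out = subprocess.run(cmd, env=env, capture_output=True, text=True)
        print("PYTHONHASHSEED=%s: %s" % (seed, out.stdout.strip()))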
In fact, if you did some math, some mathematics, or especially statistics, you may know that if you have random noise on a measurement, a solution for the random noise is to compute the average: by using the average, you reduce the noise, or even remove it completely. And another issue in my benchmarks was that the timeit module of the Python standard library uses the minimum. If you use the minimum, you expect to always get the same single value, but in practice each run gives you a different value, which may be bigger or smaller. It's not a good idea to use the minimum, because of all these factors. The best solution, in my experience, is to compute the average over multiple samples. By samples, I mean running the benchmark multiple times in different processes, because as I showed, each time you spawn a new process you get a new address space layout, a new randomized hash seed, et cetera.

So to make things simpler, I wrote a new module called perf. The idea is quite simple: it's a module that spawns your benchmark multiple times in different processes, and it does the tricky benchmarking things for you. For example, it always runs the benchmark a first time to warm it up, because it's very common that the very first sample of each process is much slower. It's called a warmup because usually you have caches: CPU caches, caches in Python, caches in the kernel, caches everywhere. So the first run is used to warm everything up, and the next ones will be much more stable. And when you have all the samples, perf computes the average for you, but also something very, very useful called the standard deviation. The standard deviation gives you a range in which most of the values fall; it gives you an idea of whether your benchmark is stable or not.
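The core of that approach fits in a few lines of standard-library Python. This is only a sketch of the principle (the perf module adds warmup, calibration and metadata on top): spawn the benchmark in fresh processes so each one gets its own address space layout and hash seed, then aggregate with the mean and standard deviation.

    import statistics
    import subprocess
    import sys

    child_code = ("import timeit; "
                  "print(timeit.timeit('sum(range(10**6))', number=20))")

    # One sample per process: each child gets a fresh ASLR layout and hash seed.
    samples = []
    for _ in range(10):
        out = subprocess.check_output([sys.executable, "-c", child_code], text=True)
        samples.append(float(out))

    print("mean  : %.4f s" % statistics.mean(samples))
    print("stdev : %.4f s" % statistics.stdev(samples))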
Okay, so everything was very, very stable for days, until the next drama. Suddenly, a benchmark became 20% faster. What the fuck? To understand this new issue, you have to know that today, in fact for more than 10 years now, the frequency of Intel CPUs is no longer fixed, it is no longer a single value. The frequency of the CPU changes all the time. It's something fully automatic; you don't control exactly how it works. It depends on many things: for example, the workload on each CPU, the CPU temperature, and also the number of active CPU cores, and I didn't know that last one.

Maybe you have already seen this button. It's called the turbo button. I had one on my PC tower 20 years ago, and at that time you had to push the button to make your applications faster. I was always impressed by this button, because I didn't expect that a button could make my applications faster. You have to know that it still exists in modern CPUs, but it's not an explicit button anymore: you don't have to press it, it's fully automated, because today the most important thing for a CPU is not only pure speed, it's also energy efficiency. You have ARM CPUs which are widely used in embedded devices because their power consumption is low, and Intel has developed a lot of technologies to adjust the CPU speed so you always get the best performance without using too much energy, too much power. The idea of Turbo Boost is that depending on the number of active cores, you don't get the same CPU speed. For example, on my laptop I have two physical cores, or four logical cores with hyper-threading. If I have between two and four active cores, it runs around 3.4 GHz, but if I have only one active CPU core, it becomes 5% faster.

And there is a direct link between the CPU frequency and the CPU speed: if you toggle Turbo Boost, for example, the impact on performance is very obvious. To see the different speed levels of your CPU, you can use the cpupower frequency-info command, which gives you this kind of information. On a desktop computer, for example, you have more levels depending on the number of active cores, because usually you have more cores. For benchmarks, the best thing is simply to disable the feature. You can disable it in the BIOS, but I don't like doing that, because my idea is to run benchmarks on my desktop computer and to be able to keep working on the same computer: I don't want to have three or four computers just for benchmarks, I like the idea of doing everything on the same PC. So I prefer to disable Turbo Boost only during the benchmark, and you can do that on Linux by writing 1 into a sysfs file. If you do that, you are sure that you will never get that speed spike that depends on the number of active cores.
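For reference, here is roughly what that write looks like from Python. The path below assumes the intel_pstate driver and requires root; on machines using the generic acpi-cpufreq driver, the equivalent knob is /sys/devices/system/cpu/cpufreq/boost instead.

    # Disable Turbo Boost while benchmarking (intel_pstate driver, run as root).
    # Write "0" back into the same file afterwards to re-enable it.
    NO_TURBO = "/sys/devices/system/cpu/intel_pstate/no_turbo"

    with open(NO_TURBO, "w") as f:
        f.write("1")

    with open(NO_TURBO) as f:
        print("no_turbo =", f.read().strip())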
And now I ran benchmarks for days, even for weeks. Everything was super stable. Mission accomplished. Yeah, we did it. But one specific afternoon, after one week of benchmarking on the same benchmark, everything was stable, everything was fine, so I stopped working, I closed my desktop session, and the benchmark became twice as fast. What the fuck? The nightmare never ends. Okay. Try to breathe slowly, Victor. Everything is fine.

So, for the system and noisy applications, the solution I found was to isolate CPUs. For dead code and code placement, I used PGO. For ASLR, the Python hash function, environment variables and the command line, I decided to use the average and the standard deviation. For Turbo Boost, I just disabled the feature. So what next? I even tried something else, please don't do that at home: I used a sheet of paper to block the fan of my CPU. I wanted to check if maybe the CPU temperature has an impact on the performance. No. Modern Intel CPUs are very, very efficient. Even if the CPU temperature reaches 100 degrees, which is quite hot (that temperature usually means water boils, so it's very, very hot, don't try to touch the CPU), the CPU only becomes a little bit slower. There is something in the hardware that decreases the speed depending on the temperature, but very, very slowly. So I didn't think it was the CPU temperature, because usually when I'm just using GNOME it's not like my CPU is burning, it's almost idle. It was something much closer to the hardware.

I will not explain the whole story, because it took me weeks to understand it and it's very, very complex. But if you recall the earlier slide, I used a very advanced feature called nohz_full, which is able to disable this kind of interrupt. On modern CPUs you have two drivers, intel_pstate and intel_idle, which are CPU drivers that use a callback in the Linux scheduler. They are supposed to be called regularly to update things called P-states (it's not exactly a power state or a performance state, it's something different, but it has an impact on performance), and intel_idle is the driver that controls the idle states, because you have P-states and idle states. I don't understand everything, it's quite complex. But what I understood is that if you don't interrupt your CPU, you don't get the scheduler interrupt. The scheduler interrupt, called LOC in /proc/interrupts, is used to wake up the scheduler, often something like 100 times per second, and when the scheduler wakes up, those drivers are called to update the power state of the CPU. If you combine all these things, the power state of the isolated CPU no longer depends on the workload of that CPU, but on the workload of the other CPUs. So depending on whether I am using my computer, whether my other CPUs are doing something else, the isolated CPU will be slower or faster.

I sent emails to the Intel engineers who maintain these drivers, and they just told me that they had never tried this feature of disabling interrupts. I also discussed it with the Linux real-time developers, and they told me that it's not a bug, it's a feature. Okay. So you have two different ways to fix the issue. The first is to not use the feature, to never disable the interrupts. That's what I decided to do, because I think nohz_full is too complicated to use: the risk of having issues is too high. But if you really want to use it, you have another option, which is to use a fixed CPU frequency, because by default the CPU frequency can move between one and three gigahertz, which has a huge impact on performance. So in my tool, I also added an option to pin the CPU frequency.

So, just three points to summarize all of this, the takeaways. First, you must tune your system to run benchmarks. Don't try to do all the steps manually: you can just use my new module. It has a command to tune your system, which means it disables Turbo Boost, it disables Linux perf events (something else which also has an impact on performance), it sets the CPU frequency, and it is able to pin interrupts to specific CPUs. It does a lot of steps for you, and at the end you get a system prepared for benchmarking. Second, stop using timeit. I tried to fix the timeit module, but for backward compatibility and other reasons it's not possible to change it, and this module has a lot of flaws: it uses the minimum, which is not a good idea, it only runs the benchmark three times, and it only uses a single process. So you must not use it, or the result will be just pure noise. Instead of typing -m timeit, just write -m perf timeit, and you're done. Third, if you want something more advanced than timeit, you can use the perf module, which has an API to write your own benchmarks; usually it takes something like three lines to write a benchmark, so it's quite simple. But please also look at the documentation, because in the documentation I also explain what the common issues and the traps are. So just be careful and read the documentation, it's safer.
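To give an idea of what those "three lines" look like, here is a minimal benchmark script written against the perf API; this is a sketch based on its documented Runner interface, and the module has since been renamed to pyperf, where the Runner class has the same shape.

    import perf  # "import pyperf" with recent versions of the module

    def busy_loop(n):
        # The code under test: a trivial CPU-bound loop.
        for _ in range(n):
            pass

    # Runner takes care of worker processes, warmup and averaging.
    runner = perf.Runner()
    runner.bench_func("busy_loop", busy_loop, 100_000)

Running the script directly is enough: the runner re-executes it in multiple worker processes and prints the mean with its standard deviation.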
How much time do I have left? Okay. To give you an idea of the performance over time, this is the same picture as before: the spike which was 60% slower because of code placement. I got access to the speed.python.org website, so these results are now online: you can browse results for Python 3.7, Python 2, and other versions. I decided to remove all the old data, recompile Python using PGO, and redo the benchmarks. These results now cover more than one year, and you can see that they are super stable. There are still some spikes, because it's not possible to have something perfect, but at least it's no longer 60 or 70 percent, they are very small spikes. So I think that I reached my goal, because it's very stable. And the good thing is also that you can see that the performance of Python is getting better, because the line is going down, and down means faster.

And just to finish, my favorite benchmark is called Telco. It's a benchmark which reads a long file of data and computes sums over it; it comes from telecommunications billing and was written for the decimal module. You can see that it also became much faster at the end of the year. The end of the line is February, so that was a few days ago.

Just to finish quickly, the perf module has different features. It collects metadata on the benchmark itself, but also on the machine: your CPU speed, the uptime, the Python version, the number of kernel tasks, to check whether something else is running on the same computer or not. It has tools to compare results and to check whether a difference is significant; it uses a complex statistical function that I don't understand, but trust me, it works. You can get the minimum, maximum, mean, median, the number of samples, different things. You can dump all timings, including the warmups, to look very closely at everything, and the perf module checks for you whether the result is stable, so you get a warning telling you if it is stable or not. So thank you.

Yes? A question? Sorry, I didn't understand you. Come on. Yes? So, in physics, if you measure some variable and you don't know how it is distributed, you usually assume that it is normally distributed and use all the math based on that assumption, right? Have you tried to plot the statistical distribution of your benchmark results? Okay, the question is about analyzing the distribution of the samples. In the perf module, you have a very simple histogram tool: you can render a histogram of the samples, so you can at least see the shape of the distribution. Yeah, it should be normal. Usually you expect a normal distribution, but in some cases it can be bimodal, when you have two centers; it can have a different shape. But the idea is that the perf module gives you access to everything, and you are free to analyze it later.

Okay, so is it possible to export all the measurements from your perf package so you can analyze them? All the results of the perf module are stored in a JSON file, and you get every individual timing, you get all the numbers, so you can do whatever you want with them.

Last question? Wait, I propose we discuss outside, if possible. Okay. Thank you a lot. Thank you a lot, Victor.
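As a follow-up to that last question about exporting the measurements, a result file can be loaded back and inspected from Python roughly like this; the sketch uses the names from the pyperf successor of the module, and older perf releases spell some of these methods slightly differently.

    import pyperf

    # Load a result file produced with, for example:
    #   python -m pyperf timeit "sum(range(10**6))" -o bench.json
    bench = pyperf.Benchmark.load("bench.json")

    values = bench.get_values()   # every individual timing, warmups excluded
    print("samples:", len(values))
    print("mean   : %.6f s" % bench.mean())
    print("stdev  : %.6f s" % bench.stdev())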