The background is that, as was already mentioned in the introduction, I'm doing quality assurance at SAP, for one of SAP's major products, SAP HANA. It's an in-memory database and it basically powers the various enterprise applications you can find in SAP's portfolio. What we are doing is testing each commit coming into the source code, and at our scale that means we are testing around 800 commits every day. For that, we operate a small infrastructure of physical hardware with around 1,600 machines. And as we are testing an in-memory database, we need a huge amount of memory: overall, we are currently using around 610 terabytes of memory across our landscape.

The problem is, if you have such an infrastructure and such load characteristics, you need optimized services and tools to handle the incoming load. That is the main part of what my colleagues and I are doing: we develop tools and services which are optimized for our workload. One of these tools is our own task execution framework, which we put on top of Apache Mesos. Apache Mesos is something similar to Kubernetes, but a bit more low level. What it provides us is a kind of interface to the resources of machines. At the bottom you can see that we have multiple data centers with our own physical hardware, but also various cloud providers with cloud instances. Every time an instance has some resources available, it will send them via Apache Mesos to our own task scheduler, and our task scheduler then has to decide which task it should schedule on these available resources.

Now, the problem is that if you add more and more machines, this also gets more complicated. In particular, our task scheduler receives more and more incoming offers, together with more and more events about changing task states: a task finished, so you now have an opportunity to launch new tasks, or a task failed, so you have to reschedule it, and you have to handle all these events. At some point we hit a bottleneck: our scheduling system was no longer able to handle all the incoming events and use all the resources in an efficient way. And you can imagine that with such an amount of hardware, you would really like to use it as efficiently as possible.

You can also see that around our task scheduler there are various other services and databases. The task scheduler is constantly interacting with all these services and doing a lot of operations. The initial design of the task scheduler is basically a big Python application with various threads which handle and process the incoming data and find the best way to schedule a certain task on a certain available resource offer. We also have other threads in the process which are required for our observability stack. For example, we have a thread which is responsible for transmitting all the exception data to Sentry, and since distributed tracing is in place for our applications, we also have a thread which is responsible for sending that data over to our Jaeger instance.

Now we have the problem that we no longer get the best performance, and resource utilization goes down. The nice thing about our observability stack is that we can now actually inspect each thread and each part of the system and find out where the bottleneck is.
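To make the offer-based model a bit more concrete, here is a minimal, purely hypothetical sketch of such an offer-handling thread. The names (offers, pending_tasks, launch) and the offer format are illustrative only; this is not our actual scheduler and not the Mesos API.

```python
import queue
import threading
import time

# Hypothetical, heavily simplified stand-ins; not the real scheduler or Mesos API.
offers = queue.Queue()         # resource offers arriving from the cluster
pending_tasks = queue.Queue()  # tasks waiting to be scheduled

def launch(task, offer):
    print(f"launching {task['name']} on {offer['host']}")

def offer_handler():
    """Consume offers and launch a pending task if it fits the offer."""
    while True:
        offer = offers.get()   # e.g. {"host": "...", "cpus": 8, "mem_gb": 512}
        try:
            task = pending_tasks.get_nowait()
        except queue.Empty:
            continue           # nothing to schedule, effectively decline the offer
        if task["cpus"] <= offer["cpus"] and task["mem_gb"] <= offer["mem_gb"]:
            launch(task, offer)
        else:
            pending_tasks.put(task)   # does not fit, keep it queued

threading.Thread(target=offer_handler, daemon=True).start()
pending_tasks.put({"name": "hana-commit-test", "cpus": 16, "mem_gb": 256})
offers.put({"host": "node-0042", "cpus": 32, "mem_gb": 512})
time.sleep(0.1)   # give the handler a moment before the sketch exits
```

In the real system, more and more machines mean more and more offers and task-state events arriving at this kind of loop, which is exactly where the bottleneck described above shows up.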
And we applied the same method to find out why we cannot utilize all the available resources. For that, we have to take a look inside the resource offer handling thread. What you see here is a schematic visualization from the distributed tracing system we are currently using, and we can see the time required for certain operations. For example, we see how long we need to select an offer for a task, or how long we need to prepare the task so that it can be scheduled.

The first strange thing is that the same function has very different runtimes. For selecting an offer, for example, the runtimes vary from 700 milliseconds down to 30 milliseconds, which is quite strange. The next interesting thing is that we also have increased latency in a way that is not expected. What you see in the highlighted boxes is, first, the span which is captured on the scheduler level: we measure that we need around 200 milliseconds for this API call. The service which we are calling also transmits its data into the same tracing system, so we know that the service itself only took around 30 milliseconds to process the API request. That is also a bit strange, and we would like to investigate it. The last thing which is strange in this capture is that there are gaps between operations. If you take a look into the code, there are actually no gaps: inside the prepare-task operation there are just two sub-operations, these two API calls, and there is nothing in between. So why is there a latency of multiple milliseconds in between?

Then we started to assume: we are using threads, so it must be the global interpreter lock. We are hitting GIL contention, and that is our bottleneck. But that's no problem, right? You just open a browser, do some research, and you will find a lot of ways to mitigate GIL contention. We could replace all the multi-threading with multiprocessing or asyncio, or we could pinpoint a certain CPU-bound function and migrate it to Cython, which can actually release the GIL in certain sections. Or let's rewrite everything in a faster language. But if that were the solution, I would probably not be talking about the global interpreter lock now.

In general, such rewrites or major refactorings are super expensive. We are talking about a productive system which powers a huge amount of workload, and we would actually like to invest our time in new features and in making it even more efficient. Handling such performance problems is also important, but we have to use the right approach to actually solve them. Therefore, we took one step back and decided: let's first analyze the problem and find out whether it actually is GIL contention. Once we have confirmed that, we can decide much better which mitigation is the best one for our application.

So let's take a look at the GIL. It's probably easy, right? You just import a module, for example sys, and then you ask the Python interpreter how things are going with the GIL. Sadly, that is not true. There is no simple function you can use to get GIL statistics, which means we have to think about something else.
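To illustrate the point: the closest thing CPython's standard library gives you is the switch interval, not any contention statistics. A minimal check, nothing project-specific assumed:

```python
import sys

# The standard library exposes the GIL's switch interval, i.e. how long a
# thread may keep the GIL before a waiting thread requests it back.
print(sys.getswitchinterval())   # 0.005 seconds by default on CPython 3.x

# But there is nothing like a contention counter or a wait-time statistic.
# A call such as sys.get_gil_stats() simply does not exist; this would fail:
# sys.get_gil_stats()
```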
Then I thought a bit about what I would actually like to know about the GIL, and it is several things. First, I would like to know the basic, fundamental metrics of a lock, because in the end the global interpreter lock is a lock. So I would like to know how long a thread waits for the lock, and how long a certain thread holds the lock. And we all know that bare numbers alone are often not that useful, so I would also like some additional context: which thread is it, which Python function is actually suffering from the contention, and is it maybe possible to get a trace ID or a request ID, because then I can correlate the data with other systems. I would also like to use all of this in our productive environment, because I cannot reproduce the problem on my local machine; my local machine is not connected to our cluster with hundreds of machines. And in the best case, it would be integrated with our existing observability stack, so that my colleagues and I don't have to learn yet another tool.

With that wish list, I went through the internet and looked at the available tools: how can I analyze the GIL? I'm probably not the first person thinking about this problem, and there is indeed some related work. There is, for example, a super interesting talk by David Beazley from PyCon 2010, which is actually still quite up to date because it already covers the new GIL implementation that we have had since Python 3.2. In his talk, he explains how he measured GIL contention and stored the data so that he was able to generate a really nice graph. The problem is that his instrumentation only works with Python 2.6, and there are some other issues which make it not very usable in our productive environment. The main problem: you have to shut down the interpreter at the end to dump all the data and then generate the visualization from it. I really don't want to shut down our scheduler, because every minute it is not running, we are basically losing resources.

The next thing that came to my mind is the thread concurrency visualization in PyCharm, which is a very nice tool to visualize lock contention in your application. The problem is that PyCharm doesn't take the GIL into account. In this example, I ran an application which is heavily GIL bound, because it basically just burns CPU cycles all the time, but you don't see anything.

There is a newer project called gil_load, which is quite interesting because it is a profiler that you can easily integrate into your existing Python application, and during the application runtime it prints out a load number, comparable to the load average of a Linux system, for example. The problem is that in the end you only have a number; you don't know the root cause of the problem. So if I run it on an application that is heavily GIL bound and always runs into the same contention, I just get the information "yes, you have a problem", but no other information about how to solve it.

A very promising approach is py-spy, a newer profiler for Python written in Rust. It is very fast, you can easily attach it to a running application, and you get a nice overview of which functions take what amount of time, and so on.
It also includes the GIL utilization, which is quite nice, but again, you don't get a breakdown to find out what exactly the problem is. The truth is: there is no magic GIL contention analytics tool. And that means we probably have to build one ourselves. Okay, let's create a tool which actually reveals the GIL, and hopefully it can fulfil all of my wishes for such a tool.

For that, we can use an existing framework as a basis: SystemTap, which is available on Linux machines. It allows you to analyze applications by attaching handlers to certain events, which are emitted by applications or by the Linux kernel itself. You can then do calculations inside these event handlers and print out the measured times, for example. The nice thing is that CPython 3.6 introduced support for SystemTap and DTrace, and there are already some markers, some event emitters, that you can use to analyze your Python application with SystemTap. For example, there are function entry and function return markers, which fire every time the interpreter enters a new Python function or returns from one. I can highly recommend the documentation on this topic, because it is very thorough and really helps to understand the concepts of SystemTap and how you can use it. I also saw in the schedule of this conference that there is another talk about low-level profiling which will also cover SystemTap.

The problem with this approach is that most pre-built Linux packages don't include a Python interpreter that is compiled with DTrace support, so the markers are not there and you cannot use them. Also, there are no markers for the GIL-related areas, where you would actually like to know when a thread acquires the GIL or when a thread drops the global interpreter lock. But especially that last part was not so complicated to implement. This is just one part of the patch which introduces some markers regarding the GIL: every time a thread drops the GIL, we emit an event, and every time a thread tries to claim the GIL, we emit an event as well. As you can see, we can also add arbitrary attributes to these markers, which will then be accessible in SystemTap. In this case I am using the thread ID, so that I have an idea which thread is performing the action.

Now you might ask why I don't use the thread names; every application should give its threads names so that you can actually recognize them. The problem is that at the moment there is sadly no C API to get the thread name without holding the GIL, which is a bit complicated if you would like to measure the time until you actually get the GIL, but you need the GIL to read the thread name. Complicated. But I think that is also something we can solve at some point.

Okay, now we have these markers, and our Python interpreter emits these events at the points in time they are passed. Let's measure the time at those points. We attach probes, so-called event handlers, to these events. For example, we instruct SystemTap to look inside the shared library of the Python interpreter, where the markers are located.
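As a quick way to check whether your interpreter even has the existing static markers compiled in, you can look at the build configuration. A small sketch, assuming a CPython build; WITH_DTRACE is the flag that CPython's --with-dtrace configure option records:

```python
import sysconfig

# CPython built with ./configure --with-dtrace records this build flag;
# without it, the function__entry/function__return markers are not compiled in.
if sysconfig.get_config_var("WITH_DTRACE"):
    print("This interpreter was built with DTrace/SystemTap markers.")
else:
    print("No static markers available; a custom interpreter build is needed.")
```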
For example, when we now enter the GIL-claim part, meaning a thread would like to acquire the GIL, we take the current time in nanoseconds and store it in a SystemTap hash map, using the thread identifier as the key. When the GIL is then acquired, we can calculate how long it took until the thread got the GIL, and we store that in another SystemTap data type, an aggregate, which allows us to get statistics out of SystemTap: distributions in the form of a histogram, an average, and so on. We do the same thing for the GIL drop, where we can then calculate how long we actually held the GIL.

Now we are calculating these numbers; what we would actually like is to print them out in some kind of report. For that, we can use the handlers which are invoked at the startup and at the termination of the SystemTap tracing session. When the tracing session stops, we can print out some nice debugging information, like "terminating the tracing", and print a summary of all measured threads with their respective timings.

Let's do an example. In the first example, I have a Python process with two I/O-bound threads. How do I actually make an I/O-bound thread? By simulating the I/O with time.sleep, because it shows basically the same behavior as, for example, a read request on a socket: before the thread goes to sleep for a certain number of seconds, it releases the GIL, and after the sleep is complete, it tries to re-acquire the GIL. The same thing happens on a socket: you want to read something from a socket, you define the number of bytes you would like to read, and as long as there are not enough bytes, or no data at all, available, it will block and wait. But before that, Python releases the GIL for you.

If we now measure this, we can see that we don't have any problem with this application. You can see that the main thread only had to wait a bit more than one millisecond for the global interpreter lock, which is quite nice, especially if you know that this application runs for 15 seconds in this simulation. If we look at the hold times, we also see that the main thread is the thread which holds the GIL most of the time, which also makes sense: it has to import the libraries, the threading library and so on, so most of its time probably goes into initialization. The I/O threads themselves are super lightweight, so they don't have to wait long for the GIL, and when they have the GIL, they don't spend much time in the Python interpreter either. Overall, we held the GIL only 0.2% of the full runtime and we waited less than 0.01% of the full runtime. So basically no GIL contention at all, and in this situation the GIL is not a problem.

We can change that easily by introducing a CPU-bound thread. How do we simulate a CPU-bound thread? It's actually quite simple: we just need an endless loop that does nothing. This while loop with a pass inside does the whole trick for us, and we now have a CPU-bound thread. If we now take a look at the timings, we can actually see GIL contention. We see that the main thread behaves the same as before, same hold time, but increased wait time. Okay, that is probably already a problem.
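To make the two scenarios reproducible, here is a minimal reconstruction of the demo workload as described: two I/O-bound threads built around time.sleep, plus an optional CPU-bound busy loop. This is a sketch of the idea, not the original benchmark script.

```python
import threading
import time

def io_bound(duration=15, pause=0.1):
    # time.sleep releases the GIL before blocking and re-acquires it afterwards,
    # much like a blocking socket read would.
    end = time.monotonic() + duration
    while time.monotonic() < end:
        time.sleep(pause)

def cpu_bound(duration=15):
    # Pure bytecode spinning: this thread never voluntarily gives up the GIL,
    # it only loses it when the interpreter forces a switch.
    end = time.monotonic() + duration
    while time.monotonic() < end:
        pass

threads = [threading.Thread(target=io_bound) for _ in range(2)]
# Uncomment the next line to reproduce the contended scenario:
# threads.append(threading.Thread(target=cpu_bound))
for t in threads:
    t.start()
for t in threads:
    t.join()
```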
The most important problem is that our I/O threads now have a much higher wait time before they actually get the GIL. We are now waiting more than 700 milliseconds just to acquire the GIL, while the actual waiting inside our application is only around 100 milliseconds. The hold time is still the same, and the CPU-bound thread is consuming all the available CPU time and is therefore holding the GIL for nearly the full runtime. Overall, we now see that the GIL was basically always held by at least one thread, and we also had significant wait time.

Even more interesting is the latency for the I/O threads. If you look at the histogram of the I/O threads, you will notice after a while that the latency is quite stable, between four and eight milliseconds. It is interesting that it is so stable, but it also shows one main disadvantage of this GIL contention: it already affects the overall performance of the application by adding around five milliseconds of extra latency to any attempt to acquire the GIL after a blocking I/O operation. That is a real problem in some cases, because normally you run multiple I/O operations during, for example, a single HTTP request.

Why is it so stable? Because of the internal switch interval, which is an implementation detail of the GIL, but it is quite interesting to see it in action. Every time a thread would like to acquire the GIL, it checks whether someone else is holding it. If someone is holding the GIL, it waits on a condition variable and sleeps for up to five milliseconds. If the GIL is still held after that, it sets a drop request so that the thread which is still holding the GIL will please release it and the new thread can take over the global interpreter lock. That is how this additional latency gets added: if you only have Python instructions, you will basically always see this stable latency of around five milliseconds. The problem is that a thread holding the GIL can also hold it much longer than these five milliseconds, because some bytecode operations take longer, or because you are calling into an external C function which doesn't know anything about the GIL and therefore never releases it.

Okay, now we have a tool set which we can use to analyze our application, our productive application. So here is the plan for how we do that: we deploy a new container with our custom instrumented CPython version, including SystemTap, on our cluster, we go to the machine which is running the scheduler, we attach to the process, and we get some nice insights about the GIL contention. In reality, it was a bit different. It was quite easy to deploy the container with the custom CPython version, no big deal. But in the end, we had to install SystemTap on the host, because it is not that easy to get SystemTap running inside a container. That is related to the architecture of SystemTap: it transforms your script, which I showed a few slides ago, into a real kernel module and loads that at runtime. It is then this kernel module that measures all the timings, and at the end the data is printed out. That is possible, and I did do it in our productive environment. It was interesting, but I don't recommend it.
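The switch interval mentioned here is exposed and tunable from Python, so you can watch its effect directly. A small sketch of how you could observe it on a mixed workload like the one above; the exact latency numbers will of course vary:

```python
import sys
import threading
import time

def measure_io_latency(samples=50, pause=0.01):
    """Measure how much longer time.sleep(pause) takes than requested,
    i.e. roughly how long the thread waits to re-acquire the GIL."""
    worst = 0.0
    for _ in range(samples):
        start = time.monotonic()
        time.sleep(pause)
        worst = max(worst, time.monotonic() - start - pause)
    print(f"switch interval {sys.getswitchinterval() * 1000:.1f} ms -> "
          f"worst extra latency {worst * 1000:.1f} ms")

def cpu_hog(stop):
    while not stop.is_set():
        pass   # keeps the GIL busy, like the CPU-bound thread in the demo

stop = threading.Event()
threading.Thread(target=cpu_hog, args=(stop,), daemon=True).start()

measure_io_latency()            # default switch interval: 5 ms
sys.setswitchinterval(0.0005)   # shorter interval: I/O threads wait less,
measure_io_latency()            # at the cost of more frequent GIL hand-offs
stop.set()
```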
Especially if you talk to an IT security person, they will probably not be happy if you start loading custom kernel modules in your productive environment and running extra processes. Your whole isolation is basically gone. Nevertheless, I measured some nice results. Over a two-second observation of our process, I found out that we actually hold the GIL around 88% of the full runtime, sorry, of the measured time frame. In the same time frame, there were so many threads waiting for the GIL that we had an overall wait time on the GIL of nearly 300%, which actually proves: yes, our application really suffers from GIL contention.

That's not great, but it actually raised even more questions. The main question which came up first is: are there threads which hold the GIL longer than five milliseconds, or do all threads give up the GIL quite fast and we are simply hitting the limits of switching between so many threads? That would be one possibility, but let's take a look at whether we actually see that in our infrastructure. If we do see it, I would like to know which function takes so much time and which function does not release the GIL. Another question that came up was: is there maybe some kind of clustering in these measurements? Is it possible to identify clusters for certain operations, certain patterns in the measurements, which are especially intensive for global interpreter lock contention? Overall, the biggest problem was that with 31 threads it was quite hard to read this text report. That is a general problem with so many threads: such tools are often not optimized for that number of threads, and I would not recommend it.

One thing that came to my mind is that timelines are much easier to understand. If you look at these distributed tracing systems, they are actually quite good at visualizing how long a certain operation takes; it is much easier to recognize the duration of an operation from the size of the span in the visual representation. Okay, could we do the same thing for the GIL contention and maybe find out what's going on in our application? The idea is: let's still use SystemTap, because it at least proved that we can collect the data. We collect the data with SystemTap, print out a text file, and then load that into a Jupyter notebook where we can do all kinds of analytics, create some charts, some nice visualizations, all the things we can do with our data science tools. I think that is one of the nicest things about Python nowadays, that we have all these visualization libraries at hand to make data easy to grasp.

What I did is visualize it with Bokeh. This is the timeline visualization of our 31 threads over two minutes. In the areas with dark blue boxes, we have GIL usage where a thread is holding the GIL for longer than 50 milliseconds; please remember, the switch interval is five milliseconds. In the red areas, we have threads which are waiting for the GIL for longer than 50 milliseconds. If you think about optimizing web applications, you normally try to be as fast as possible, and any web developer who says "yeah, 300 milliseconds is totally fine for a web page"... I'm not so sure about that. Okay, let's zoom in a bit and take a closer look.
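A sketch of that notebook step, assuming a hypothetical text/CSV export from the SystemTap script with one row per measured interval; the column names (thread, kind, start_ms, end_ms) are made up for illustration:

```python
import pandas as pd
from bokeh.plotting import figure, show

# Hypothetical export format from the SystemTap script:
# thread,kind,start_ms,end_ms   (kind is "hold" or "wait")
df = pd.read_csv("gil_events.csv")

threads = sorted(df["thread"].unique())
lane = {t: i for i, t in enumerate(threads)}   # one horizontal lane per thread
colors = {"hold": "darkblue", "wait": "red"}

p = figure(height=400, width=900, title="GIL hold/wait intervals per thread",
           x_axis_label="time [ms]", y_axis_label="thread lane")
for kind, group in df.groupby("kind"):
    p.quad(left=group["start_ms"], right=group["end_ms"],
           bottom=[lane[t] + 0.1 for t in group["thread"]],
           top=[lane[t] + 0.9 for t in group["thread"]],
           color=colors[kind], legend_label=kind)
show(p)
```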
So these are now 20 seconds, then five seconds, and now we come to the interesting part: one second. What we saw is that there is actually a cluster of big blue boxes. This part, for example, takes more than 500 milliseconds. For 500 milliseconds this thread is doing all the work; no other thread can do anything. The first question was: what is this thread? It is actually a thread which collects metrics from our scheduler and sends these metrics to a central system, so that we know how many tasks are running, how full our queues currently are, and so on. After adding additional visualization, I found out that this thread was consuming 75% of the total GIL hold time, which is super expensive if this thread is only extracting some metrics from the application.

Okay, so now we know what the problem is. And the nice thing is: now that we know the problem, we can actually fix it. What we did is, first, we started replacing the C extension we were using inside this thread to calculate all these metrics. We found out that this C extension never releases the GIL, because internally it heavily uses Python objects. Normally, when you choose a C extension (and that was the intention when choosing this one), you assume it will release the GIL fairly early on. That is not always true. Probably the simplest fix was to just change the interval of our metrics collection: we changed it from 10 seconds to 120 seconds, so we now collect metrics less often. But if it solves the GIL contention, that's not a big deal.

And that is what happened. Before, we saw a huge amount of CPU time where the GIL was required for processing, and also a huge amount of time where threads were waiting for the global interpreter lock. After applying these simple fixes, these simple changes, it was possible to cut those times roughly in half: we now only hold the GIL around 43% of the time, and we only wait 80% on the GIL. The wait time is still not great, but nevertheless, with these simple fixes we were able to speed up our task execution scheduler so much that we can now actually use all the available resources. And it saved us a huge amount of refactoring: replacing multi-threading with multiprocessing or asyncio, or deciding whether it would maybe be easier to rewrite the full application from scratch.

And that is probably the most important takeaway from performance-related work: please first try to find out whether you actually have the problem and whether it makes sense to invest in that area; with more insight you will probably find an even better solution. And especially now, this will help us find out what the next best evolution step for our application is. Does it make sense to introduce multiprocessing? Or does it make sense, for example, to rewrite some functions which are more CPU-bound than others and move them to Cython, where we can release the GIL?

This whole topic also revealed multiple additional ideas for how we could improve the Python ecosystem in this area. The next thing I would like to do, but as you know, time is always a limit, is to bring this tool set into a more reusable state, so that it is easier to use for people who are not involved in developing these scripts.
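The interval change itself is trivial, but as a sketch of the shape of that fix: the names collect_metrics and push_to_monitoring are purely illustrative, not our actual code.

```python
import threading

METRICS_INTERVAL_SECONDS = 120   # previously 10; collecting less often keeps
                                 # the GIL-heavy metrics code off the hot path

def collect_metrics():
    # placeholder for the expensive, GIL-holding metrics calculation
    return {"running_tasks": 0, "queued_tasks": 0}

def push_to_monitoring(metrics):
    # placeholder for sending the metrics to the central monitoring system
    pass

def metrics_loop(stop_event):
    # wake up once per interval until the scheduler asks us to stop
    while not stop_event.wait(METRICS_INTERVAL_SECONDS):
        push_to_monitoring(collect_metrics())

stop = threading.Event()
threading.Thread(target=metrics_loop, args=(stop,), daemon=True).start()
```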
But I also think it would be possible to extend CPython in some areas to make it even easier to build this kind of tool set. For example, a C API for thread names would be super helpful. Every time I read the SystemTap report, I had to map the thread identifiers to human-understandable names; otherwise you have no chance with 31 threads. Another question I had in mind is: would it be possible to move this metrics collection directly into CPython, like the profiler itself? It doesn't have to be active all the time, but maybe we could enable it on demand when we actually need it, collect the statistics, and get them out. If we had such an API, it would probably also be easier to integrate it into existing observability tools. For example, distributed tracing could then show: yes, this operation took 700 milliseconds, but overall it only held the GIL for 30 milliseconds. At that point you would already know: okay, I probably have a problem with my GIL.

If you are interested in this topic, and I hope there are some interested people because the room is quite full, then please come and find me. I would really like to talk about it and hear some feedback, things that could be improved, or things which would be interesting for you, because with that feedback I think we could actually start some work in this area to make performance measurement in Python much better. And yes, we have the GIL. We will probably never remove the GIL from CPython, because it is not so easy; otherwise people would already have done it. But if we learn to live with the GIL, then we can also improve overall Python application performance, and we don't have to rewrite everything in, I don't know, C++ or C or assembler, if you really like to do that. Okay, then thank you very much. And as we have five minutes, I think we can do one or two questions. Thank you. Thank you.