I'd like to introduce Dmitry Trofimov, who's the team lead and a developer on the PyCharm team, and is going to talk about profiling.

Hi. You are brave people who are interested in profiling and aren't afraid of talks marked as advanced. Actually, when I saw this talk marked as advanced in the schedule, I was a bit scared myself. It won't be that hard, I hope.

First, let me briefly introduce myself. My name is Dmitry Trofimov. I work for JetBrains as the team lead and a developer of the PyCharm IDE. My talk won't be about PyCharm directly, but I will use its debugger as a case study for profiling and optimization. If you want to discuss anything about PyCharm, just come to the JetBrains booth in the expo hall to talk with the team. Being involved in the development of PyCharm, I have done a lot of different things, but the runtime aspects of Python, like debugging, profiling and execution, have interested me the most. Today I want to show you how using a statistical profiler can help optimize a program, and that program, as I've said already, will be a Python debugger. I will try to stay at a high level, using the debugger as an example and touching its details only when necessary.

So, let's begin. "The best theory is inspired by practice. The best practice is inspired by theory," said Donald Knuth. I like this saying, because what I'm going to show today is inspired by practice. It was a real problem, and to some extent still is, and the solution that I will show later was also real: it was actually implemented at some point, and if you're interested, you can look into the code. What's also very interesting is that while preparing for this talk, I tried to rationalize things and look at the process that happened in the past from a slightly more theoretical perspective, as if I were doing it again, but in the right way. That actually uncovered some knowledge for me and gave me ideas that I will implement in the future. I hope that you find something interesting in this talk too.
As happens quite often in our software development work, we start with an issue ticket in the bug tracker. The issue says "debugger gets really slowed down", and it provides a code sample. So we see clearly that this issue is about the Python debugger in PyCharm.

The PyCharm debugger is the part of PyCharm that is written in Python. It's the same debugger that is used in the PyDev IDE: it's an open-source project that is maintained by Fabio Zadrozny, the author of PyDev, and also by the PyCharm team. To understand better how the debugger works, I recommend listening to the recording of my talk at EuroPython 2014, called "Python Debugger Uncovered". But for now I will remind you of some basic concepts.

The PyCharm debugger consists of two parts. The part on the IDE side, the visual part, is responsible for the interaction with the user. It communicates with the second part, which lives in the Python process. This second part, the Python part, receives breakpoints and commands via a socket connection and sends back data when needed: the values of variables, stack traces, and notifications about breakpoints. So it's a part of the Python application, with its own threads, I/O and a separate event loop, and it's always running in the background of the process. All of that can lead to some performance overhead.

And the core of the Python debugger is the trace function. It is the window through which the debugger looks at the user code and sees what's happening there. Python provides an API for tracing code: a function called sys.settrace. It takes a trace function as an argument, and that trace function is then executed on every event that happens in the user program — an event like a line execution, a function call, an exception, or a return from a function. There are a lot of checks that the trace function performs.
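As a rough illustration — this is not the actual PyCharm debugger code, just a minimal sketch of the sys.settrace mechanism described above — a trace function that simply records every event the interpreter reports might look like this:

```python
import sys

events = []

def trace(frame, event, arg):
    # Called by the interpreter on 'call', 'line', 'exception'
    # and 'return' events; record the event and its line number.
    events.append((event, frame.f_lineno))
    return trace  # keep tracing this scope

def sample():
    x = 1
    x += 1
    return x

sys.settrace(trace)
sample()
sys.settrace(None)

print([e for e, _ in events])  # ['call', 'line', 'line', 'line', 'return']
```

Note that returning the trace function itself keeps tracing active for the scope — a detail that will matter later.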
For example, it checks whether there is a breakpoint for a given line, and if there is, it generates a suspend event. So I think you've got an idea of what the debugger looks like: there are some threads doing communication with the IDE in the background, and there is a trace function that gets events about executed lines.

Let's go back to the issue ticket. When the code is executed normally, it runs for three seconds. In debug mode without a breakpoint, it runs for 12 seconds. But in debug mode with a breakpoint, it runs for 18 minutes. That's very long. Let's reproduce the issue to check whether it actually exists. We open PyCharm and we have this code. To avoid waiting 18 minutes, we reduce the code snippet. The snippet is just a simple function with one iteration through a range; the only interesting thing is that the range is quite big, and inside it we have an increment. So let's reproduce the issue. We just run it: it was fast. Then we debug it: it was a bit slower, but still fast. And then we place a breakpoint and run again: now it's really slow. So the issue exists.

Let's analyze it. We have three different cases here: normal run, debug without breakpoints, and debug with a breakpoint. And since we can place a breakpoint at different lines, there are actually three more cases: debug with a breakpoint in the function, debug with a breakpoint in the same file but not in that particular function, and debug with a breakpoint in some other file. Testing shows that the last case behaves the same as debug without breakpoints at all — a breakpoint in some other file doesn't affect performance — so we won't look at that case. That leaves us with four different cases, and in the two cases with a breakpoint in the executed file, whether in the function or elsewhere in the file, debugging is slow.
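A reconstruction of the reduced snippet (the exact numbers are hypothetical): a single function whose hot loop is just a large range with an increment.

```python
def main():
    # Hot loop from the issue ticket, reduced: nothing but a large
    # range and an increment, so nearly all time is spent executing
    # line events — exactly what a trace function must process.
    total = 0
    for _ in range(5_000_000):
        total += 1
    return total

if __name__ == "__main__":
    print(main())
```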
"You can't improve what you can't measure," said William Edwards Deming, the famous engineer, statistician, and management consultant. So before we do anything else — profiling or optimization — we should be able to measure the performance of the thing we want to make faster. In our case, the core of the sample code is the iteration, so we use the time module to record how many seconds the iteration took to complete. That will be our simple measurement.

After we apply this measurement to our cases, we see that the two debug-with-breakpoint cases actually run 100 times slower than the normal run, which is a bit sad. But who knows — maybe in this particular case, with this example, it's not possible to do any better. We need to compare this with something: some program that does the same thing and has more or less the same functionality. We choose pdb for that. Although it is less functional than the PyCharm debugger, it is sufficient for our comparison: you can place a breakpoint and pdb will stop at it. It is also written in Python, so it is in the same class — it wouldn't make any sense to compare with something written in C, because that plays in a different league. And pdb is in the standard library, so it sounds natural to take it as a performance standard.

Now we can do some benchmarking. Having taken pdb as the standard, we apply the same measurement to it and compare the results with our debugger, which now becomes the baseline, in benchmarking terms. What we see is that pdb, while a bit faster, still suffers from the same problem: in the cases where a breakpoint is set, the performance drops dramatically. But it is a bit faster — it takes five seconds instead of nine — so we can try to reach its performance. And the first thing we need to do to make the code faster is to find the bottleneck. It doesn't make sense to optimize parts of the code that don't influence the overall performance.
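The measurement itself is plain wall-clock timing with the time module. As a sketch (the loop size here is made up), this is the idea — and it also shows how much even a do-nothing trace function slows such a loop down, which hints at where the debugger's overhead comes from:

```python
import sys
import time

def work(n=200_000):
    # Same shape as the sample code: a loop with an increment.
    total = 0
    for _ in range(n):
        total += 1
    return total

def trace(frame, event, arg):
    return trace  # trace every event, do nothing: pure overhead

start = time.time()
work()
plain = time.time() - start

sys.settrace(trace)
start = time.time()
work()
traced = time.time() - start
sys.settrace(None)

print("plain: %.3fs  traced: %.3fs" % (plain, traced))
```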
And the part that influences the overall performance the most is called the bottleneck. So let's find it, and the best way to do that is profiling. Profiling is a way to look at your code from a different perspective: to find out what calls what, and how long each part took to run. A profile is a set of statistics that describes how often and for how long various parts of your program executed, and a tool that can generate such statistics for a given program is called a profiler.

Let's use a Python profiler. But first we need to choose one, so let's learn about the Python profilers available. If you're looking for a Python profiler, you'll find several of them. The most obvious choices are cProfile, Yappi, and line_profiler. cProfile is part of the Python standard library and is written in C. The Python documentation says about it: cProfile is recommended for most users; it's a C extension with reasonable overhead that makes it suitable for profiling long-running programs. Yappi is almost the same as cProfile, but in addition it's able to profile separate threads. line_profiler is different from the two previous profilers: it provides statistics not about the functions that are executed, but about the lines inside the functions. Also written in C, it has rather high overhead, because it traces every line. As cProfile is the default choice and we don't need the features of Yappi and line_profiler, at least not yet, let's use cProfile.

And we'll do that in PyCharm. For this case our sample code is changed a bit, because we need to use debugging and profiling at the same time, so we set up the debugger from the source code, and we place a breakpoint here. Now we profile it and press continue. The task has started; we wait until it finishes... No, sorry, that is not what I wanted to show. Let's do that one more time. We continue, the task starts, and we wait until it finishes. Yes.
And we look at the call graph. We see a lot of calls here, but if we look closer, we'll see that all of them take zero milliseconds — those are the internal calls of the debugger. And the calls that took most of the time — there are two of them — are user code: our function and the main work. So basically, what we are seeing here is that cProfile didn't show us any useful information. Is our debugger unprofilable? Should we use Yappi or line_profiler, then? Actually, if we do, we'll see that they don't show anything either. So why is that? Why doesn't it work?

To answer this question, we need to learn a bit about how cProfile, Yappi and line_profiler work. cProfile provides deterministic profiling of Python programs. What does deterministic profiling mean? There are two major types of profilers: tracing profilers, also called deterministic profilers, and sampling profilers, also called statistical profilers. Tracing profilers trace the events of the running program. An event can be a function call or the execution of a line — the same as with the trace function in our debugger. The disadvantage of such profilers is that, because they trace all the events, they add significant overhead to the execution.

As with debugging, Python provides an API for profiling. The function responsible for it is called sys.setprofile. It's almost the same as sys.settrace, with the only difference that the function we pass to setprofile isn't called for every line — it's called only for function calls. All these profilers use setprofile or settrace to set up the profiling, and that's why they profile only the user code. Our debugger, which itself uses settrace, turns out to be outside the scope of setprofile. So none of these profilers are applicable in our case. Is our debugger unprofilable, then? Actually, there is another type of profilers.
They are called sampling, or statistical, profilers. Such profilers operate by sampling: a sampling profiler captures the target program's call stack at regular intervals. Sampling profilers are typically less precise and sometimes not very accurate, but they allow the program to run at its full speed. So they have much less overhead, which in some cases actually makes them much more accurate than tracing profilers.

Finding a statistical profiler for Python is not as easy as finding a tracing one, as there is no obvious choice. But if you search long enough, you'll find several statistical Python profilers as well: statprof, Plop, Intel VTune Amplifier, and vmprof. Let's have a closer look at them to choose the one we'll use to profile our debugger.

statprof is a sampling profiler written in pure Python. It's open source. Unfortunately, it doesn't work on Windows, only on Mac and Linux. It works, but it's quite minimal, and the last time it was updated was long ago. Plop, or Python Low-Overhead Profiler, is also written in pure Python — and it's funny, but its overhead is not as low as it could be. It doesn't work on Windows either, and its main page on GitHub says that it's a work in progress and pretty rough around the edges. So not our choice. Intel VTune Amplifier is very accurate and has low overhead, but it is proprietary, not open source. You need to buy a license to use it, which may not be the worst thing, but it isn't suitable in my case because it doesn't work on Mac OS X. And vmprof: vmprof is a lightweight statistical profiler that works for Python 2.7, Python 3, and even PyPy. This profiler was developed by the PyPy team, who presented it a year ago at EuroPython 2015, and since then it has been actively developed and has reached a stable state. It is written in C, so it has really low overhead. It's open source and free.
And it's actually great that it's open source, because that allowed me, for example, to add a line profiling feature to it while preparing for this talk a week ago — which would have been impossible if it weren't open source. So it seems that vmprof is the profiler of our choice.

Let's try to use vmprof to profile our debugger, and we'll do that again in PyCharm. We use another run configuration for that, with the same source code, and we press the profile button. We continue and wait until the main task finishes. Yes. And after it finishes, we see that we have a call tree. That is a nice feature of a sampling profiler: it provides you with a call tree where you can see how your program was executed, with timings. And we see here that most of the time was taken by our trace function — the trace function of our debugger. That is the bottleneck. Our trace function itself is the bottleneck: not everything else, not the threads, not the I/O — the trace function.

So we found the bottleneck. What should we do next? To make our program faster, we need to optimize it. Optimization can occur at a number of levels, and typically the higher levels have greater impact; optimization can proceed via refinement from higher levels to lower ones. At the highest level, the design may be optimized to make the best use of the available resources and the expected usage. The architectural design of a system highly affects its performance. But in our case, we are a bit limited in our design decisions, as we need to comply with the settrace API contract, so this optimization level isn't available to us. Given an overall design, a good choice of efficient algorithms and data structures, and efficient implementations of those algorithms and data structures, comes next. Let's see whether we can make an algorithmic optimization. To find a way to optimize our debugger algorithmically, let's ask ourselves a question.
Why does debugging without breakpoints work so much faster than with a breakpoint in the executed file? If we look into the code, we find that when there are no breakpoints in the current file, the trace function returns None, while if there are any, it returns itself. So in the middle of this function, we get the breakpoints for the file, and if there are none, we just return None. If we refer to the documentation again, we see in the last sentence that the local trace function should return a reference to itself, or to another function, for further tracing of that scope, or None to turn off tracing for that scope. So when we don't have breakpoints for the file, we turn off tracing for the scope altogether — that's why it works so fast.

And why don't we do the same per function, not just per file? We can add a little change: we store the names of the functions where breakpoints are placed, and if a function has no breakpoints, there is no need to trace it — we just return None. If we measure the performance of this optimization, we see that our sample now runs in 110 milliseconds instead of nine seconds, which is a big deal.

Beyond general algorithms and their implementation, concrete source-code-level choices can also make a significant difference, so our next optimization will be at the source level. To make such an optimization effectively, we need to go down to the level of source lines, and for that, line profiling can be useful. But line_profiler won't help us here, as it is implemented via the trace function. Instead, we use a special mode of the vmprof profiler, introduced there recently, which enables capturing line statistics from stack traces. Let's use it and see how it works. We will again run it in PyCharm, using another run configuration with the line profiling mode enabled, and the same source. We press the profile button and continue. After it finishes, we see our trace dispatch function.
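Before moving on: the algorithmic optimization just described — returning None from the trace function when the entered function contains no breakpoints — can be sketched in a toy form (simplified names, not the real pydevd code):

```python
import sys

# Hypothetical registry: names of functions that contain breakpoints.
breakpoint_funcnames = {"traced"}
line_events = {"traced": 0, "untraced": 0}

def trace(frame, event, arg):
    name = frame.f_code.co_name
    if event == "call":
        if name not in breakpoint_funcnames:
            return None  # turn off tracing for this whole scope
        return trace
    if event == "line" and name in line_events:
        line_events[name] += 1
    return trace

def traced():
    for _ in range(3):
        pass

def untraced():
    for _ in range(3):
        pass

sys.settrace(trace)
traced()
untraced()
sys.settrace(None)

print(line_events)  # untraced gets zero line events
```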
And now what we can do is go to the source. In the source, we see a heat map, which shows us which lines took most of the time. And strangely, most of the time was taken by this particular line: 20%, with 330 hits out of nearly 1,500. What that line does is check whether we need to trace this particular thread or not — that's it. And the two lines at the beginning of the function are not related to this line at all, so what we can do is move this line to the beginning of the function. Let's do that: we just put it here. Also, thinking about how to optimize this source, we can remember that getattr is not the optimal way to check whether an object has an attribute, because getattr does a lot of different things. So we can rewrite it... okay, I won't type it live, because my setup doesn't allow me to. We rewrite it this way: we just check whether this attribute, which is used as a marker, is in the dict of the object. And when we check the performance of this, we see that this source-level optimization gained us one second.

There are several low-level kinds of optimization that aren't available for Python. Being interpreted, Python doesn't have build, compile, and assembly phases. Runtime optimization — for example, JIT, that is just-in-time compilation — is possible in Python, but it's currently available only in PyPy and not in CPython. So what to do? Has the optimization reached its limit? Actually, if all high-level optimizations are already done and Python doesn't permit us to go deeper, we need to go beyond Python. Maybe we should rewrite everything in C to improve the performance. But in that case, we would lose compatibility with Python implementations other than CPython: for example, Jython, IronPython, and PyPy would become incompatible.
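As an aside on the getattr change described earlier: a minimal sketch (the marker attribute name here is made up) of the two ways to test for a marker attribute on an instance.

```python
import timeit

class ThreadInfo:
    pass

info = ThreadInfo()
info.do_not_trace = True  # hypothetical marker attribute

# getattr goes through the full attribute lookup machinery:
# instance dict, class hierarchy, descriptors, __getattr__.
slow = timeit.timeit(
    lambda: getattr(info, "do_not_trace", None), number=100_000)

# A direct membership test on the instance dict does much less work
# when the marker, if present, always lives in the instance dict.
fast = timeit.timeit(
    lambda: "do_not_trace" in info.__dict__, number=100_000)

print("getattr: %.4fs  __dict__: %.4fs" % (slow, fast))
```

Note the two are not equivalent in general: the dict test ignores class attributes, which is exactly why it is only safe for a per-instance marker like this one.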
And having two implementations of the debugger, one in Python and one in C, would make adding new features a lot harder. What if we could leave our Python code as it is, but still optimize it a bit at a lower level? A solution exists, and it's called Cython. Cython is a static compiler for Python which gives you the combined power of Python and C. Here is an example of a program written in Cython. Note that it looks exactly like normal Python code, except for the declarations of variables in the second and third lines: these declarations carry type information, which allows the Cython compiler to generate more efficient code. This basically gives us another level of optimization that was inaccessible before, namely compiler optimization.

Let's add Cython type information to our trace function implementation. After we compile our trace function with Cython as a native extension and measure its performance, we learn that it made our debugger more than twice as fast: four seconds instead of nine. Now we can compare all three optimizations combined against the baseline — our initial version of the debugger — and against pdb, our goal. And we see that we have reached the goal, and actually done even better. Yay, happiness! But to temper our happiness, I will say that after we compiled our debugger with Cython, it became native code, which can't be profiled with vmprof anymore. So, ironically, it is unprofilable again. There are still ways to profile it, but we'll leave them out of the scope of this talk today.

As for the issue: we managed to double the performance for the sample code from the issue ticket, and we made it faster than pdb. But in this particular case, it still works slower than a normal run, and maybe it is possible to make it work even faster, given the constraints of the settrace API and so on. So we'll leave that issue open for a while.

Conclusion: use profilers to find bottlenecks in your code.
There are different profilers, and each has its own advantages — learn about them. Start optimizing things from the higher levels and move to the lower ones. And to optimize Python at a lower level, use Cython. So that's all for today. Thank you for listening. Here are links to the vmprof profiler and the debugger, if you are interested in looking into the code. The line profiling feature was added to vmprof only recently, so it's not available in PyCharm yet, but it will be available via a plugin, which I hope to publish this week. So thank you very much.

Thank you very much, Dmitry, for this great talk. So the floor is open for questions. May I ask you to wait for us to give you the microphone, because we are recording everything? Thanks.

Hi there. My biggest issue is memory profiling. Can you help me with that?

Actually, in this particular case, memory profiling wasn't an issue. But if you're interested in memory profiling, I can recommend looking at vmprof, because it supports memory profiling. The only thing it doesn't support yet is profiling of native memory allocations, but that's actually quite a hard problem in Python. So if you have pure Python code, vmprof can profile your memory. And actually, in Python 3.5 there is an API for memory profiling in the standard library — I don't remember exactly what it's called; I think it's tracemalloc. So you can look at that too.

Any more questions? Hi, I'm Mikhail Kowalski, and I wanted to ask: isn't writing the code in Cython somehow also rendering it incompatible with other Python implementations?

Yes, that's a great question, by the way. Yes, it does: if you just add a cdef into your Python source, it won't be compatible anymore. But what you can do — and what we did in the PyCharm debugger — is make these Cython optimizations optional. The only change you need to make in your Python source for it to be Cython-compilable is to add these cdef definitions at the beginning. So we used a little template language.
So in our source, these cdef definitions are commented out, and the source runs as normal Python. But to build the Cython extension, we uncomment those lines, and the source becomes Cython-compilable. I can show you — it's better to see it than to describe it. Here we have this small custom template language, and it says: if it is Cython, then we have this header; if it's not Cython, then it's normal Python. So this source works for all Python implementations, and when we need to compile it, we do it with a setup.py that uncomments these lines in the Cython case. Yeah.

Any more questions? Well, if not, please join me in thanking Dmitry again. Thank you very much.