Right. Next talk, please welcome Julien, who's going to tell us about profiling in Python.

Thank you. Okay. So today I'm going to talk about production profiling for Python. If you don't know me, I'm Julien. I've worked at Datadog for almost a year now, and I've been doing free software stuff for like 20 years. This is the usual forward-looking statement: anything I say today might change in the future, but I have to say that to you. First I'm going to talk about what profiling is and why you need it. Then I'll talk about how we built a profiler at Datadog for production systems, which we use internally and which we're going to provide to our customers in the future.

So the first thing to understand is what profiling is and why you need it. If you've never used profiling, the problem it tries to solve is simple: if I were to ask you which function or which line of code in your program is the most costly, where your program spends the most time, it would probably be pretty hard for you to give me an accurate answer. It's really hard to know, when you run programs in production, which endpoint is going to spend a lot of time on the CPU, or which one line is going to allocate the most memory in your program. You might guess, but the more complex your application is, the more complicated that question is to answer.

The goal of profiling is to actually get inside your program, extract all of this data, and be able to say: this is a stack trace, and when I talk about a stack trace I mean all of your functions being called one after the other, this is what my program is doing, and this is where I need to optimize it to make it faster. Because that's the end goal: to see what your program is doing and to make better decisions about where to spend your time optimizing it. This is typically the kind of output you will see if you profile a Python program, or any kind of program for that matter: you see stack traces, and the width is the metric you care about, like the time spent on CPU, or the memory allocated, or anything like that.

There are two types of profiling. The first one is deterministic profiling: if you've ever used cProfile, the profiler that has been built into Python for a long time, this is what it does. The other type, which is what we're going to talk about today, is statistical profiling, where you do things in a different way, which has its own perks.

Deterministic profiling, if you've never used cProfile or don't know how it works, is pretty easy to understand. It actually hooks itself inside Python; it says to CPython: each time you're going to call and execute a function, tell me, and I will note down the time, and when you finish executing the function, call me again and I will note down the time. Then it computes the difference between the two times and knows that this function ran for that many seconds. That's pretty cool and pretty easy to understand, but it has a few issues. You only know the wall time: you know how many seconds your function took, but you don't know what it did. Was it waiting for I/O, using the CPU, allocating memory? You have no clue; you just know it spent two seconds executing the function.
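To make that mechanism concrete, here's a rough sketch of a deterministic profiler in pure Python using sys.setprofile, which is the same interpreter hook cProfile relies on (cProfile does this in C for speed; this is just an illustration, not its actual implementation):

```python
import sys
import time

totals = {}   # (filename, function) -> cumulative wall seconds
starts = {}   # (filename, function) -> stack of entry timestamps

def hook(frame, event, arg):
    # CPython calls this on every Python function call and return.
    key = (frame.f_code.co_filename, frame.f_code.co_name)
    if event == "call":
        starts.setdefault(key, []).append(time.monotonic())
    elif event == "return" and starts.get(key):
        elapsed = time.monotonic() - starts[key].pop()
        totals[key] = totals.get(key, 0.0) + elapsed

def slow():
    time.sleep(0.1)  # wall time only: the hook can't tell sleeping from computing

sys.setprofile(hook)
slow()
sys.setprofile(None)

for (filename, func), seconds in sorted(totals.items()):
    print(f"{func} ({filename}): {seconds:.3f}s")
```

Note that the hook fires on every single call and return, which is exactly where the overhead problem described next comes from.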
The way it works is that it hooks itself on function calls, which is also the granularity you get in the output. So if your program is just one giant function, you will get nothing interesting in terms of information, and if your program is a lot of very small functions calling each other, it's going to have so much overhead that your program runs 10 or 20 times slower. That's why it's pretty hard to use this kind of profiling on production systems: the more functions you have in your program, the slower it's going to be. cProfile in particular has one more problem, which in our case was an issue: it exports its data in a custom file format, the pstats format if I remember correctly, which is not something that's easy to consume from any other programming language. So for all of those reasons, it's impossible to use something like deterministic profiling and cProfile on a production system. You can't afford to have your web server run 20 times slower just because you're doing profiling.

The other way of doing profiling is different: you run your program normally, you don't change anything, but every now and then, like every 10 milliseconds or so, you interrupt your program and look at what it's doing right now. If you do that often enough, you get a statistically good representation of what your program usually does. So it's not the absolute truth, contrary to something like cProfile which really meters every call; you just collect, over a few hours or a few days of your program running, a number of samples showing how many times you saw each stack trace, and that gives you a measure which is roughly statistically accurate.

The upside of doing that is that you can actually measure a lot of different things: the wall time, but also the CPU time, which tells you whether you spent the time just doing I/O or actually using the CPU and computing results. And it has a very low overhead. The way it works, it leaves your program doing what it usually does and doesn't try to wrap the functions or intercept them or anything; it just executes something in the background asking "what is it doing right now?", and that's it. In Python, at least, you can go down to the line of code being executed, so you're not just talking about functions: you can see which line in particular is running. In some languages, not Python unfortunately, you can even go down to the opcode, the precise operation within the line; that's not possible in Python yet unless you instrument your code. And you can get other things, which I'll talk about later: you can see the exceptions being raised in your program, and through them see the control flow of your program.

So what I'm going to talk about next is what we did at Datadog implementing this. Before building anything, I spent a lot of time checking the state of the art, the thousand Python profilers out there. None of them does all of that. A lot of them are actually very good at drawing fancy flame charts and things like that, which is not the problem I had to solve: I was only interested in getting the data and exporting it in a proper data format, which is what I'm going to talk about now.
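Before diving into that, here's a minimal sketch of the statistical idea itself. This is my own illustration using a SIGPROF interval timer, not the Datadog implementation described next: the OS interrupts the process at regular CPU-time intervals and we record whichever stack was executing.

```python
import collections
import signal
import traceback

samples = collections.Counter()

def on_sample(signum, frame):
    # Collapse the interrupted stack into a tuple of "function:line" frames.
    stack = tuple(f"{s.name}:{s.lineno}" for s in traceback.extract_stack(frame))
    samples[stack] += 1

# Unix-only: ITIMER_PROF counts CPU time and delivers SIGPROF when it expires.
signal.signal(signal.SIGPROF, on_sample)
signal.setitimer(signal.ITIMER_PROF, 0.01, 0.01)  # sample every ~10 ms of CPU

def busy():
    return sum(i * i for i in range(5_000_000))

busy()
signal.setitimer(signal.ITIMER_PROF, 0)  # stop sampling

for stack, count in samples.most_common(3):
    print(count, "samples ending in", stack[-1])
```

The more often a stack trace shows up in the counter, the more CPU time that code path is consuming; no function is ever wrapped, which is why the overhead stays low.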
The first problem I had to solve was how to get the CPU and wall time information. The way I solved it is with a thread: you run a parallel thread in your Python program that sleeps most of the time, and let's say every 10 milliseconds it wakes up and uses this private function that has been there forever and that everybody uses, so it's not really private: sys._current_frames. What this function returns is a dict where the key is a thread and the value is, basically, a stack trace. So if you do that every 10 milliseconds, regularly, you can see each time what is being executed on each thread, and you can construct a model where you know, more or less in real time, what your program is doing every 10 milliseconds.

When you do that, you can also use, and this part is Linux-specific, the function pthread_getcpuclockid, which returns a special Linux clock that measures the CPU time each thread is consuming. So every 10 milliseconds you get a stack trace, and you can associate that stack trace with the number of nanoseconds or milliseconds of CPU that the thread has used. At the end you get a pretty good representation of what each of your threads does and how much CPU it uses. You can also get the wall time, because for that you can use a regular clock.

So you get this kind of output. This is an example from a program with four threads, that's why, you may not be able to see it, but the root span on top shows about four minutes of wall time in total, because you have four threads each running for one minute, and you have the detail of which function has been called in which thread. You have the same data for the CPU time, which is way less: only three to five seconds of CPU time being used. The output looks a bit the same, you see the same stack traces, but in the wall time view a lot of the functions are wait functions, because they're doing nothing. If you were using something like cProfile, you would see a lot of time being spent in those wait functions, which is not really interesting, because it's just a wait function. Whereas if you look at the CPU time, using the data you got from pthread_getcpuclockid, you get the actual number of nanoseconds spent on the CPU, and you can see, for example, that the wait functions don't show up in the chart below.

So it's a much better way to do profiling on a production system, because it has a very low overhead. For example, on my laptop, if I profile a Python program with 10 threads, each executing a 30-function-deep stack, I can poll and generate 100 samples per second for only 1% of CPU. That's already pretty good, and it can get cheaper: the way it works, you sleep for about 10 milliseconds between samples, but you can increase that amount of sleeping, so if you want less impact on your performance, you can dynamically adapt the sleeping time.

Like I was saying, you can also get exception profiling information. You need to do a bit of C for that, because it's not exposed in Python, but you can get the exception currently being handled by a try/except block, the equivalent of sys.exc_info, a few frames before it's caught, and see the control flow of the program. And you can limit the resource usage: like I was saying, you can basically sleep more if you need to.
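Put together, the core sampling loop described above looks roughly like this. This is my simplification of the approach, not Datadog's actual code; time.pthread_getcpuclockid is Linux-only and exposed in Python since 3.7.

```python
import collections
import sys
import threading
import time
import traceback

def sample_once(last_cpu, cpu_by_stack):
    frames = sys._current_frames()  # dict: thread id -> topmost frame
    for thread in threading.enumerate():
        frame = frames.get(thread.ident)
        if frame is None:
            continue
        try:
            clock_id = time.pthread_getcpuclockid(thread.ident)  # Linux only
            cpu_now = time.clock_gettime(clock_id)  # CPU seconds this thread used
        except (AttributeError, OSError):
            continue  # thread may have exited, or platform lacks the call
        delta = cpu_now - last_cpu.get(thread.ident, cpu_now)
        last_cpu[thread.ident] = cpu_now
        stack = tuple(f"{s.name}:{s.lineno}" for s in traceback.extract_stack(frame))
        cpu_by_stack[stack] += delta  # weight the stack by CPU time since last sample

def profile(duration=1.0, interval=0.01):
    last_cpu, cpu_by_stack = {}, collections.Counter()
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        sample_once(last_cpu, cpu_by_stack)
        time.sleep(interval)  # raise this to trade resolution for lower overhead
    return cpu_by_stack
```

Using a regular clock such as time.monotonic instead of the per-thread CPU clock in the same loop gives you the wall time view.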
Right now the only downside is that we don't do C profiling, so if you have a Cython module or a C module, you won't see any information below the Python stack. That's something I'd like to solve in a next release.

We also do memory profiling. Memory profiling is way easier, because I would say most of the work has already been done in Python: there's a module since Python 3.4 called tracemalloc. What tracemalloc does is replace the malloc function used by Python to allocate the memory underneath the objects that you use with its own wrapper, which allocates the memory but also keeps a trace of the allocation. tracemalloc can then export information about which line of code allocated memory, how many megabytes, and how many times it was called. So it's pretty easy to get this kind of information without doing too much work.

The problem is that tracemalloc wraps and counts every single object allocation you do, which again makes your program very slow; that's not something you can do on a production system. If you have a program that does a lot of allocations and gets four or ten times slower just because you're tracing the allocations, it's a problem. The trick here is, again, to do sampling and statistical aggregation of the results. The algorithm is pretty simple: you sleep most of the time, and once in a while you wake up and activate tracemalloc, then deactivate it, so you either sample the memory allocations or you don't. If you do that 5% of the time, it's usually okay: you only get 5% of the allocations tracked by tracemalloc, but over a few minutes or a few hours you get a pretty good representation of the allocation patterns in your application.

It gives you two results, as I was saying: the number of allocations that each of your functions does, and the size of those allocations. The two charts look a bit the same, but the widths change depending on whether an allocation allocates bigger objects or not. This is the kind of result you get at the end, with the stack traces and the number of megabytes that you allocated. So, pretty easy again to debug your program and see what's costly in terms of memory. Note that this only counts the allocations that you make, not the size of the memory in use: it's the number of times you allocate memory, and it doesn't track when you free it or stop using an object.

Right now tracemalloc has some limitations. Like I was saying, the overhead: it's really slow, really really slow, and I'm not sure right now how much you can improve it and make it faster. There's also no context information: you don't know which thread did the allocation, so you can't map allocations to threads. And it's limited in the information it reports: for the CPU and wall time earlier, it was easy to get the function name, because it's reported in the stack, but tracemalloc only gives you the line number and the file, so you don't know which function or which class is doing the allocation, or even what type of thing you are allocating; it's pretty light in that regard. That's something I'd like to improve in the future.
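Before moving on, here's a minimal sketch of that duty-cycle trick. It's my own illustration of the technique, not the actual Datadog code; the one-second period is an assumption, the 5% figure comes from the talk.

```python
import time
import tracemalloc

def sample_allocations(total_seconds=60.0, period=1.0, duty_cycle=0.05):
    """Trace allocations ~5% of the time and aggregate the sampled stats."""
    stats = {}  # (filename, lineno) -> [allocation count, total bytes]
    deadline = time.monotonic() + total_seconds
    while time.monotonic() < deadline:
        tracemalloc.start()
        time.sleep(period * duty_cycle)          # trace for ~50 ms...
        snapshot = tracemalloc.take_snapshot()   # must snapshot while tracing
        tracemalloc.stop()
        for stat in snapshot.statistics("lineno"):
            frame = stat.traceback[0]            # tracemalloc: file and line only
            entry = stats.setdefault((frame.filename, frame.lineno), [0, 0])
            entry[0] += stat.count
            entry[1] += stat.size
        time.sleep(period * (1 - duty_cycle))    # ...then sleep for ~950 ms
    return stats  # statistical estimates, not exact totals
```

Note how the keys are (file, line) pairs: that's the function-name limitation described above, visible directly in what tracemalloc gives you.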
Another thing where we do a bit of profiling is threading. A lot of people don't do multi-threading in Python, because of the global interpreter lock and how it makes things slow, etc., but sometimes you still need to have threads running. When you build an application with a lot of threads, you need to synchronize on something, and usually you use a lock, which is the most basic object you can get in the threading world in Python. And when you have a lot of threads trying to acquire that lock, you might have contention, and it might make your program slow. There is basically nothing in Python to give you any kind of information about whether your thread is blocked trying to acquire a lock, whether it's waiting too long, whether it's waiting forever; it's really hard to know.

The way we solve that is that we basically wrap the threading.Lock instances: each time you create a new lock, you actually get a lock which is not a real lock, but one where, under the hood, we intercept the calls to see when your program acquires it and when it releases it. That makes it possible to get information about how much time the threads spent trying to acquire the lock, and how much time they spent holding the lock while other threads were waiting. You get a lot of information; for example, you get the total lock acquire wait time, which is the number of milliseconds, usually not that long, that your program waited to acquire that lock. With this kind of information you can see that you may need more locks, or to split your program into different parts so it doesn't rely on only one lock but several locks, etc. With all of this information you can redesign your program, which is not always easy or possible, but sometimes it is, and you see all of that broken down by the different threads that try to acquire the different locks.
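Here's a rough sketch of that wrapping idea. This is a hypothetical ProfiledLock of mine, not the Datadog wrapper; the real thing patches threading.Lock transparently, so application code doesn't change.

```python
import threading
import time

class ProfiledLock:
    """Lock wrapper that records time spent waiting for and holding the lock."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self.total_wait = 0.0  # seconds spent blocked in acquire()
        self.total_hold = 0.0  # seconds the lock was held

    def acquire(self):
        start = time.monotonic()
        self._lock.acquire()
        self._acquired_at = time.monotonic()
        self.total_wait += self._acquired_at - start  # contention shows up here

    def release(self):
        self.total_hold += time.monotonic() - self._acquired_at
        self._lock.release()

    # Support "with lock:" just like a real threading.Lock.
    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, *exc_info):
        self.release()
```

The counters themselves aren't thread-safe in this sketch; a real implementation would guard them, but it shows where the wait and hold times come from.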
Once we had built all of that, we had to export all the data to use it, which is its own challenge. Like I was saying, cProfile has its own custom format, which is not very interesting and not useful for us. There is one format that is pretty widely used, the Callgrind format: it comes from Valgrind, which does profiling for C, and it has its own text-based format that you can use to export data. The problem is that it's text-based, so it's easy to consume, but it's not efficient in any way, so that's not a good solution for us either. Most of the other Python profilers directly export HTML, so that never gave us any clue about how to export to something usable from other programs. Until we found pprof.

If you don't know pprof, it's a lot of things: it's the name of the profiler that is in the Go programming language, it's also the name of the tool they use to display the data, and it's also the name of the format. It's the same word for three different things. The pprof format is pretty cool, and it's what you use if you do Go profiling: you can export all of your data in the pprof format. It's a format based on protobuf, which makes it pretty easy to understand: there's a schema with all the fields documented, etc. It's easy to use, protobuf is easy to use in Python too, and it's very fast to serialize; it would be a shame to be fast at gathering all the data about your Python program and then be very slow at exporting it. The way the pprof format works, it doesn't export everything you gathered, like every sample with every timestamp; it aggregates all of that. You compute the sum, the average, the minimum, whatever you need, and you export that with the stack traces. And it's pretty space-efficient: it uses a pretty simple model with a string pool, because there's a lot of repetition, usually all the function names you see are the same and the stack traces are most of the time the same, so all of that is pretty well organized in the schema, and it's gzipped by default. So it's pretty fast to generate and pretty small to store: for example, on average we use 20 kilobytes of storage for one minute of profiling of one Python process, so it's easy to store even for a few hours or a few days.

pprof is also the name of the tool, which you can install pretty easily if you know a bit of Go, but it's not that hard. It can generate this kind of flame graph, on the top left, where you see all the stack traces, and the width is going to be either the size of the allocations you do in memory, the time you spend on CPU, etc., depending on the data you gathered. Or you can use this kind of graph below, where each rectangle is one function and everything is connected by the number of times they called each other, etc. It's pretty hard to read when you're not used to it, but I promise, if you spend a few minutes on it, pprof is pretty easy to grasp and to understand. And it's easy to use: you can open any pprof file with it, whether it comes from Go or Python or anything.

So what we did at Datadog is an open source library, which is not yet public just because it's too early, but I'm working on it. It's going to be integrated into our tracer library, which is already open source, so the profiling part is going to be integrated into that. And if you want to use it, that's fine, because it also supports exporting the data to files, so you'll be able to use it directly, export all the data in pprof format, and use the tool I showed just before. As I said, it's not released yet, so if you want to know when it's released, you can follow me on Twitter or anywhere and I'll announce it. It should be in a few weeks, no promise, but I hope so. And that's it. I have four minutes for questions.