OK, so this is going to be somewhat of a hands-on talk. I'm Noam Elfenbaum. I work at BlueVine, which is a fintech startup in Israel. I help organize PyWeb-IL, the monthly meetup, and PyCon Israel, which is still kind of new, only three years old, so you're welcome to hop on to the next one. You can find me on Twitter and at noamelf.com.

So what is a program profile? A profile is something we produce for a program in order to understand its resource consumption. We can use it in production to monitor our service, and we can also use it locally to debug performance problems we see while we're working. Python, being a batteries-included language, has a built-in profiler in the standard library. Actually, it has two of them: one is a pure Python implementation, and later someone contributed cProfile, which is a C extension of the same profiler, just faster.

We're going to start with a demo, which is the easiest way to understand this. We have this module in production, and someone noticed it's being kind of slow. What it does: it applies power twice to a number it receives, counts the digits of that number by converting it to a string and taking the length, then does the calculation and prints it out. So let's run it live. We're running profiling_demo. It takes quite some time to run; we're running it locally, trying to debug, and it will come up soon. OK, so it takes 9.1 seconds to run. Let me make the screen a bit larger and sort out the terminal colors so you can see it well enough; you can see the number here if I highlight it like this. So it took 9.1 seconds to run.

What we're going to do now is run it under the profiler. We run Python with -m to run the cProfile module, and use -s to sort by the total time spent in each function. Let's run the code again; it will take the same length of time, but this time we'll see the results. Now, before we see them, what do you think? Which part of the program is making it slow: the super_power or the count_digits? Who thinks it's super_power? OK. And who thinks it's count_digits? Oh, you're good. I must say I gave this talk at another conference in Israel, and the ratio there was quite different.

So apparently counting digits is very expensive in Python, because we need to convert a very large number into a string. The len call itself is not expensive; it's the conversion to a string that is. We can see count_digits takes about 8 of the 9 seconds, while super_power only takes 200 milliseconds. Good job, let's continue.

Now we want an optimized version that runs much faster. I've created one; can anyone guess what it does differently? Yes? Not modulo, something else? The log function? Right, exactly: it uses the log10 function, and then it runs much, much faster. I hope you didn't see it when I flashed it on screen for a second there. Let's try it now: 439 milliseconds, much better for us. But the profiler output is much bigger.
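The demo module itself isn't shown in the transcript, so here is a minimal sketch of what it might have looked like. The function names, the input size, and the exact bodies are assumptions; only the slow str-based digit count and the log10 optimization come from the talk.

```python
# profiling_demo.py: a sketch of the slow demo; names and sizes are guesses.
import math  # used by the optimized count_digits variant below
import sys

# Python 3.11+ caps int-to-str conversion at 4300 digits by default;
# lift the cap so the deliberately slow path below can run at all.
if hasattr(sys, "set_int_max_str_digits"):
    sys.set_int_max_str_digits(0)

def super_power(x):
    # Two rounds of exponentiation produce a very large integer.
    return (x ** x) ** 2

def count_digits(n):
    # Slow: turning a huge int into a decimal string was quadratic in
    # the number of digits before Python 3.12.
    return len(str(n))
    # The optimized version from the talk uses log10 instead:
    # return int(math.log10(n)) + 1

if __name__ == "__main__":
    n = super_power(60_000)  # size tuned to take a few seconds
    print(count_digits(n))
```

Running it under the profiler would then look something like `python -m cProfile -s tottime profiling_demo.py`, which prints a per-function table sorted by total time.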
You just see this very large output because of the imports; an import involves a lot of function calls, so you get a very big output. At the top, you see that super_power is now the most expensive function. OK, cool.

So it looks like magic, right? How does Python do that? It's so nice to have, but when I saw it, I wondered what mechanisms make it possible. Python gives you pretty easy access to its call stack: from anywhere in your program's execution, you can see the whole stack of how you got there. So what most profilers do is run as part of your Python process, set up a trigger to capture the stack, and then do the analysis on it later.

Let's see how we access the process call stack. We can easily get at the stack using the sys._getframe() method, convert it to something we can work with using the traceback module's extract_stack, and print it out. So let's run it. We have some helper functions here just so we can see the hierarchy of the code when it runs. This is the actual code. When we run it, the first line shows we're at module level, calling bar. The second line is in the bar function, where we call show_stack. And inside show_stack, we see the for loop. So just by calling sys._getframe(), we can easily access the Python stack.

Now we're missing the trigger: something has to decide when the stack gets captured. In general, profilers have two types of triggers. A deterministic trigger, which gives you a deterministic profiler, fires on every function or line call, so you can see the exact execution path. A statistical profiler is triggered on a fixed time interval instead.

So let's see how you build a deterministic profiler. Python lets you register a callback that gets run whenever a function or line is executed, using sys.setprofile or sys.settrace. The difference is that sys.setprofile calls the handler on function calls and returns, including calls into C functions, while sys.settrace can also call it on every single line. Whenever it's called, we'll record the stack.

Let's see an example of that. We're using sys.setprofile. Here is our handler: it receives the current frame, an event, and an arg. We'll see the output when we run it; in general, the event is a call to a function, the arg is the function being called, and the line is just something we computed here, the line it's on. Then we install the profiler through the sys module, setprofile, with our handler, and we print "hello world", just for fun; we could execute whatever code we want.

So let's see how that goes. We run the set_profile script, and now we can see the output; let's also keep the code visible so it's easier to follow. You see that the first event received is a c_call. It's a c_call because Python's print function is actually implemented in C, and we see the line itself, print("hello world"). Then we get another event, c_return, when the function finishes. So think about building a timer: you get the time it started and the time it ended.
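Here is a minimal sketch of that stack-access demo. The helper names bar and show_stack follow the talk; everything else is an assumption.

```python
# stack_access.py: print the current call stack from inside the program.
import sys
import traceback

def show_stack():
    # sys._getframe() returns the frame that is currently executing;
    # extract_stack() walks its chain into readable FrameSummary objects.
    for entry in traceback.extract_stack(sys._getframe()):
        print(f"{entry.filename}:{entry.lineno} in {entry.name}: {entry.line}")

def bar():
    show_stack()

bar()  # module level -> bar() -> show_stack(), as seen in the output
```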
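And a sketch of the sys.setprofile demo; the handler's exact print format is a guess at what was on screen.

```python
# set_profile.py: a deterministic trigger that fires on every function
# call and return, including calls into C functions such as print.
import sys

def handler(frame, event, arg):
    # event is one of 'call', 'return', 'c_call', 'c_return', 'c_exception'.
    # For the c_* events, arg is the C function object being called.
    print(event, arg, f"line {frame.f_lineno}")

sys.setprofile(handler)
print("hello world")   # produces a c_call / c_return pair
sys.setprofile(None)   # uninstall the handler
```

CPython suspends profiling while the handler itself runs, so the print inside the handler doesn't recurse.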
And then you get one more event, which is the return of the whole program. So now we have our deterministic profiler, and we know how to trigger it.

But how do statistical profilers work? Like we said, they sample the program on a given interval that we decide in advance, instead of on every line or function call. You can implement this in several ways, but one of them is to use an OS interrupt, an OS signal. So let's see the code that uses it.

We have our handler, which is very similar to the handler in our deterministic profiler. When it's called, it receives the signal number, which is less interesting for us, and the current frame. We extract the call stack from it and print it to the screen; in a real profiler, we'd capture the results for later analysis.

This is the interesting part for us: how we set the actual signal. We use SIGPROF, a signal that exists specifically for profiling, and we register the handler on it, so whenever the signal arrives, the handler fires. Then we put a timer on it, so the OS will keep sending the signal at the interval we choose: this argument is the first delay, and this one is the interval for everything after it. We also need to do something else: remove the timer when the program exits. For that there's a pretty nice module, atexit, that I wasn't familiar with; it lets you execute a function when the interpreter exits, and we just use it to set the timer back to zero.

OK, so in the last part, we call the function that sets the trigger and starts the sampler, then we run our calculation and print it to the screen. Now, let's run it in our terminal. So this is our code, and it's running. Now you're going to see something pretty interesting: we only get the len(str(...)) part of the code being sampled, like before. And I guess you know why, right? Since we're doing statistical sampling, we don't see everything that gets executed; we only sample at a certain interval, so the most expensive part dominates the samples. What you get is less intrusive, so it won't impact performance as much, but you also get a less accurate analysis later.

So now we know how to do both, which brings us to the practical question: when do we use which profiler? Can someone say when we'd use the statistical one and when the deterministic one, which use cases fit each? Yes... you wouldn't use it in production? Good instinct, and that's a big part of the answer. The statistical profiler is controllable and predictable: you know what interval you're going to sample at, so you know what the effect on your production system will be. That makes it much more suitable for production monitoring, but it's less accurate. When you debug locally, you might want the deterministic profiler instead: it introduces a fixed overhead on every call, but you don't really mind that when you're working locally. Yeah, good instinct.

So, for the most interesting part of our talk: we want to build a profiler of our own. We have all the parts; we just need to merge them together. Let's see.
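A minimal sketch of that signal-based sampler, assuming a 10 ms interval; it is Unix-only, since it relies on SIGPROF and setitimer.

```python
# signal_sampler.py: a statistical trigger driven by an OS timer signal.
import atexit
import signal
import sys
import traceback

def sample(signum, frame):
    # The OS delivers SIGPROF while our code runs; `frame` is whatever
    # was executing at that moment. A real profiler would store this
    # stack for later analysis instead of printing it.
    print(";".join(entry.name for entry in traceback.extract_stack(frame)))

def start(interval=0.01):
    signal.signal(signal.SIGPROF, sample)
    # First value is the initial delay, second the repeat interval;
    # ITIMER_PROF counts CPU time consumed by the process.
    signal.setitimer(signal.ITIMER_PROF, interval, interval)
    # Disarm the timer when the interpreter exits.
    atexit.register(lambda: signal.setitimer(signal.ITIMER_PROF, 0, 0))

if __name__ == "__main__":
    if hasattr(sys, "set_int_max_str_digits"):
        sys.set_int_max_str_digits(0)  # allow the huge int-to-str below
    start()
    # A loop of expensive conversions; Python only runs signal handlers
    # between bytecodes, so looping gives the handler a chance to fire.
    for _ in range(30):
        len(str(12_345 ** 12_345))
```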
OK, so we already know how to set a sampling interval: we set up the signal, put the timer on it, and register its removal for when the program exits. The sample method itself is a little different now, because we want to collect the data. What we do is extract the stack, and then there's a nice trick: we join the stack layers into a single string separated by semicolons, and we count each time we arrive at that exact path, so we know how much time we spent there. We keep it in a global variable; this is a naive profiler, so we're allowed global variables and stuff like that. When the program finishes, we call format_stats and output everything in a format that the flame graph tool, which I'll use to show the results nicely on screen, understands: the frames joined with semicolons, like we said, followed by the count of how many times we sampled there.

Let's go back to our command line. Oh, first I need to prove to you that it's actually 25 lines, right? So here we are, at 25 lines of code, and I was pretty generous with the whitespace; it can be shorter if you're interested. So you can't say I'm cheating: the imports are on separate lines, there are blank lines here, so there's room to squeeze in a lot more.

Now let's use it. I've created a demo program that does a simple calculation. It has two functions: the main function calls the calc function, which does power. I'm using power all the time because it takes the CPU some time to execute. We do power 100,000, then power 200,000. We call the start method on the profiler, run the code, format the stats, and print them to stdout.

So let's run our code. OK, this is the format, and in case it wasn't clear earlier, now you can see it more clearly: each line is a whole stack, and next to it is the number of samples we spent in it. We were in main, then calc with 200,000, 79 times, and in calc with 100,000, 24 times, which makes sense because 200,000 is the larger number, right?

Now, this output is readable enough when we're profiling a very simple program, but it won't be in a larger one. To really review it, we can pipe it into the flame graph tool and open the result in our browser. I'll credit the guy who invented it later; I really like it: it takes a really simple format and produces really nice output, an SVG that shows a diagram of the relative parts of the program where time was spent. We can see that main had 91 samples, 100% of them; calc(100,000) had 24% of the samples, and calc(200,000) had 75%, which is pretty clear, right? It makes sense for us.
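The 25 lines themselves aren't pasted in the transcript, so here is a sketch of what they plausibly looked like, following the talk's description: semicolon-joined stacks counted in a global, output in the flame graph "collapsed" format. The names start and format_stats match the talk; the demo bodies at the bottom are assumptions.

```python
# profiler.py: a naive statistical profiler in about 25 lines (Unix-only).
import atexit
import collections
import signal
import traceback

stats = collections.Counter()  # "main;calc" -> number of samples

def sample(signum, frame):
    # Collapse the current stack into one semicolon-joined key and count it.
    stack = traceback.extract_stack(frame)
    stats[";".join(entry.name for entry in stack)] += 1

def start(interval=0.005):
    signal.signal(signal.SIGPROF, sample)
    signal.setitimer(signal.ITIMER_PROF, interval, interval)
    atexit.register(lambda: signal.setitimer(signal.ITIMER_PROF, 0, 0))

def format_stats():
    # flamegraph.pl's collapsed format: one "stack count" pair per line.
    return "\n".join(f"{stack} {count}" for stack, count in stats.items())

if __name__ == "__main__":
    def calc(n):
        # Many small power operations rather than one giant one, so the
        # signal handler can run between bytecodes; the original demo's
        # exact body is a guess.
        for i in range(n):
            i ** 50

    def main():
        calc(100_000)
        calc(200_000)

    start()
    main()
    print(format_stats())
```

Rendering it is then a pipeline along the lines of `python profiler.py | flamegraph.pl > profile.svg`.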
So this was a simple program, but let's take a more complex one: downloading PEPs to our local hard drive, which is still simple, but complex enough for our talk. What this program does: it composes a URL, goes to python.org, downloads the PEP using urlopen, reads the response into memory, and then writes it to a temporary file on disk. The main function loops through four PEPs in this example, runs save_pep, and reports when each one is done. We print the progress to stderr because we want to keep stdout free for the pipe. And that's it. Then we call our profiler: start, run the main function, and print out the stats.

So, let's see. Like I said earlier, I'm counting on the wifi here; actually I'm surprised it's holding up, I thought I'd have to use an ethernet cable. What we get is an output we can't really understand anything from, and here the flame graph tool comes to the rescue: we run everything again, and in the end we receive a very nice diagram. Again, it shows how often each part of the program was sampled. We can really dig into the diagram and start to understand things; actually, it's also nice simply for understanding code. If you want to understand the standard library or things like that, you just see the execution path. And if we want to focus on a certain place, we can click on it, and everything zooms in on that part so we can see where it spent its time: talking to sockets, opening them, and so on. We won't really get into that, but it's something you can do with your own code to get a good notion of where the time went.

Another nice thing, by the way, while we're here: it works really nicely with grep. For example, I can grep the results and remove something; let's remove every place we have a connect method. Then we receive a much simpler diagram. It's a bit funny here because I just removed an arbitrary string, but you can do it for yourself: if you don't want to see the output of, say, a certain library, you just do grep -v and it's gone.
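The downloader we just profiled isn't shown in the transcript either; here is a sketch matching the description, where the URL pattern and the specific PEP numbers are assumptions.

```python
# pep_demo.py: download a few PEPs and profile the work.
import sys
import tempfile
from urllib.request import urlopen

import profiler  # the 25-line profiler sketched earlier

def save_pep(number):
    # Compose the URL, fetch the PEP, and keep the body in memory.
    url = f"https://peps.python.org/pep-{number:04d}/"
    body = urlopen(url).read()
    # Write it out to a temporary file on disk.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(body)

def main():
    for number in (8, 20, 484, 572):
        save_pep(number)
        # Progress goes to stderr so stdout stays free for the stats pipe.
        print(f"saved PEP {number}", file=sys.stderr)

if __name__ == "__main__":
    profiler.start()
    main()
    print(profiler.format_stats())
```

Rendering and filtering would then look something like `python pep_demo.py 2>/dev/null | grep -v connect | flamegraph.pl > peps.svg`.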
OK, so we saw how it works, and I think we've come to the references part. The FlameGraph tool was built by Brendan Gregg, I think at Netflix, and it's really nice; you can use it for all kinds of things: write a profiler, produce this output format, and visualize whatever you want to profile. Julia Evans had a really nice blog post about Ruby and Python profilers, which inspired this talk. Also pretty interesting: Nylas, a San Francisco company, implemented their own performance monitoring service as just a simple script, and they use it across their whole production environment. At my company we use New Relic, which is really nice, but it has some problems, for example how long it keeps your data, I think two weeks or so, and you need to pay for everything if you want more. And there are the Python profiler docs, and this talk can be found on GitHub.

OK, thank you. Any questions? We have five minutes; if you have a question, please line up at the microphone over there, don't be shy. If everything was very clear, that's also okay. Okay, excellent, thank you.

First question: how would you profile the profiler itself? Oh, very meta. Actually, you do see the profiler's own code in your profiling output, which is pretty funny. We might be able to see it here as well; there, you see? Here we can see extract_stack, which is what we use, so it does show up in our profiling. How would you profile it? I don't know. You do see it as part of the profiler output, and you'd probably want to keep it out, so you'd do grep -v and kick it out of your profile. OK, thanks.

Next question: I won't let you off the hook either; I really liked the talk, but what about profiling a multiprocessing kind of workload, with multiple processes? Yeah, so that's what Nylas does for their real workload: they have another service which takes the outputs of the different processes separately and merges them together, and then you can see across your whole system, all your processes, where the time is being spent. So you just need to create another service for it; they created another web server. Okay, thanks. What's the next question?