I have quite a few slides to get through, so I'm going to be speaking fairly quickly, and I don't think I'll have a lot of time for questions at the end. But if you have any questions, please email me or contact me on Twitter. The topic of my talk is "Why is it slow?" We see many talks over the years focusing on how to improve performance in software, but we don't see very many talks about how to go about analyzing performance problems, or figuring out why it is slow.

A little bit about me: as you heard in the introduction, I'm an engineer at Capiche. Capiche is a natural language processing company, and we're trying to solve the problem of how to analyze customer feedback at scale. I made a Learning Cython video course a couple of years ago, in 2016, which is still pretty relevant today. Very few people know that it exists, because it's only really available on the Safari platform at O'Reilly. And recently, I also wrote a book on asyncio.

The goal of this talk: what I really want to do is give you a little bit of my experience over the years solving performance problems. I want to give you a strategy for how to approach the problem, and a little bit of structure for how to think about it in different scenarios. But I'm not going to be talking at all about how to actually make things faster, or about any kind of low-level diagnosis: hardware issues, CPU registers, caches, and so on.

I've broken the talk up into four sections. The first section is going to be on fundamentals: what you really need to be thinking about when tackling a new performance issue. In section two, we'll talk about what you can do when it is easy to run the code, when you can run the code over and over again in your dev environment and you have a great deal of freedom in how you analyze the problem. In section three, we'll talk about when it's not quite so easy to run the code, and some strategies you can use there. And in section four, we'll talk about distributed systems and microservices, and how your thinking needs to change in that kind of scenario. So let's jump right in.

Okay, section one: fundamentals. There's really only one fundamental that I want to drive home in this talk today. The most important thing you need to obtain when analyzing a performance issue is this: while the system is being slow, what is it trying to do? Not in a fine-grained, detailed way, but in an overall way. You really want to understand what the whole program is trying to do while you're having this performance problem. And I really want to emphasize that. It is super easy when analyzing performance to focus on one particular function or one particular line and try to optimize just that thing. But by doing so, you miss out on enormous performance opportunities where, by slightly changing the design of the program, you can get a massive boost in performance. So I'm going to emphasize that again: you want to figure out what the code is trying to do while it is being slow, and you have to look beyond the current line or the current function. This knowledge gives you the biggest leverage to make a huge impact. You'll often discover in practice, when you find the cause of a performance issue, that the code is producing the same result over and over.
For example, it might be initialization. In the modern age, with all these machine learning models, we sometimes find that model initialization happens over and over again at the start-up of a program, and sometimes you can save that result, cache it, and just reuse it on subsequent start-ups. But the best outcome, and this happens surprisingly often, is that when analyzing a performance problem you find that a slow path in the code is not even required. It may be there for legacy reasons. It may be that the people who worked on that code several years ago are no longer in the team, and no one really quite knows what the code does. When you discover such an opportunity, you can get a huge performance boost simply by deleting that code. The best optimization you can do is to not run a certain line at all, and this by far beats any kind of algorithmic improvement. If you only focus on micro-optimizations, it's easy to miss the really big wins.

So, long story short, the concrete thing you want to obtain when analyzing a performance issue is the call stack. This is key. While the program is being slow, you want to see the line or function it's currently busy with, as well as the entire call stack showing how you got to that line in the flow of the program. You want to know what initiated the slow code, and why, because this gives you the leverage to make strategic design changes.

Before we get into more concrete details, I want to talk a little bit about tools. If you search for Python performance tools, you will find many, many results, and these stretch back over the past 15 or 20 years. There are many tools in this space, but my focus in this talk is to give you a short list of the things I personally use at work and in my personal projects. My criteria for what I choose to use try to maximize the product of simplicity and impact. It has to be easy to use, and it has to be easy to get results quickly, because it's really easy to get bogged down in very complex tooling where you don't get a lot of bang for buck from those sophisticated tools. And the final tip I'll give you in this section: instead of choosing the latest and greatest tool and then finding out what you can do with it, it is much better to first decide on the kind of data you want the tool to give you, and only then select the tool that can give you that data. I've already told you that what I'm really interested in is the call stack, so I tend to focus on tools that can give me that.

So here's our first concrete section: when the code is easy to run, you have all of your tools at your disposal, you can run it on your dev machine, and you can throw many strategies at the problem. What is my go-to approach for that scenario? Local code is easy to run: you can run the program as often as you like. Sometimes you do have to rearrange things, but having tests makes it really easy to run different parts of your program independently. That's really key. Tests are not only about preventing regressions; the way they break your program into small, independently runnable pieces also helps with performance analysis. For example, if you're using the pytest test runner, you can use the -k parameter to specify a keyword that narrows down which tests get run.
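For example (a minimal sketch, with hypothetical module and function names), a small throwaway test gives you a repeatable entry point for just the code path you suspect is slow:

```python
# test_perf_scratch.py: a throwaway test written purely for performance analysis.
# The module and function names here are made up; substitute your own slow path.
from myapp.pipeline import import_spreadsheet


def test_perf_import_spreadsheet():
    # The assertion barely matters; the point is a small, repeatable entry
    # point that exercises only the suspected slow code.
    result = import_spreadsheet("fixtures/feedback.csv")
    assert result is not None
```

You would then run only this test with something like pytest -k perf_import_spreadsheet, and later point any of the profiling tools at exactly that invocation.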
And I have frequently created tests specifically to analyze a performance issue, not even intending to commit those tests to version control, but only so that I can invoke the code using pytest and pytest plugins, which we're going to get to in a couple of slides.

So my first big trick is something you may not have come across before, but the idea is that you can just use Ctrl-C as a performance analysis tool. This is by far the best bang for buck you're going to get. The idea really is: you run your program, and while it's being slow, you simply hit Ctrl-C. What you are immediately presented with is a stack trace, and the stack trace tells you what your program was doing while it was being slow. In particular, if there's a big skew between the code that is fast and the code that is slow, when you get the stack trace after hitting Ctrl-C, you almost always see the same trace, because that body of code is consuming most of the time. Mike Dunlavey has quite a detailed Stack Overflow post about this, and he calls it stack sampling. If you're in a hurry and you can manually interrupt your program under a debugger, there's a simple way to find performance problems: you just halt it several times and look at the stack trace. If there's some code that is wasting some percentage of the time, say 50% or whatever, then that is roughly the probability that you're going to see it in the stack trace that gets shown after you halt the program. This seems too simple to actually work, but in fact it works surprisingly well, and in exchange for very little upfront cost.

I'll run through a quick example, and I'm going to be moving pretty quickly here, because I did a run-through of my talk earlier today and I ran out of time, so I don't want to spend too much time on this; I'm really just going to hit the high notes of the concept. Here I have a fake program that does a few silly calculations. There are three functions, f1, f2 and f3, and in the main function we simply call those three. We're going to run this program and find that it takes about two seconds, which is a little longer than we think it should take based on the complexity of the code. We're going to try our approach of hitting Ctrl-C during those two seconds, and then we're going to see what happens. What in fact happens is we do get a stack trace, and this is the stack trace that we get. You can follow it all the way down from the main function, which is the top of the stack, to where the function is being slow, all the way to the bottom here in f1. And in fact, this function really does take most of the time: it's a Fibonacci number calculation, which is very recursive in nature. The interesting thing about this approach is that regardless of when you actually hit Ctrl-C during those two seconds, you get the same stack trace every single time. It doesn't even matter what the call order is: in our main function, whether you call f1, f2, f3 or, in reverse order, f3, f2 and f1, you still get the same stack trace back. And the reason for that is that f1 consumes essentially all of the runtime in this silly program.
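Here is a minimal sketch of the kind of demo program described above; the exact code from the slides isn't reproduced here, so the details are a reconstruction:

```python
# slow_demo.py: three functions called from main; f1 dominates the runtime.
def fib(n):
    # Deliberately naive, exponential-time recursive Fibonacci.
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def f1():
    return fib(34)  # adjust n so this takes a second or two on your machine


def f2():
    return sum(i * i for i in range(100_000))


def f3():
    return sorted(range(100_000), reverse=True)


def main():
    f1()
    f2()
    f3()


if __name__ == "__main__":
    main()
```

Run it with python slow_demo.py and press Ctrl-C while it is busy: the KeyboardInterrupt traceback that Python prints almost always ends somewhere inside fib, called from f1, called from main, which is exactly the call stack we were after.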
This technique works surprisingly well in a large variety of ad hoc situations where, for example, you have a pytest invocation that seems to take longer than it should. You can use the same trick: hit Ctrl-C, and you will often find that you have a fixture, for example, that is doing a lot of expensive initialization early on. So this is a really cheap way of discovering a performance bottleneck.

Compare this to cProfile; I want to highlight the cost of not having a stack trace. The same program run under cProfile produces this output. It is true, it does highlight the fact that it is our f1 Fibonacci calculation that is taking all the time, and it also highlights that it has a large number of calls. But what we don't have in this output is the relationship between f1 and main. The calls just appear in whatever order they're sorted: by total time, per-call time and so on. So we lose that link. It's easy to see the time spent or the number of calls made, but not why that call was being made. The other thing to keep in mind is that cProfile is quite expensive to run. For this particular dummy program, the original runtime was two seconds, but under cProfile it was 74 seconds. In larger, more complex systems, that overhead can really burden you when you're trying to figure out the performance behavior of the system. So, to summarize: cProfile is pretty expensive; it measures everything, even things you're not really interested in; and worst of all, when you see the numbers presented in a table like that, it feeds the temptation to focus on a micro-optimization, because you see that a particular function is slow and you want to make that one function faster, rather than thinking about the design of your entire software stack. However, cProfile redeems itself, because it does actually collect the information we need. It just doesn't automatically expose it in its default output.

So another neat trick, the second and final one in this section, is pytest-profiling, which is a pytest plugin. The key thing is that after you install pytest-profiling, the command-line parameter --profile-svg is really the magic. If you have a pytest suite, it's very easy to install the plugin and pass that flag. When you do, you'll see printed in the console output the path of an SVG file that's been created. That SVG file looks something like this: it gives you a large map of nearly all the functions that ran during the execution of your tests, with a color coding for cost. The warmer the color, the more costly that function call is, and the colder the color, the less costly. I'm going to zoom in on a diagram like this very quickly. It also gives you some of the stats that you usually get in cProfile output, for example the total time and the "own" time spent just in that function. But it also gives you the call stack, because the arrows show the relationships between all the function calls. So you can use those two pieces of information together to gather quite deep insights about the performance behavior of your system. However, bear in mind that this is a lot more expensive than Ctrl-C, because it uses cProfile under the hood, and cProfile is slow.
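Circling back to the point that cProfile does collect the caller relationships even though its default table hides them: here is one way to pull them out of a saved profile with the standard-library pstats module (a small sketch; the file name and the fib and main functions refer to the hypothetical demo program above):

```python
# First record the data, for example:
#   python -m cProfile -o slow_demo.prof slow_demo.py
import pstats

stats = pstats.Stats("slow_demo.prof")
stats.sort_stats("cumulative").print_stats(10)  # top 10 entries by cumulative time
stats.print_callers("fib")   # which functions call fib, and how much time is involved
stats.print_callees("main")  # what main calls, with the cost attributed to each callee
```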
Okay, we're about halfway through, and I think I'm doing quite well for time. The previous section was about when it's easy to run the code locally. In this section, it's more complicated. Some reasons might be that the code you're trying to analyze makes a lot of use of threads and sub-processes. It might be difficult to run a narrow part of the program, or you may not know which part to run. You may find in a particular scenario that cProfile is too expensive to run, and you're just not getting feedback fast enough after making changes. And finally, you may have significant code in native extensions, and cProfile doesn't show anything about what happens inside those.

My go-to in these scenarios, when the simple tricks no longer work, is a really magical tool called py-spy. It really is impressive. On this slide, these green check marks are basically a wish list of all the features I had wanted before I found py-spy. It's a sampling profiler. It can attach to a running Python process, so, for example, you could use it in a production environment if a process seems to be hung or in a frozen state: you can attach to it and get feedback. It has a very low performance impact. It can analyze sub-processes. It can dump call stacks on command, which we'll get to shortly. And it can also include stack trace information from native extensions, interleaved with the Python stack traces you are used to seeing.

So let's have a bit of a look at py-spy. It has three sub-commands: record, top and dump, and we'll take a quick look at each of those. The way to invoke it on the command line is with the py-spy CLI followed by a sub-command, which in this case would simply be record. And just like we saw previously, it's going to dump an SVG file. You can also add the --subprocesses flag, which will measure sub-processes as well. What that produces is a flame graph that looks roughly like this. I'm not going to go through it in detail, but I just want to give you a sense of what you can obtain with this tool. In practice, you can hover over each of these bars and get additional information and metrics: which function calls are being made from which other function calls, and their duration, based on the width of each bar.

Okay, moving on. py-spy top gives you a view very similar to the Unix top command, but instead of processes, what you actually see is functions and file names in your running Python program. As they consume more or less time, they change their order within the columns. On its own this isn't that convenient, because you're not really seeing a stack trace, just a measurement, but it is super convenient to be able to attach it to a running process in certain situations and get a very quick sense of which line the program is currently stuck on.

And this final sub-command, dump, is the one that is really useful. Again, you have a sub-command called dump, and you attach it by PID to a particular running Python process. What comes out is a list of all the threads, with the call stacks from all the threads in that running process. And as I said, you can produce it on demand: you can execute your program, or have a program running in a production environment, attach with the dump command, and get an immediate snapshot of what all of the threads are busy with. This is super useful for being able to jump in and get performance information really quickly, with very little ceremony.
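As a quick reference, the invocations look roughly like this (a sketch: the PID and file names are placeholders, and the exact flags can vary between py-spy versions):

```
py-spy record -o profile.svg --subprocesses -- python myapp.py   # profile a fresh run into a flame graph
py-spy record -o profile.svg --pid 12345                         # or attach to an already-running process
py-spy top --pid 12345                                           # live, top-like view of the hottest functions
py-spy dump --pid 12345                                          # one-shot call stacks for every thread
```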
And finally, I just wanted to highlight some other features that the py-spy dump command has. It can show local variables and include them within each frame in those call stacks I showed in the previous diagram. And you can add the --native parameter to include call stacks from native extensions. Those do need to be compiled with an additional flag, and that is documented in the py-spy documentation. But that is how you can have the call stacks from Python and from your native extensions interleaved together in that live dump view.

Okay, and finally we're on to the last section, and this is the most difficult one. In fact, it turns out to be the easiest one after all, because you don't have very many options for getting good information: there's really only one path to go down. Just to clarify what I mean by a distributed system: you have a bunch of services, running perhaps on a bunch of computers, making network calls between themselves. The lifetimes of these services are all different, they deploy differently, and so on, but they work together to provide some overall service. It is much more difficult to run code at will here unless you have a very small system. Services depend on other services, and those can depend on yet other services, so it's usually difficult even to reason about the system architecture, and setting up a full dev environment for such a thing can be hard or even impossible. And worst of all, the performance behavior can be non-deterministic, where the slow behavior only happens sometimes: maybe only when one particular user does something, or only when one kind of file is being processed by the software, something like that.

Nevertheless, the same golden rule applies. The fundamental thing we wanted, even in the simpler cases, was a call stack, because we wanted to reason about what the entire system was doing when we observed slow behavior. In this case, though, what we really want is a call stack that operates across distributed network services. We want to know which service called which other service and how long each of those calls took, and we would ideally like to see that as an ordered relationship, like a tree. Distributed tracing provides that call stack, and it gives you timings for each of those pieces. I have a definition here which I've stolen from Honeycomb: it's a technique used to monitor and observe requests as they flow through distributed services; distributed tracing provides visibility into each request and all of its subsequent parts. There is an industry standard that has been built up around these ideas, called OpenTelemetry, and I have the link here. There are many vendors that provide tools for this, but at Capiche we're using Honeycomb, and I'm going to use Honeycomb in a couple of examples.

So the key thing you get from a distributed tracing tool is a view like this. You get a tree on the left. Right at the top here, our trace begins from a Django endpoint, which is the entry point into our system from a front end, like a React or Vue application, and from this point on, that Django service is going to call other services in order to accomplish some goal.
I've highlighted here in red that the service name changes throughout the call stack. In this example we have three services, but it can be more or fewer, and these time-based bars that run across tell you how long each particular step took; you can see them layered in a nested fashion.

There is a cost, though, for all of that power. The cost is that this data needs to be collected continuously: you can't just turn it on in an ad hoc way as we saw with the other tools. You have to add instrumentation to your software, and all of these trace events have to be transmitted to, in my case, the Honeycomb servers, so that they can be rendered on demand. Setting this up is fairly straightforward. You typically initialize the system by providing an API key, which you get from the service, and for frameworks like Django, Flask and FastAPI there is middleware that can be configured to automatically instrument at least the REST API part of the stack. Then, inside your code, you add additional things like this. In the case of Honeycomb, they have the beeline client library, which gives you two main tools to work with: a decorator that you can attach to functions, and a context manager that you can use to scope discrete blocks of code. Those two things work together to send trace events to Honeycomb; I'll show a small sketch of what that looks like at the end of this section.

I have a quick case study here from Capiche, where customers upload spreadsheets containing customer feedback data and we then process and analyze those. This screenshot is in fact taken from a real issue in our issue tracker, where we were trying to process a CSV file and it was incredibly slow. It might be too small to read, but this first big bar at the top is about encoding detection: literally just trying to detect the text encoding of the uploaded file. The second one is generating a CSV schema from the uploaded file. We found that we had a really problematic code path that was producing very slow results. And from when this issue was reported to its resolution took less than one workday, because we had these traces available: I could immediately jump to the trace in Honeycomb and see right away that we had not one but two separate problems, and we were able to fix both of them and resolve it really quickly.

Distributed tracing is extremely powerful, and I have a confession to make: I use this tool even during software development, when I'm writing a new feature or working on a bug, because all the code is already there; the instrumentation is already in my software. So as I'm writing new code, I just configure an environment on Honeycomb for my own personal use, send events to that, and analyze the performance during my development process using those really nice graphics. In particular, in dev I trace all of the SQL queries, especially for Django and the Django ORM, because it's really easy to emit way too many SQL queries if you're not careful about how you control query execution with nested models and so on.
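To make that instrumentation concrete, here is a minimal sketch of the Honeycomb beeline setup described above (the write key, dataset name and function bodies are placeholders, and the exact arguments can differ between beeline versions):

```python
import beeline

# Usually done once at application start-up (for Django, in an AppConfig.ready hook).
beeline.init(
    writekey="YOUR_HONEYCOMB_KEY",   # placeholder API key
    dataset="my-app-dev",            # placeholder dataset name
    service_name="upload-service",   # placeholder service name
)


@beeline.traced(name="detect_encoding")
def detect_encoding(path):
    ...  # each call to this function becomes a span in the trace


def process_upload(path):
    with beeline.tracer(name="generate_csv_schema"):
        ...  # scope a discrete block of work as its own span
```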
Okay, so in summary: remember, my advice is that when you're trying to analyze a performance problem and you're coming to it from a blank slate, what you really want to get is a call stack, because you want to see what is causing the program to be slow through the entire stack, right from the start of the program. When possible, do give Ctrl-C a try. If you have something running slow on your machine and you want to figure out what is making it slow, hit Ctrl-C, get an immediate stack trace, and try to make sense of what's happening. If you need a bit more detail than that, if Ctrl-C is producing confusing results or you have multiple issues you're trying to figure out, pytest-profiling is a much more comprehensive approach, which will give you that large SVG call tree that you can use to check the performance of many functions at the same time. If you have special needs where you can't run the code at will, py-spy is a really great tool that can attach to a running Python process with very little performance impact on that process, and it can give you real-time information, including cumulative stats through the top command; the dump command can also give you call stacks from all of the threads in real time. And finally, if you have a distributed performance issue, where it's quite difficult to isolate the code to a single system, you will have to use a tracing system like Honeycomb to analyze performance, and that will require upfront instrumentation of your code in order to emit those events. I know that I have left out a huge number of tools, and I'm sure other people would have their own suggestions for what works well for them, but this is the set that I currently use. So these are my go-to options for the different scenarios I've highlighted, and that is the end of my talk. Thank you everyone for attending.

Thank you very much, Caleb, for sharing your tools and all these inspiring tips with us. We have a short amount of time, and since this is a remote event, let's check if we have a remote question. I see something, no? Okay, then we can also take local questions if somebody in the audience has one. If you do, we have a microphone over there where you could go and ask one. And let's see if I see a question. I do not... I see somebody running to the microphone. Please go there quickly. Thank you.

Hi, thank you for this work. Yeah, a bit closer. Go closer? Yeah, right. So thank you for your work, for your talk. My question is with regard to this tracing system: you mentioned Honeycomb; are there any other alternatives that you know of that we can check out?

Many, there are many. If you just Google distributed tracing, you'll find many of them. There are also open source systems that you can self-host. Off the top of my head, Jaeger is one. Datadog provides a system. I think Sentry now also provides distributed tracing; I say that perhaps under correction. But there are many, and new ones are coming out all the time. So probably what's better for you to search for is not distributed tracing, but OpenTelemetry. That's really the key thing, because many vendors have joined forces to decide on a common interoperation protocol for distributed tracing. So that's the thing to look up: OpenTelemetry.

All right, thanks. So thanks for your question. And that's about all the time we have. So let's have another round of applause for Caleb.