Hi everyone, nice to meet you. I'm from Israel. Oh, that's awesome. What's the weather like in Israel? It's amazing. Oh, that's pretty cool. Yeah. All right, so I'm going to let you get started. All the best. Thank you.

So hi everyone. I hope you're enjoying all the amazing talks that have already taken place today. I'm very excited to speak at this amazing conference, and I'm going to speak to you about practical optimizations for your pandas code. About me: I'm a software engineer at Salesforce, I have a big passion for Python, data, and performance optimization, and you can find me sharing knowledge on Twitter and on Medium.

So let's begin with a brief overview of what pandas is. Pandas is a library for data manipulation and analysis. It loads the data into RAM, it's widely used, and it has a vast ecosystem, which makes it a great tool for your data science projects. But the next question is: why should we even care about performance? Obviously, fast is better than slow; we don't want to wait for our program to execute. In addition, memory efficiency is good: we are terrified of either the out-of-memory error or our AWS bills, so by using more efficient machines, or fewer machines, we can save money. And even if we are willing to spend that amount of money, hardware will only take us so far.

So now I hope I've shown you why we should care about performance. But the next question is: should we care about performance always, or only in certain scenarios? So let's talk about when we should care. First, as developers and as data scientists, we should write our code to be clean and maintainable; performance optimization should come later on. And obviously, we should do performance optimization only if it affects our users: when the program doesn't meet the requirements, either because it doesn't run fast enough or because it fails with an out-of-memory exception. Another good reason for performance optimization is when the development pace is hurt because your program runs once an hour instead of every 10 minutes. You should really try to optimize only the parts of the code that are troublesome, which can be done by identifying our bottlenecks using profiling; I'm going to cover that briefly in this talk, but obviously it's a big topic. In addition, when we refactor working code, we want the code to be well tested, because code is useless if it runs fast but gives us wrong results.

Okay, so this is a nice meme about the difference between a snail and Python, where the enterprise employee can't find the difference. But I'm here to tell you that you should have faith in Python and in pandas, and that you can gain a significant performance improvement if you follow what I'm going to show you. I did this analysis on a data set of meal invoices, about one million rows. It's not a big data set, but it lets me show you the issues and the tricks.

So let's begin and understand how we can optimize our code. I'm going to order the tips by the ones that are the best value for money: either they are the easiest or they give us the biggest boost. First, we should only use what we need, and by that I mean we should keep only the columns that we use. So if I'm not going to use the day column, I can remove it, and by that I'm saving a lot of the memory consumed by pandas.
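For example, here is a minimal sketch of that idea, with hypothetical file and column names rather than the exact data set from the talk:

```python
import pandas as pd

# Hypothetical file and column names: load only the columns we actually need
# instead of pulling the whole invoice file into memory.
needed = ["invoice_id", "meal_price", "tip"]
df = pd.read_csv("invoices.csv", usecols=needed)

# If the data is already loaded, dropping unused columns (like the day
# column mentioned above) frees the same memory:
# df = df.drop(columns=["day"])
```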
Also, if there are rows that I don't need, I should filter them out first, and by that I reduce both the memory footprint and the execution time. We should be like this little fella and use only what we need.

The next thing is that we should not reinvent the wheel. We have a vast ecosystem, with NumPy, SciPy, and many other libraries, and we should use these existing solutions because they have fewer bugs and they are highly optimized. So instead of writing our own KNN implementation, we should probably use the SciPy or scikit-learn implementation.

Now we're going to delve into some more details about optimizations. Pandas really hates loops; it's highly optimized for vectorized calculations, which means computing an entire column at the same time. The obvious option, though, is to simply iterate over our DataFrame and do some calculation on a row-by-row basis, either with iterrows or a regular loop. As you can see here, I'm just subtracting the tip from the price to get the original price, and this takes 35 minutes to execute, which is a lot. I'm using the timeit magic throughout this talk to show how long a method takes to execute; this one took so long that I only ran it once. So let's look at how we can do better. Instead of using iterrows, I can simply use the apply method. The apply method takes a function, and I can do something very similar: for each row, I do the same subtraction and return the result. As you can see, from 35 minutes we got down to 22 seconds, which is a major, major change; just by doing that, we got roughly a 100 times improvement in execution time. So iterrows and regular loops are evil and we shouldn't use them. But there is an even better option: we can use the vectorized methods of NumPy, pandas, or SciPy. Here I'm doing the subtraction on the entire column, and not only does it look cleaner, it is much, much faster: instead of 22 seconds, I'm getting two and a half milliseconds, which is roughly an 8,000 times improvement in execution time. And I don't want to remind you how long it took with iterrows. So we should really use vectorized methods to optimize our code further.

The next optimization I'm going to talk about is picking the right types. Let me show you the motivation: I simply create a NumPy array which contains the number one over and over again, then create DataFrames from this exact same array with different dtypes (object, float64, int64, int32, and so on) and check the memory usage of each. As you can see, the exact same array can take 80 times more memory with object than with Boolean. So obviously we should try to improve this. Let's see how the data looks in my specific example. I can look at the DataFrame size using the .memory_usage method, and as I can see, we have 47 megabytes of data. If I want, I can also see how the memory is distributed across the different columns, in case I want to optimize only some of them, but I advise optimizing all your types because it's easy. So how can we do it? Before I show you how, let me show you the supported types in pandas: we have ints, we have floats, we have Booleans, we have objects, we have datetimes, and we have a few more types that not many people know about.
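Before going deeper into the types, here is a minimal sketch of the three approaches compared above (iterrows, apply, and the vectorized column operation), with hypothetical column names standing in for the ones on the slides:

```python
import pandas as pd

df = pd.DataFrame({"price": [12.0, 20.0, 7.5], "tip": [2.0, 3.0, 1.0]})

# 1) Row-by-row loop with iterrows: by far the slowest option.
original = [row["price"] - row["tip"] for _, row in df.iterrows()]

# 2) apply with axis=1: still a Python call per row, but much faster here.
original = df.apply(lambda row: row["price"] - row["tip"], axis=1)

# 3) Vectorized subtraction on whole columns: the cleanest and fastest.
original = df["price"] - df["tip"]
```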
We have the category type, which is a great type when the same value occurs over and over again. We have sparse types, which are great when most of the array contains NaNs. And we have nullable integers and nullable Booleans. By default, NaN is a float, so even if I have a column that should be Boolean and it contains a single NaN, it will be cast to float; instead, I can use the nullable integers and nullable Booleans to avoid that.

So how can we optimize the types? The best solution is to load the DataFrame with the specific types we want up front. If that's not enough, we can use the astype method, or the to_numeric and to_timedelta functions, which have a downcast parameter that can figure out whether a column should be, say, an int64 or an int32. In our example I used astype to emphasize the type change: I decided that the adjustment column should be a Boolean and that the meal price can be int16 instead of float64. And just by doing that, the memory usage was reduced to 3.7 megabytes, and we can see how the memory is now distributed, which is roughly a 12 times improvement in memory. So we should really try to pick the most specific type for our data set. In addition, if we don't find a good enough type, we can create our own custom types: pandas provides the ExtensionArray API for that. Obviously it will take you a lot of time, and I suggest doing it only if you are a very experienced pandas developer, but there are open source extension types: for example, there are types for IP-address-like objects in the cyberpandas library, and geospatial types in geopandas. This is a nice meme about types.

Now I'm going to talk about some pandas usage: which functions you should choose over others. I'm not going to cover all of them because it's a huge topic, but let's begin. We can process our data chunk by chunk: we split our data into smaller parts and then run the code on each chunk. By doing that, we can work on data sets that are much bigger than our memory. It would look something like this: we have a huge file, we read it chunk by chunk, and then we do some processing on each chunk. It's important to know that this is a great optimization to reduce the memory footprint, but it only works if there are no interactions between the different chunks.

Another thing is that we can optimize the way specific functions, like mean and sum, are computed. The type, as we said before, can be our friend: if I calculate the mean of a specific column, it takes 96 milliseconds when the dtype is object and only 4.2 milliseconds when it's float. So we gain over 20 times performance improvement, again just by knowing the type.

DataFrame serialization. Many times, when we are loading and saving a DataFrame, we pick the obvious format, which is CSV. CSV is great because it's readable and supported by many frameworks, but it is not the best in loading and saving time. So if your bottleneck is the loading and saving time, you should probably use a different format. The disk space will be affected as well, but these days that's a bit less crucial. I'll share the link to this benchmark; as you can see, CSV takes longer to save and to load than formats like Feather. Feather is an Arrow-based format, and you can use PyArrow as well.
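Two of the tricks above are worth sketching in code. The column names, dtypes, and file paths below are assumptions, not the exact ones from the slides. First, the type optimization with astype:

```python
import pandas as pd

df = pd.DataFrame({
    "was_adjusted": [0, 1, 0],                          # really a yes/no flag
    "meal_price": [12, 20, 7],                          # small integers
    "meal_type": ["breakfast", "dinner", "breakfast"],  # few repeated values
})
print(df.memory_usage(deep=True))   # per-column memory before

df = df.astype({
    "was_adjusted": "bool",       # 1 byte per value instead of 8
    "meal_price": "int16",        # no need for int64 / float64 here
    "meal_type": "category",      # repeated strings stored once
})
print(df.memory_usage(deep=True))   # per-column memory after
```

And switching the serialization format is usually a couple of lines (Feather needs pyarrow installed):

```python
df.to_feather("invoices.feather")         # much faster to save than to_csv
df = pd.read_feather("invoices.feather")  # and much faster to load back
```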
And as you can see in the benchmark, these formats do much better than CSV.

Next, we should really try to use query and eval, which use numexpr under the hood. numexpr is a library for evaluating numerical expressions. It can improve our execution time, as I'll show in the next example, and, more importantly, it will improve our memory footprint: when we execute a NumPy expression, it creates intermediate results, which increases the memory footprint, and this library knows how to avoid those intermediate steps. But not all operations are supported, because it's a relatively new library, so you should use it mostly for numerical calculations. In this example, I'm selecting the rows that correspond to breakfast invoices. As you can see, with the regular method I get 100 milliseconds, as opposed to 80 milliseconds using query, so we gain a 20% performance improvement just by doing that. I do want to warn you that this optimization pays off only on reasonably big data sets; on small data sets you will probably get the opposite effect.

Concat and append. When we want to add rows to our DataFrames, we will probably use one of these. Append creates a new DataFrame object on every call, so if we have a for loop and we are adding a lot of objects to the DataFrame, we should avoid using append; instead, we should append to a plain list and build the DataFrame once at the end.

If sorting is the problematic part, we can use a different sorting mechanism. We have pandas, NumPy, PyTorch, and TensorFlow. Basically, I advise you to use the pandas sorting; we can use the kind parameter to tell it which algorithm to use. The default is quicksort, but we also have mergesort and heapsort. In addition, if you have a GPU machine, you should probably use either TensorFlow or PyTorch for the sorting.

Group-bys are very CPU-heavy calculations, and there are some easy techniques you can use to reduce the execution time. The first one is, as we said, to filter as early as we can, so that we iterate over less data. In addition, custom functions are very slow, because they miss the optimizations done at the C level of pandas and NumPy, so if possible you should extract parts of your custom functions into vectorized methods or other techniques, and keep the custom functions as small as possible.

Another CPU-bound operation is merge. We should, again, filter or aggregate early in order to reduce the size of the data set. Sometimes it is more performant to first filter the DataFrame records, for example with an isin on the join keys, instead of relying on the merge to drop them; this works best if you know that many records are going to disappear when using an inner join. Another technique that can help is to join on the index.

So that's it for the pandas techniques. Now about compiled code. Why should we even care about compiled code? We are in Python. Python by default has a dynamic nature, where we don't specify types in advance, which means some compilation optimizations are missing for us. So for some operations, pure Python can be slow. I'll show you an example: I wrote a function which basically accumulates numbers until it gets to n, and this takes about 18 seconds to execute. For that we have Cython and Numba, which give us compilation optimizations for our Python code. I'm going to start with Cython. Cython can give us up to 50 times speedup over pure Python.
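The exact function from the slides isn't reproduced here, but a rough stand-in for the kind of pure-Python loop being timed might look like this:

```python
# A rough stand-in (assumed, not the exact slide code) for the slow
# pure-Python baseline: accumulate the numbers 0..n-1 one step at a time.
def accumulate(n):
    total = 0
    for i in range(n):
        total += i
    return total

accumulate(10_000_000)  # every iteration pays the Python interpreter overhead
```

This is the baseline that Cython and Numba compile below.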
Cython has a steep learning curve, because it sits between C and Python, and C is not the easiest language to learn. To integrate Cython into our project we need to compile the code: put it in a .pyx file and wire it into setup.py, which makes it a bit harder to integrate, but the good thing is that the compilation is done ahead of time. I'll show an example for our method. As you can see, I'm just adding the types; for a more complicated function it would be harder to follow. And just by doing that, we get down to 360 milliseconds, which is a 49 times performance improvement over the pure Python code.

In addition, we have Numba. Numba helps us compile the code using the LLVM compiler, and it can give us up to 20 times speed improvement over pure Python. It's really easy, as you will see in a moment: it's just adding a decorator to our function. It's highly configurable: if we want to make the function parallel, we can simply add a parameter for it, and it will handle all the GIL stuff for you. Debugging can be easy for the Python part, but if you have issues in the compiled code, it won't be fun. It is a newer project than Cython, and thus it supports mostly numerical calculations and a little bit of string-like operations. In our example, the only thing I needed to do was add the decorator that marks the function for the just-in-time compiler. If I want it to be parallel, I simply add parallel=True, and that's it. Just by doing that, I'm getting 440 milliseconds, which is a 43 times performance improvement. For more complicated functions it might even be faster than Cython, and we can create vectorized methods using Numba as well. So if you ask me, you should first try the existing vectorized methods of pandas, NumPy, and SciPy, then Numba, because it's easier, and only if those don't help should you use Cython.

We can benefit from general Python optimizations as well, since we are in the Python ecosystem. We can use caching to avoid unnecessary work and computation, which will make our code run faster, and where possible we can use similar techniques to make our calculations incremental. In addition, we can reduce the memory footprint by using intermediate variables in a smarter way: when we do an intermediate calculation, we hold the memory of both objects, and just by smarter variable allocation we can save a bit of memory. As an example, I'm using the memit magic. As you can see, the peak memory is eight gigabytes when I allocate both the data variable and another variable, and if I simply overwrite the data variable instead, I only get seven gigabytes. But again, if it's not your bottleneck, I wouldn't play with it, because we should first optimize for readability.

Another thing we can benefit from is concurrency and parallelism. Pandas runs in a single process, and as you can understand, a CPU-bound program can benefit from parallelism, while an IO-bound one can benefit from parallelism and concurrency. The reason I put this so late in my slides is that you will probably gain around four times improvement while making your code much, much harder to maintain. But if it's the bottleneck and it affects your users, you should probably use it. And there are many more techniques for Python optimization; there is an amazing book on the topic by Micha Gorelick and Ian Ozsvald.
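Going back to the Numba recipe for a moment, here is a minimal sketch using the same stand-in loop. The njit decorator, prange, and parallel=True are the standard Numba API; the function body is just the assumed example from before:

```python
from numba import njit, prange

@njit(parallel=True)        # JIT-compile with LLVM; parallel=True enables prange
def accumulate(n):
    total = 0
    for i in prange(n):     # Numba turns this reduction into a parallel loop
        total += i
    return total

accumulate(10_000_000)      # the first call compiles; later calls run at full speed
```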
And if all of these do not suffice for you, you should probably use a different library for data frames. You have Modin, which is a wrapper around Dask and Ray. We have cuDF, which is pandas for GPUs. And we have PySpark. But obviously there is no free lunch: every one of these frameworks has its own limitations and issues. For example, cuDF, which runs on the GPU, can be slower on some operations like fillna, and obviously installing it is much harder and it's less featureful. So when you are picking a framework, you should really strive to use the one that best fits your use case.

So let's go over the techniques that I listed. We should only use what we need; we shouldn't keep columns or rows that we are not going to use in our data manipulation. We shouldn't reinvent the wheel: there are many smart people who have already implemented most of the common algorithms. We should use vectorized methods. We should pick the right types for our pandas DataFrame to reduce the memory. We should know which pandas methods to use to reduce the execution time. We can benefit from compiled code as well. There are general Python optimizations that we can do to achieve better performance; I think the next talk in this track speaks about that. And if nothing suffices, maybe you should use a different framework. I've also listed additional resources; if you're interested, there are workshops on Numba and more. And my best suggestion is to watch the vectorization mindset talk, which can make your life tremendously better: if your code executes 10,000 times faster, you don't have to wait for it anymore. And that's it. If you have any questions, feel free to ask me now, or later on Discord or LinkedIn. I hope you enjoyed the talk. Thank you.

Hey, I think we have time to take questions. Do you want to take them now? Sure. All right. Okay, so the first question is from Pascal: it seems like the map and apply functions in pandas serve the same purpose; maybe you can give a recommendation as to when to use one over the other. So they are similar. The difference is that apply runs on either an entire row or an entire column. I would use map if I have a function that works on only one column, and apply if I have a function that works on two columns. In addition, it's important to note that when we are using map and apply, we shouldn't call vectorized methods inside them, because those are optimized for the entire array rather than for a single value.

Awesome. So the next question is from Francesco: why not just use a SQL database instead, if performance is critical and the size is large? So it really depends. SQL databases have a lot of optimizations in them, but there are many issues with them as well. Because pandas is in memory, the calculation is much, much faster; just reading the data out of the database will take you longer than computing the result with vectorized methods. For huge, huge data sets, pandas probably won't survive unless you can do chunking, and then you should really think about whether a SQL database is the best fit for you, or whether you should go to a framework like Spark.

All right, so the next question is: what tools do you use to profile your code and find the bottlenecks? This is from James. Okay, I didn't cover that much, so I will try to give you a brief explanation.
We can use cProfile to profile the CPU part of the code. The downside of using it is that the profiling itself makes the code much slower, something like 100 times slower, so if the code is slow to begin with, we should use a statistical (sampling) profiler instead. To visualize the output of the profiler itself I'm using SnakeViz; I will share a link later on. And for memory, I'm using either the Fil profiler or the memit magic that I just showed.

Okay, so the next one. This is not a question, but Simon says Dask is a good alternative for big DataFrames. So I covered it a bit. Dask is amazing, it's pretty mature, and I briefly touched on it. Modin is a library that wraps the pandas API, and there are engines inside it: one of them is Dask, and the other one is Ray. I have only good things to say about Dask, but the nice thing about Modin, compared to raw Dask, is that if Dask doesn't support one of the features of pandas, Modin will fall back to the pandas operation. So basically you can swap in the framework much more easily than with raw Dask.

Okay. Also, Peter asked if there is a downloadable version of your slides in PDF format. I think it should be in the talk, right? Yeah, yeah. I think it's in HTML form, but I can add a PDF version as well.

All right, so Daniel asks: what are the limitations of Modin? So there are many; it really depends on two things. First, which engine you use, Dask or Ray, and then you have the limitations of that engine. In addition, it's a pretty new library, so in my personal use, sometimes it just fails for no reason, but they are working on making it more stable and work better. As you can imagine, Dask and Ray use multiprocessing techniques, and if you have, say, 10 machines, you will get at most about 10 times performance improvement, whereas, as you saw, by using vectorized methods you can get to almost a million times performance improvement. So obviously I would try to use those only if pandas doesn't fit my use case.

Also, the next question is from Krishna: at a production-level deployment, which would be better, pandas or PySpark? Does pandas have an implementation for distributing it across multiple clusters? So the solution for distributing pandas across different machines is Dask. And I think pandas and Spark are for different use cases: pandas is for medium data, when you want really effective data transformation, because you load all the data into memory, and Spark is great when you have really huge data sets. So if your data is bigger than, say, 20, 50, or 100 gigabytes, pandas will probably not suffice for you. But then you will have different problems, because most machine learning algorithms are supported only on a single machine, and Spark's machine learning is less featureful, to say the least.

Okay, so the next question is from Mark: does PyPy enhance pandas and NumPy usage compared to vanilla Python? Can you repeat the question? Okay, does PyPy enhance pandas and NumPy usage compared to vanilla Python? So I haven't used PyPy, to be honest. Maybe it does, maybe it doesn't; I don't know.

Okay, awesome. So Deepak asks: sometimes we want to use isin versus groupby.get_group; do you have any suggestions on which one to use, which is better? Can you write it in the Q&A? I want to see the...
Okay, so if the questions are going to be more extensive or require discussion, you can reach out to the speaker on the breakout channel, which is dedicated entirely to talking about this presentation, so we can take things there too. All right, so the next question, and I think this will be the last one: is there any way to clear all objects at the end of the code, like clearing a cache? So you can call Python's garbage collector yourself, but I specifically don't do it, because it makes the code less maintainable and readable. If I were you, I would try to understand which part of the code takes the biggest memory footprint and just adjust the types or split it into chunks. I've worked with a lot of memory-bound programs and I've never had to invoke the garbage collector on my own.

Okay, do you hear me? Yeah, yeah, sorry. What software did you use to make your slides? It's RISE, right? So I used Jupyter Notebook. Jupyter Notebook has a reveal.js plugin which allows you, with a simple click, to create slides. In addition, there are other nice benefits: the code can be executable as well, if I want an interactive slide, and you can turn the Jupyter Notebook into a Medium post quite easily too. Isn't that package called RISE? So the package is called RISE, and basically it uses reveal.js behind the scenes. If you install Jupyter via Anaconda, you get it for free: you can simply click File and then you have an export section for whichever format you want to convert the notebook to, whether a reveal.js slideshow, PDF, or a regular Python script.

All right, could you unshare your screen for just a minute? Yeah, one second. Yeah. Oh, did you stop sharing your screen? Oh, we can still see it. Did you want me to stop sharing? Yeah, yeah. Okay. Is there any problem with it? It's fine, I just wanted to play the applause soundtrack for you. Awesome. Okay, so I think we have more questions, but it's lunch break now, so we can take the rest of the questions to the break room. Okay, thank you. All right. See you. See you all after lunch. Goodbye.