Thanks. Hi, my name is Łukasz Konkol, and I will speak about performance optimization and profiling. As examples I will use comparisons of Python code and patterns inherited from other languages like C, C++, and Java.

Okay, let me introduce myself a bit. I'm working as a senior Python developer at STX Next, a Python software house located in Poland. I have also given talks at PyCon UK last year and at PySS, a Python conference in San Sebastián in Spain, and I have shared my knowledge during local meetups in Poznań in Poland. But what's most relevant for this topic: I'm really interested in performance optimization, and I'm really fascinated by it.

Okay, let's check the agenda. First I will shortly introduce you to the topic of performance aspects, then we will have a live demo, then I will summarize the talk shortly, and at the end we will have some time for questions.

But before we get started, let me get to know you a little bit better. Who has ever used any profiling tool? Okay. Who is using one frequently, like on a daily basis in their job? Okay, not that many. And who is just starting their journey in Python? Okay. And who has used other languages before Python, like C, C++, or Java? Okay, that's a lot.

Okay, so let's get started. What are the basic aspects of performance? First of all, it's CPU. We all want our code to be as fast as possible, and that's what our clients and end users care most about. Google statistics say that more than half of users will abandon a website if it takes more than three seconds to load. Moreover, each additional second causes customer satisfaction to drop by about 15 percent and conversion to drop by around five to ten percent.

Another thing is memory. The truth is that we don't care about memory until we run out of it, and I must admit it's a fair approach, because why should we care about something that is not affecting us?
But we should still be aware that any memory leak can affect our software. For example, on a daily basis in testing environments we don't use as many fixtures as we would have in production, so scale makes a real difference here.

The last factor is input/output operations, for example database operations or the file system. These may massively influence the first aspect, the time of loading or running an application, but they also have other side effects. There are specific limits on simultaneous input/output operations: a database has transactions which may lock and block each other, and the file system has a limit on the number of simultaneously open files. Moreover, some services like Google App Engine provide resources like a database up to a certain quota; beyond that limit, each read and write to the database costs you real money.

What should we do to optimize our code in a good way? First we have to plan and predict: the approach, the implementation, and the data structures we will use, and predict which one will give us the best results. Then we should profile our code. It would be good to provide a few proofs of concept, if we can, using some fake data just to indicate which option would be the best.

But that's not all. Once we ship our code, we should monitor its performance in the living ecosystem. Keep in mind that you will not be able to test every data set. You can be almost sure that your end user will have some edge case you could not invent in your testing environment, and note that this may also be related to the tech stack the user is using.

Once you have production benchmarks collected, you should identify bottlenecks and quick wins; it really depends on each specific case. You should then take another look at optimizing your code, starting with the quick wins and bottlenecks. But should we optimize everything? Of course not. There is no point in optimizing something that will give
you like a 5% gain when it will cost you a few weeks or even months of work.

Now, about profiling tools: there are a lot of them available on PyPI. I will just introduce a few of them, which I will use during the demo. The first one is cProfile. It inspects CPU usage divided by functions, it's part of the standard library, and it's the most accurate of the available tools. The next one is memory_profiler. That's a third-party library, available on PyPI. It's better than other options, but it's still not perfect; results may vary a bit, which I will show you later during the demo, and it also takes some time to profile the memory. The next one is sys. It's a built-in library that provides us with a low-level operating system API, which we can use to inspect, for example, the CPU usage of the process or its memory. And the last tool is dis. It's also a built-in library; we can use it to disassemble Python code and inspect it at a lower level.

Okay, so now it's time for the demo. Is the font size okay? Okay, so let's start with the first example. First I will show you the usage of the tools I will use during the presentation. The first one is the function which I will use to profile CPU usage: I'm creating a list, I'm creating a second list, and then I'm deleting the first list and returning the second list. Okay, and how do we profile it?
We should import cProfile, then import the function which we want to profile, and provide the call as a string so it can get evaluated. We can see that six functions were called during the execution of the main function, and that it took 0.02 seconds, and we got the number of calls of each function. In the last column we have the file name, line number, and function name. We have the number of calls of each function, the total time of all the calls, then the time per single call (the average time), then the cumulative time, which is measured from the start to the end of the function, and the cumulative time per call. We can also pass some additional arguments to cProfile, but I will not do that for now; I just want to show you the simple cases of CPU profiling.

Another tool which we can use for profiling CPU is timeit, but it's not as accurate as cProfile, mostly because the garbage collector is not run during timeit measurements. So let's run the second example. Okay, we got 1.85 seconds. That's the time of running the 100 iterations of this example.
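A minimal timeit sketch of this kind of measurement — the statement, setup string, and iteration count below are illustrative, not the demo's exact ones:

```python
from timeit import timeit

# `stmt` is the code under test; `setup` runs once beforehand and is not timed.
total = timeit(
    stmt="sum(data)",
    setup="data = list(range(1000))",
    number=100,
)
print(f"100 runs took {total:.6f} s, i.e. {total / 100:.8f} s per run")
```

From the shell, the equivalent is `python -m timeit -s "data = list(range(1000))" "sum(data)"`.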
So what we provide here is the real function to measure, and the setup string, which will also be evaluated before running the function we profile. timeit is also available from the command line with `-m timeit`; there we should provide the setup string and then the string with the real statement to profile. This gives us a slightly more readable result: we got the information that it ran 100 loops, and the best of three of those loops is 18.5 milliseconds. We also have a nice plug-in for timeit in IPython. We just use the percent sign, `%timeit`, and here we have an even more user-friendly way of showing the results, including the standard deviation. So we got the average time here and the standard deviation here; it was run in 100 loops, seven times.

Okay, so that's about CPU performance. Now let's go to memory profiling. On the function we need to profile, we should apply the decorator: we import memory_profiler first, then use the profile decorator from that library. We have a few optional parameters, like precision. It takes a bit longer, because it probes the memory usage of the process after each line. So here we have the code we're profiling, the starting memory usage, and the increment and the total memory usage. You can see that the first list took about 1.85 megabytes and the second one took 1.87 megabytes, and then, when we actually release the first list, we get back one megabyte of memory. What may that mean? That the garbage collector hasn't run yet, or there is one more thing: some small integers are cached by Python. These are just a few of the reasons, but there are definitely more. Moreover, we should keep in mind that it's just probing, so it might not be one hundred percent accurate.
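The talk uses the third-party memory_profiler package; a comparable measurement can be sketched with the standard library's tracemalloc instead — this is my stdlib substitute, not the demo code, and the list sizes are illustrative:

```python
import tracemalloc

def create_and_release():
    first = list(range(100_000))
    second = list(range(100_000))
    del first               # releasing the first list should give memory back
    return second

tracemalloc.start()
baseline, _ = tracemalloc.get_traced_memory()
result = create_and_release()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"peak growth:    {(peak - baseline) / 2**20:.2f} MiB")     # both lists alive
print(f"current growth: {(current - baseline) / 2**20:.2f} MiB")  # one list alive
```

Unlike memory_profiler's per-line probing, tracemalloc tracks Python-level allocations directly, so it avoids some of the sampling noise mentioned above, at the cost of per-line detail.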
Okay, another example will be the greatest common divisor. We will disassemble it and see what the result is. All these opcodes are listed in the documentation of the dis library. So here we're setting up the loop and loading the y variable; then, if y is false, we jump to offset 24, over here. In the next line we're loading x and y, applying the binary modulo operation, and storing the result in a temp variable. The next line is the assignment of y to x, so we're loading y and storing it in x. The next line is the assignment of temp to y: we're loading temp and storing it in y, and we reach the end of the loop, so we jump back to the second line over here. Then we're loading x and returning it. So that's a simple example of disassembled Python code; I will also use it later for more interesting examples.

Another thing I want to talk about is the reliability of profiling. We'll call both the CPU profiling and the memory profiling three times and see the results. Yeah, we got three different times. Keep in mind that this might depend on many factors, like other processes running in the background. I tried to isolate this virtual machine as much as I could, but the results still differ between runs. About the memory, we also got different results here. The first reason behind it is, as I said before, Python's caching, and it's mostly visible in the difference between the first call and the other two. So here we got a different result, and here we got the same result; at the end the increment still differs, but that's because of caching in Python.

Okay, another example is creating lists: by list comprehension, by appending each result, and by extending the list. Definitely the fastest is the list comprehension. Four times slower is creating the list by appending each subsequent item, and a bit faster than that is extending the list. So whenever you have the possibility to extend, use it: it's a small gain, but at scale it can really make a difference.
The next example is a comparison of data types: tuples, lists, sets, and dictionaries. The fastest is the list, then we have the set; the tuple is much slower than the list and the set, and the dictionary is the slowest of these collections. But we should keep in mind that dictionaries are key-value pairs, so their construction takes time, and moreover there is a hashing function which hashes the keys. We can also see the sizes of these collections. The smallest one is the tuple, then we have the list, the set, and the dictionary. The set is much larger than the tuple and the list, almost four times larger, but the reason behind it is that a set is essentially an optimized dictionary in the implementation: it just has keys and dummy values, so it uses only the key part of a dictionary, and that's how set is implemented.

Okay, the next example is combining the values of two lists: iterating by indexes, using the zip function, and using a dictionary. The zip function is the fastest one, then we have the dictionary, and then iterating over the two lists using indexes. So zip looks like the best option, but it's really not so usable in most cases; a dictionary is still fine, though.

The next example is checking whether an element is contained within a list. The first approach is just the `in` keyword. The second one is running through the whole list and checking if the element is present. And the last case is a binary search over the list: we split the list into two equal parts, check whether the element would be in the left or the right part, and then do that recursively. Note that the list must be sorted before running that operation. I compare all these checks for two cases: the positive case, where we will find the element in the list, and the negative case, where we will not find the element in the list.
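The talk's version is hand-written recursion; a compact sketch of the same binary-search membership check can be built on the stdlib bisect module — equivalent in spirit, though not the demo's exact code, and the test data here is mine:

```python
from bisect import bisect_left

def binary_contains(sorted_items, target):
    """Membership test via binary search; sorted_items MUST be sorted."""
    index = bisect_left(sorted_items, target)
    return index < len(sorted_items) and sorted_items[index] == target

data = list(range(0, 1_000_000, 2))        # even numbers only, already sorted
print(binary_contains(data, 500_000))      # positive case: element is present
print(binary_contains(data, 500_001))      # negative case: element is absent
```

bisect_left does the halving loop in C, which is why this beats both the `in` keyword and a hand-written Python loop on large sorted lists.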
Okay, so the first one is the `in` keyword. We can see that it runs two times slower if the element is not present in the list; the implementation here is essentially an optimized for loop over the elements. Then we can see the for loop over the elements implemented in Python: in the positive case the `in` keyword takes 1.6 seconds and the for loop takes 2.5 seconds, and the loop also keeps the same positive-to-negative ratio, so it's still two times slower when the element is absent. And the last one, binary search, is definitely the fastest. So the `in` keyword is the Pythonic way to check whether an element is present, but if we really care about performance and need it to run fast, binary search is the best option here, even if the implementation is more complex. It's still not that scary, right? It's just less than 15 lines. So if we really care about performance, we should use more complex solutions which are more efficient.

Okay, the next one is swapping variables. The first example shows swapping using a temporary variable, and the second one is swapping via a tuple, so assigning the tuple (y, x) to (x, y). Okay, so swapping using the temporary variable appeared to be faster now. That's not what I expected; there might be some process running in the background which is interfering. The difference is not so big, so if I run it again it will probably give the expected results... no... yeah, now it finally gives the smaller value for swapping via tuples. So let's see how it looks in the disassembled code. Here is the tuple swap: we're loading x and y, then rotating these two, storing x and storing y, and then returning None. While here, for the temporary variable, we have load fast, store fast, load fast, store fast, and again load fast, store fast.
That's three load-and-store pairs instead of two loads, a rotation, and two stores.

The next example is the efficiency of string construction. The first one is an f-string, introduced in Python 3.6; the next one is str.format; and the last one is percent formatting. F-strings are definitely the fastest, percent-formatted strings are in the middle, and format is the slowest. Format is nice to use, but it's slow, and now in Python 3.6 we have f-strings, so let's use them.

Okay, so we have a nested loop here, and what I will do is iterate over the smaller portion of data in the outer loop, then iterate over the larger amount of data in the outer loop, and then split it equally between the outer and the inner loop. In all these cases we have the same total number of iterations. Okay, so let's guess which one will be the fastest. The first example: who thinks it will be the fastest? Okay. The second example? Okay. And the last one? Okay, not everyone placed a bet, but okay, let's run it.

Okay, so the second one is the fastest: iterating over more items in the outer loop. I would have expected otherwise, but here it is. And there is not much difference between the first and the last example. Let's look at the disassembled code, because it may explain a bit more. Here we're starting the outer loop, here we're starting the inner loop, here we're finishing the inner loop, and here we're finishing the outer loop. What we would expect is that iterating in the inner loop would be faster, but it appears that iterating in the outer loop is faster. Why is that? Because here we're jumping back to the inner loop and here we're jumping back to the outer loop, so if we take this jump frequently, it's a bit slower than taking the other one. I'm not sure if that fully explains it; it's pretty hard to explain from the disassembled code alone, but the fact is that putting more iterations in the outer loop was more efficient than in the inner loop.
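The loop-split comparison above might be sketched like this; the iteration counts are illustrative, and which split wins can vary by machine and interpreter version, so treat the ranking as something to measure rather than assume:

```python
from timeit import timeit

def nested(n_outer, n_inner):
    """Run n_outer * n_inner trivial iterations in a nested loop."""
    total = 0
    for _ in range(n_outer):
        for _ in range(n_inner):
            total += 1
    return total

# Three splits of the same 10_000 total iterations.
for n_outer, n_inner in ((10, 1000), (1000, 10), (100, 100)):
    t = timeit(lambda: nested(n_outer, n_inner), number=200)
    print(f"outer={n_outer:4d} inner={n_inner:4d}: {t:.3f} s")
```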
Okay, the next example is using a global variable versus using a parameter. So again, who thinks that the global variable will be more efficient? Okay, and who thinks the parameterized version will be more efficient? Okay, let's see. We frequently use global variables, but it's not as efficient as the parameterized version. We can see the reason over here in the disassembled code: the only difference is that the global variable is loaded with LOAD_GLOBAL instead of LOAD_FAST, and it appears twice, wherever we load the global variable or the x parameter in the function.

Okay, the next example is slots. Who is familiar with slots? Okay, so I will explain it a bit more deeply. __slots__ lists the attribute names that the object will be restricted to. So if we have slots with only x over here, we cannot dynamically assign, for example, self.y; it will just throw a runtime error, we will not be able to do that. That sounds like a restriction for us, but what does it mean from the performance point of view? It's definitely faster with slots and, as we will see in a second, it consumes much less memory than a regular object without slots. We can see that three times more memory is used here. The reason behind it is that objects without slots will over-allocate memory, and we can see that over here.
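The slots comparison can be sketched like this — the class names and the single-slot layout are illustrative, not the demo's exact code:

```python
import sys

class WithSlots:
    __slots__ = ("x",)          # only attribute `x` is allowed
    def __init__(self, x):
        self.x = x

class WithoutSlots:
    def __init__(self, x):
        self.x = x

slotted = WithSlots(1)
plain = WithoutSlots(1)

# A slotted instance has no per-instance __dict__, so it is smaller overall.
print(sys.getsizeof(slotted))
print(sys.getsizeof(plain) + sys.getsizeof(plain.__dict__))

# Dynamic assignment outside __slots__ raises AttributeError at runtime.
try:
    slotted.y = 2
except AttributeError as exc:
    print("rejected:", exc)
```

The exact byte counts depend on the Python version, but the per-instance `__dict__` is what makes the plain object so much larger.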
It's three times more; that may of course differ based on the number of slots. If we, for example, give five or ten variable names here, it will differ.

Okay, that would be all from the demo part. Let's summarize it shortly. First: predict what would be the best solution, based on your experience, based on the examples you just saw, or just based on your hunch. Predict, to have preliminary data, but don't trust it at all: always profile your code. Predictions might be misleading. When I was preparing the examples for this demo, I encountered a lot of surprises; we have also seen one here, when I had to run the swap example three times because the data was not perfect enough, I would say. So things do not behave as we expect, and we should always profile our code. Even if you think it's fast enough, or the fastest, just check it, for your own peace of mind, right? And the last thing: calculate the return on investment. Try to find the best ratio between the possible profit and the optimization cost. There is no point in spending weeks on an optimization which will give you an unnoticeable improvement.

Thank you for your attention. You can find the slides on the left side, via the link or the QR code; I will also put them on the EuroPython website. I have also pushed the code snippets to GitHub; the link is available here as well. The repository is private for now; I will make it public right after the talk. And I would really appreciate any feedback about my talk, so I can improve myself and give better talks. There is a link on the right side to the feedback form; it's a simple anonymous form.
I would really appreciate it if you spend a few seconds to share your opinion. Thank you very much.

So we have time for a couple of questions. I see hands.

Q: Thank you for the talk. I was wondering, is there any easy way in cProfile to give a kind of percentage? There was total time, there was cumulative time; is there anything where you just give a parameter so it shows that one line or one allocation took, let's say, 80% of the total time, something like that, so you don't have to calculate it yourself?

A: Yes. That might not be so visible in the examples, but there is the total time of all the calls and the time per call for each function. You can also give a parameter to cProfile to sort the results by a specific column, so we can sort by total time or time per call.

Q: Okay, and also one short question. In terms of slots, your example didn't assign anything large to a particular slot. Let's say you allocate a hundred megs, or read a file of that size, with slots and without slots; do you think it will make a difference?

A: It's always good to profile the specific case, because it depends on what you're putting in the variable, but in general an object without slots is over-allocated, so it will allocate some memory right after the object to have space for dynamic assignments.

Q: Thank you. I've got two questions, actually. The first one is: are there any graphical tools for the profiling libraries you've shown?

A: Yes, there are graphical tools. I didn't list them because I wanted just the simplest tools which would be most accurate, but there are many of them; you can just Google them. I checked a week or two ago and there were a lot of them.

Q: All right, thanks.
Q: And the second question is: since the profiling you've shown us is not a hundred percent reliable, should we rerun it multiple times to get a better understanding of how the application performs?

A: Yes. When I ran the memory profiler a few times in a loop, for example, it stabilized with time. The first run is a bit inaccurate, but then it stabilizes and gives at least the total memory usage at a stable level. The increment also stabilizes, like we saw in the example where I ran the same function three times with memory_profiler: the total memory per process stabilized from the second run onward, but the increment still did not stabilize. It will stabilize at some point, though, because it probes the memory by inserting probes into the code.

Okay, we have to stop here. Thank you very much for your talk.