Yes, okay. So good morning, and thank you for joining. I am Sanket, and I work as an SRE at LinkedIn. Today we're going to hunt a memory leak. We all know what a memory leak is, but just to reiterate: a memory leak is a kind of bug where your program, over time, starts using more and more memory, and eventually it crashes because you have finite memory. We'll be hunting one such leak today. We actually faced a memory leak like this in production, so we'll look at a simulated scenario similar to it and try to hunt it down. So let's get started.

Here's what we're going to talk about today. First we'll be exploring Python and its objects: what do we mean when we say an object in Python? We'll see what happens when you allocate them, as in what happens internally when you create an object, and what happens when you deallocate them, as in when they go out of scope or you delete them. Then we'll take a look at the leak. We'll use our knowledge of objects and allocation patterns to confirm the leak's existence, because before you even go hunting, you have to know for sure whether there's a leak or not. And finally, we'll go hunting.

What we won't be exploring today are the full specifics of memory management and garbage collection in Python; we'll only look at the details that are enough to understand and debug this memory leak scenario. And we also won't explore why this happened, because we already know the answer now.

Okay, so before we get started, let's take a look at an interesting scenario with Python 2 and integers. I know Python 2 is close to its EOL, but the scenario is still interesting, so let's take a look. Simple code: we import a bunch of modules, then we create a list, a huge list made up of 100 million integers. It's as simple as that. We print the memory usage, and it comes out to be 3.2 GB.
That's understandable: it's a huge list, a hundred million integers, and the size comes out to be 3.2 GB. Now what we do is explicitly delete the list. If you delete it, ideally all of that memory should be given back, because you have explicitly deleted it. But despite deleting it explicitly, when we print the memory usage it comes out to be 2.5 GB. There's nothing else in the program. As you can see, we just created a huge list of a hundred million integers and deleted it, and still there is 2.5 GB of memory in use. So what exactly is happening here? What's going wrong, or what's going right? Why is it happening? We'll see, after we learn how allocations and deallocations work.

So, objects. You might have heard that everything in Python is an object. What do we mean when we say everything is an object? An integer is also an object, a string is an object, everything is an object. What does that mean? Well, if we talk about Python's most famous implementation, the one in C, that is CPython, here is the equivalent. Basically, the C equivalent of an object is a struct called PyObject, and it starts with something called PyObject_HEAD. What does the head contain? Among other things, it leads to pointers to various functions. What kind of functions? Say you call .append() on a list: which C function should be called internally? A pointer to that function. In the case of a dictionary, which internal functions should be called when you set an item or get an item? Pointers to all those functions are reachable through the head part. One field of particular interest to us today is the reference count that you can see; it is one of the fields in the head part, and we'll see what the refcount field is and why it's important. So that is PyObject.
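As an aside, you can watch that refcount field from Python itself with sys.getrefcount. This is a minimal, CPython-specific demo; note the count is always one higher than you might expect, because the call's own argument adds a temporary reference:

```python
import sys

x = object()
print(sys.getrefcount(x))   # typically 2: the name 'x' plus the call's argument

y = x                       # a second name referring to the same object
print(sys.getrefcount(x))   # typically 3

del y                       # dropping a name decrements the count
print(sys.getrefcount(x))   # back to 2
```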
If we talk about a specific object, let's take the integer object as an example. An integer object is again castable, in C terminology, to PyObject, so it will obviously contain the head, with pointers to all those C functions, and of course, because it's an integer object, there is a C long (or an equivalent type) value contained in it. So this is what objects are in Python: basically C language structures.

Before we go into the alloc and dealloc part, let's take a detour and explore two more concepts. The first is free lists. When you read CPython's code, you might come across variables called free_list for various types of objects. The two examples I've taken here are PyListObject and PyDictObject. It's a list of pointers, and it's confusing: a pointer to a list versus a list of pointers. This is a list of pointers to, say, PyListObject, and its size is fixed; let's say the size is 100. Whenever your Python interpreter starts, this list is initialized with, say, 100 slots. What is the list useful for? As your program runs, you might be creating lists (we're taking lists as the example). When those list objects are deallocated or go out of scope, you don't release the memory; the first hundred of them get pointed to by the pointers in this list. So this is basically a kind of object cache: the first hundred objects, you keep here, and you never release that memory back. That's a free list: basically a cache to make allocation fast, because you always have up to a hundred objects already allocated in memory. There's even one for PyFrameObject. What is a frame object? Whenever you make a function call, an equivalent frame object is created, and even those frame objects are cached this way. So what is a free list?
Basically a cache of objects: when you deallocate, you fill the cache, and the objects can be reused. That is the free list. The other concept, before we get to allocs and deallocs, is garbage collection. In Python we don't do any explicit memory allocation or deallocation the way we do in C with malloc and free calls. Whenever objects go out of scope, they are deleted, garbage collected, automatically. Python does it for us; we don't do it explicitly. But how does Python do it internally? Well, like most languages, Python has generational garbage collection. There is a doubly linked list maintained, per Python interpreter, per Python process, for each generation's objects. There are three generations in total. And when you run garbage collection on the last generation, you also clear the free lists, that is, the caches of objects we just talked about. If you read the code as it is in the GC module's source, this is what it looks like: when you run the GC on the last generation, it clears all the free lists. Why are free lists and these concepts important? We'll see.

So, finally: what happens when you allocate an object? Because we have the free list, that cache of objects, you may already have memory allocated there. If there's an entry available in the free list, you basically take the last occupied slot from the free list and reuse that object. Otherwise, if the free list has nothing to offer, you allocate fresh memory. You initialize the object, and you register it with the GC, as in add it to the relevant doubly linked list. Now, you see there's a little star there; that means conditions apply. What does that mean? If you were to categorize Python objects into two broad categories, the two categories of interest to us, they would be immutable and container.
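You can actually see that star, which objects the GC tracks and why it matters, from Python itself. Here is a minimal sketch using the gc and weakref modules (the tracking behavior shown is a CPython implementation detail):

```python
import gc
import weakref

gc.disable()                   # keep automatic collection out of the demo

# Atomic/immutable objects cannot form reference loops, so the GC skips them
print(gc.is_tracked(42))       # False
print(gc.is_tracked("hello"))  # False
# Containers can refer to other objects (even themselves), so they are tracked
print(gc.is_tracked([]))       # True

# Why tracking matters: a reference loop keeps the refcount above zero
class Node:
    pass

n = Node()
n.me = n                       # the object refers to itself
alive = weakref.ref(n)
del n                          # refcount never drops to zero here...
print(alive() is None)         # False: still in memory
gc.collect()                   # ...the cycle detector has to free it
print(alive() is None)         # True

gc.enable()
```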
What does immutable mean? Immutable means things like longs, ints, floats, or strings. The values of these objects are fixed; they cannot be changed. In contrast to that, there are container objects. What does container mean? They can contain other objects: a dict can contain a list, a list can contain a dict, or a list can even contain itself. So those are the two kinds of objects.

Why doesn't every object get GC-tracked? Well, in the case of immutables, remember we had a reference count field in the object head. Every object in Python has a reference count. Whenever a reference goes out of scope, the reference count is decremented, and if the reference count drops to zero, the object is garbage collected, the memory is basically freed. Why do we need the GC specifically for the container kind of objects? Because of what can happen: you create a list and you append the list itself to the list; now you have a reference loop. If there is a reference loop, the reference count will never drop to zero, because the list is referring to itself. So that's a reference loop, and reference counting alone cannot help you with that. That's why you have the GC module: it has the capability to detect such reference loops and free the objects for you. That is the GC module, and that is why not every kind of object gets GC-tracked; only objects that can potentially form a reference loop get tracked.

Now, deallocation. When do you deallocate an object? When its reference count drops to zero, as in no one is referring to it. What do you do then? Because the reference count has dropped to zero, you remove the object from garbage collection tracking, that is, you modify the doubly linked list pointers. You decrement the reference counts of the contained objects. Let's go back to our initial example: you're deleting the list.
All the integers contained within it are no longer being referenced, so you have to decrement the reference count for each of those as well. So in our initial case, the reference count was decremented for all hundred million integers, and it did drop to zero, because there were no other references to those integers. So the memory ideally should have been freed, but it was not; we'll see why. If the free list has space, you put the object there, because it's a cache we're trying to fill up. And if your cache is already full, say a hundred free objects are already there, you free the memory and give it back to the OS. Of course, there's a bit more complexity going on inside, but we're trying to simplify.

Just to recap: we had the free list. What is the free list? Basically a cache of objects; if there's a free entry, you reuse an object from there. There was the generational GC, specifically for container objects, which can form reference loops. When you allocate an object, its reference count is incremented or set to one; if there's space in the free list you use the free list, and if it's a container kind of object you track it with the GC. On deallocation, when the reference count has dropped to zero, you untrack it from the GC if it was tracked, and you either put it in the free list or free the memory.

Now, the original case: when we deleted the list, all the memory was still not given back to us. Why was that happening? Well, as it happens with Python 2 and integers, the free list is not bounded by a specified size. As you go on allocating integers, the free list keeps expanding. It's basically unlimited in size, unlimited as in there is no fixed bound; it's a linked list, and it keeps expanding. Why is it implemented like that? Well, the authors must have noticed that there are lots of integer allocations happening during a Python program's runtime, and to optimize that, what they did is: whenever you allocate an integer, CPython allocates a whole bunch of integers.
Twenty-four, to be specific: when you allocate one integer, space for an extra 23 integers is allocated along with it, so the next 23 allocations go a lot faster. This is just an implementation-specific detail. And when is this free list freed? Of course, when you run the GC on the last generation. If you want to read more of the specifics, or the code itself, I'm leaving a link, so you may explore that.

So now, let's go to the leak. The app that we saw in production was a Flask app; I guess we all know what Flask is. Here's what it did. It was basically a proxying app: it takes an HTTP request, calls some downstream service, gets some metrics, and gives them back. It was as simple as that. In the simulated scenario, what does it contain? It does not fetch metrics from anywhere; it just generates a bunch of random integers and gives them back to you. So that's the API. There is a helper function, a wrapper, a decorator, which converts the response into JSON. There is some performance profiling, as in how long it took to give the response back. And of course there's the leak, which is what we're here for. It's an amplified leak, so that its effect is visibly apparent.

If we look at the code of the app, it's fairly simple; I'll read through it line by line. We import the app. We import two helper functions: one helps generate the response, a bunch of random integers, and one is the wrapper that converts our response into JSON. The API code is simple: you just call the helper function, and whatever you get, you return. So this is the app, as simple as that.

Let's try to run it. When we start the app, there's a simple script that just prints the memory usage of the app that is running. When we start, before we've sent anything, the memory usage comes out around 24 MB. Now we send a bunch of requests.
For sending requests we're using a tool called Apache Benchmark, ab. What it's used for is sending a ton of requests in parallel and then printing statistics: median time, max time, how long each step of the connection took, all the fancy things it can do. But what we're using it for here is just sending a bunch of requests. You can see ab -n 50, which means: send 50 requests to the given endpoint. And when we print the memory usage afterwards, it comes out to be 4085 megabytes. That is a lot for 50 requests, because there is an intentional leak put into the code. But just by looking at this number, 4085, you cannot be sure there is a leak. There may be legitimate memory usage; this number by itself is not enough to establish the existence of a memory leak.

So what we'll do is take the knowledge from the earlier allocation and deallocation part. What did we see? When you run a GC on generation 2, what happens? All the reference loops, if there are any, those objects are collected, so that memory is freed. All the free lists are cleared.
So that memory is also freed. You can be sure that when you run a GC on generation 2, whatever can be collected will be collected. Right? There is no scope for garbage to stay behind after you run the GC. So, just to be sure, when we return the response, we'll run the GC, so that whatever garbage can possibly be collected will be collected. Just before you return the data, you collect the garbage. When you don't pass an argument to gc.collect(), it runs on the last generation, so the free lists will be cleared as well.

Now let's do the memory profiling again, with the garbage collection in place. The initial memory usage is again 24 MB, as we saw last time. We again send 50 requests. This time it will run a lot slower. Why? Because after every request you're doing a garbage collection, so your GC module is again walking all the generations' linked lists, and it's clearing the free lists. So this runs a lot slower. But you can see that after GC, the memory usage comes out around 4070 MB. That's a 15 MB reduction, which is not a lot, and a significant amount of memory is still occupied. This tells us that there's a leak. So I hope this is proof enough to establish that this code contains a memory leak.

So, let's go hunting. There are various tools available. The first option is core dump analysis. You have a process running, it's using 4 GB of memory; you just take a dump, a snapshot of the memory, and you analyze it. What can you get from that? Okay, you will see that there are a ton of integers: out of 4 GB, 3.7 GB is made up of integers. But this does not tell you much. It only tells you that integers are leaking somewhere in the code. But where? It doesn't tell you. Then there are some Python libraries; some of the names are objgraph, memory_profiler, heapy, tracemalloc, etc. We went with tracemalloc. Why?
Because it comes built into Python, from Python 3.4 onwards I believe, and of course this code was running in production, so it was not really feasible to pull a library from the internet and run it in production. It's not as simple as that. So we went with tracemalloc.

What does tracemalloc do? Well, as the name suggests, it can trace an object back to where it was created. It can also print statistics, as in how much memory was allocated at which point in the code, as in which file and which line number. And one of the other beautiful things it can do: it can take snapshots during your program's runtime, and you can compare memory snapshots and see what changed and what caused the change.

A kind of hello-world program for tracemalloc: you just import it; you don't have to install anything, because it comes built in. You call tracemalloc.start(), and from that point onward tracemalloc will be tracing all your memory allocations, as in which part of the code is allocating what memory. Then you can take a snapshot by calling tracemalloc.take_snapshot(); the snapshot will contain the memory allocations done up to that point. And then you can print statistics; here we're printing our top 10 allocations. So that's how you can use tracemalloc.

How does it work? Well, in Python 3 there is an API for memory-related operations; the functions are named things like PyObject_Malloc, or PyMem_Malloc and PyMem_Free. Any time you do any memory operation, it goes through these functions. What tracemalloc does, whenever you call tracemalloc.start(), is override these functions.
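The hello-world flow described a moment ago, as a runnable sketch (the bytearray list is just a stand-in workload so the statistics have something to show):

```python
import tracemalloc

tracemalloc.start()                    # trace every allocation from here on

data = [bytearray(1000) for _ in range(1000)]   # ~1 MB of traced allocations

snapshot = tracemalloc.take_snapshot() # allocations still alive at this point
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)                        # per file/line: total size and count

tracemalloc.stop()
```

The snapshot object keeps its data even after tracemalloc.stop(), so you can analyze it later.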
So whenever these functions are called, tracemalloc gets to know about it. Tracemalloc takes note of where the memory operation is coming from, as in which file and which line number; it registers that information itself and then does the actual memory allocation. Similarly, when you free any memory, tracemalloc gets to know first, because it has overridden the functions; it registers who did the operation, which part of the code did it, and then performs the free operation as usual.

So, to reiterate: our app was stateless. What it was doing was fetching some metrics and serving them back to the HTTP requester. Why is this important? It's important to understand the nature of the application. During the handling of a request, there are not supposed to be any lasting memory allocations, because the app just fetches and pushes; there is not supposed to be any persistent memory allocation.

How will we use this information? Let's devise a plan. We start the app. We warm it up, maybe by sending a couple of requests, so that everything is loaded into memory. We take a memory snapshot; when you take a snapshot, you have all the allocations done up to that point, so you have that information with you. We send some traffic, as we did earlier. And we take a snapshot again, at step four. In the ideal case, when you send traffic, because the app is supposed to be stateless, it's not supposed to retain any memory. So when we compare the snapshots from step two and step four, in the ideal case there should not be a huge difference. But if there is a memory leak, the snapshot comparison between two and four will show a huge difference, and it will tell you where the memory allocation was done. So in the end, we analyze these two snapshots and see what part of the code did the memory allocation.

So here is the equivalent code for our plan. I'll again go through it.
We import tracemalloc. We call tracemalloc.start(), so that every memory operation is traced from that point onwards. We have our original API, and we add one more HTTP endpoint, called /snapshot. Now, what we do here: the first time you call /snapshot, it takes a memory snapshot. That will be the initial snapshot, at step two. And when you call it one more time, it compares against the original snapshot; this will be step four, when you call it again. So on the second call it compares with the original snapshot and prints the top five stats. I'll pause here for a moment; take a moment to go through the code. It should be fairly understandable. Good, we're good to go.

So let's execute the plan. First, as we said, we warm up the app a bit. We'll again use ab, with -n 2, meaning we send two requests, just to warm up. We take a snapshot, the initial one, by calling the /snapshot API. Then, as usual, we send 50 requests, as we've been doing. In this case, again, this will run a lot slower, and that's why we're not doing a live demo. Why so much slower? Because once you start tracemalloc, each and every memory allocation, and there are tons and tons of them, goes through tracemalloc's overridden API calls. Tracemalloc registers all of them and does its own state management.
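Stripped of the Flask endpoint plumbing, the snapshot comparison at the heart of this plan is just a few lines. Here is a minimal sketch, with a growing list standing in for the leak and the request traffic:

```python
import tracemalloc

tracemalloc.start()

baseline = tracemalloc.take_snapshot()      # "step two": after warm-up

leak = []                                   # stands in for the leaking state
for _ in range(50):                         # "step three": send traffic
    leak.append(bytearray(100_000))

after = tracemalloc.take_snapshot()         # "step four": after traffic

# "step five": the diff points at the file and line that gained the memory
for stat in after.compare_to(baseline, "lineno")[:5]:
    print(stat)

tracemalloc.stop()
```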
So this will be running a lot lot slower But the uh bonus point is like you will will get to know what the leak is So again, uh, when we call uh this last snapshot api again It will compare the that snapshot with the original snapshot And it will print the statistics So as it comes out, uh, the first line itself in top five, uh, there's a file called helpers dot pi and line 25 Did uh approx four gbf allocations So it's supposed to be stateless, but this file this line did all the allocation So we we we know where to look So what was the actual culprit as it turns out the json api wrapper, uh was Doing some uh as we said, there's some uh time management or the profiling done So whenever the function was called it was uh like taking note of how much time was it taken and it was storing The time taken into a global variable. It was not exactly global in production environment, but in this case we put in global variable So as you can see the highlighted code, uh, the there's a global dict called perf data And you are putting the time taken into a global dictionary. So because it's put in a global dictionary It will never go out of reference because it's a global variable reference count will always be at least one So despite we ran gc each time the memory was not collected and hence this is a leak Uh, if you want to go through the code, I'll I'll wait a moment So in actual production, it was a thread pool library overridden and there was Bunch of two three more layer of abstraction So by reading the code, it's not really obvious because when you work in complex code scenarios, there are The memory leak can be hidden in a like two three level of down downwards the dependency So this kind of tracing only will uh bring this issue to light So yeah, this was our leak Just to recap We understood how python does memory management. That is freely is garbage collection when does what gets cleaned We understood application behavior. 
This is important because we know the app is stateless. We were able to devise a plan and use it to compare snapshots and find the memory leak. So that is it. I hope you learned something new today. Thank you. Questions, if you've got any.

Q: Hey, thanks for the talk, that was a good one. In this case you're instrumenting the process from scratch, right? What if the process is already running and it starts misbehaving? Do you have something for that?

A: You mean dynamically attaching to a process that was performing properly earlier and then started misbehaving, doing a lot of mallocs?

Q: Okay, what can you do then? Like, in this case you started the process from scratch.

A: Yeah, in that case I think it will be a bit harder. I tried at first to go that route: let's not disturb the process, if there is some solution available to attach to the process and then do a sampling kind of profiling. But I could not find such tools. If anyone is aware of any, let me know. This approach did require a code change, but the bonus, or the brighter side, was that I didn't have to install any external dependency. A tool like that would be great, though: don't disturb the process, and trace the allocations from outside. I'm not aware of any tool that does that at the moment.

Q: Okay, thanks.

Q: Hi. Suppose I'm training my model, a kind of standalone Python program. How can we trace memory leaks in that?

A: You generally get to know about a memory leak when your process starts dying. In our case, what was happening is we had lots of uWSGI workers, and when we started sending a lot of traffic, they were configured to auto-restart, so we didn't notice it at first. But when we checked the logs, we were seeing lots and lots of restarts. So this generally comes to light when your process starts crashing, as in it uses all the memory available to it and then crashes. That's when you come to know.
Otherwise, you would need to know what kind of memory usage your app is supposed to have, and notice when it goes beyond that. The main thing to watch for is that it keeps increasing; it will never stay stable. If you see it keep increasing with time, then you suspect a leak, and then you've got to investigate.

Q: So, earlier somebody was mentioning: how do you debug a running process without disturbing, without impacting it?

A: In that context, strace is your go-to option. Again, it won't pinpoint things the way this approach can; it won't point you to the code. But you'll still have much more detailed context of what is happening at the process level than you would have otherwise. At the same time, there's also another Python-aware debugging tool called drgn, which might help you do this as well.

A (speaker): Yeah, drgn, I'll check that out. But the original question was: wouldn't it be easier to enable debugging flags in Python and just run it with that, instead of all this? There are memory allocation hooks available in Python which can be used, but in production you will not have a debug build of Python.

Q: Oh no, not in production.

A: Yeah. To debug these kinds of cases: only production gets that much traffic. So either you'll have to figure out how to simulate it in your dev environment, or you'll have to do it in production; the production traffic exists only in production. So it may be tricky to reproduce in your dev environment.
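To close, the leaky pattern from the talk boils down to a few lines like the following. This is a hypothetical reconstruction: the names json_api and perf_data are stand-ins based on the talk, not the production code, and the payload is simulated:

```python
import time

perf_data = {}                       # module-level global: lives forever

def json_api(fn):
    """Times each call -- and accidentally keeps every response forever."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        # The leak: stashing per-request data in a global dict means its
        # refcount never drops to zero, so gc.collect() cannot reclaim it.
        perf_data[start] = (time.perf_counter() - start, result)
        return result
    return wrapper

@json_api
def get_numbers():
    return list(range(100_000))      # stand-in for the random-integer payload

for _ in range(5):
    get_numbers()
print(len(perf_data))                # grows with every request
```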