Hi, I'm Pratibha, and welcome to my talk on Python memory problems and how garbage collection happens in CPython. I have tried my best to simplify it as much as possible, and I hope it is helpful in understanding how objects get allocated in general. Now, memory problems are the worst nightmare of every developer whose code is serving large files or doing a lot of computation in a production environment. If you have ever faced a memory leak in an application, or an out-of-memory exception, and you are using Python and banging your head because everything is working as it's supposed to but you still can't figure out what is happening, then maybe this talk will help you understand how the underlying architecture of garbage collection works. This talk aims to summarize how garbage collection, that is, the freeing of memory, works in CPython, because memory issues can be overwhelming: there are many scary issues, and they are difficult to fix because they have dependencies. So let's get started. In recent years we have seen many improvements in Python's garbage collection, but there are still instances when we do not get back the memory that we free, or we do not get memory back when we remove large variables. This results in a memory crunch for the application, which finally crashes. Although there are multiple ways to overcome memory challenges, sometimes it is difficult to find what we can improve in our code and infrastructure to make them memory efficient. In such cases it helps to have an understanding of what is going on behind the curtain, at the lower level where memory is being managed. So this presentation gives a quick overview: we'll discuss some common memory errors that we see in our day-to-day lives, and how CPython manages garbage collection in general.
So let's list some of the memory issues we usually see in our day-to-day environment. Sometimes we have large objects: say we created a very big list or a very big dictionary for some computation, maybe we are parsing a large log file, but even when we are out of that function it is lingering in memory, the memory is not being released, and it hangs around for no apparent reason. Then there can be other reasons, like lingering references in your code. Assigning a reference doesn't create a distinct duplicate object, but if an object is no longer used and still cannot be marked for garbage collection, because it is referenced somewhere else within the application, it results in a memory leak. In fact, these kinds of referencing patterns are one of the main causes of memory leaks in applications. Now, this next one is something I faced very recently, in the last few months: I just got an unexpected memory error, and there was no obvious answer to why it happened. So even if you have enough RAM, you can get this unexpected memory error. After a lot of looking around and going through articles, I realized that the Python installed on my system was 32-bit, and it had used up all the virtual address space available to it; because 32-bit applications are limited to maybe 2 or 4 GB of user-mode address space, that was leading to this kind of issue. But the biggest one that we usually face is out of memory. This is something we can reproduce on our own by assigning large memory chunks to big files or big objects, or it is something where, from the surface, we don't know what is happening; maybe there's a memory leak.
So what is usually happening here: when an attempt to allocate a block of memory fails, most systems return this out-of-memory error. It's a generic one, but the root cause may not really be that your physical memory is full, because the memory manager on almost every modern operating system uses hard-disk space for storing memory pages that don't fit in RAM, a process called swapping. Because of that, the computer can usually keep allocating memory until the disk fills up, and only then do you get an out-of-memory error: that means your swap space is also filled and your memory is also filled. So there is a chance that your memory is not actually full, but the space reserved for swapping has been exhausted and your swap limit has been reached. But what if everything is in place and working as expected? Can there be another reason for these errors? This was the main thought in my mind when I was preparing this talk, and it led me to ask: how is memory actually allocated and deallocated inside CPython? So let's have a look at memory allocation and deallocation, that is, garbage collection, in Python, and I hope that helps us answer our question. The most common explanation of memory is to think of a computer's memory as an empty book; this is the usual explanation we use for people who don't know the internals, and the book is intended for short stories, which are like short-lived applications.
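One defensive pattern against this error can be sketched as follows (the helper `try_allocate` is a made-up name for illustration; whether a large request fails depends on RAM, swap, and OS overcommit settings):

```python
# Attempt an allocation and fall back gracefully when the allocator
# (and ultimately the OS, RAM plus swap) cannot provide the block.
def try_allocate(num_bytes):
    try:
        return bytearray(num_bytes)   # raises MemoryError on failure
    except MemoryError:
        return None

buf = try_allocate(1024)              # a small request normally succeeds
print(buf is not None)
```

A huge request (say tens of terabytes on a 32-bit build) would take the `MemoryError` branch instead of crashing the process outright.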
Now, there's nothing written on the pages yet. Different authors, which are like processes, come along, and each author wants some space to write their stories. Since they're not allowed to overwrite each other, they must be careful about which pages they write on, and before they begin writing they consult the manager of the book; the manager then decides where in the book they are allowed to write. This is the standard explanation of how memory allocation works in general. In fact, in correct terms, it's common to divide computer memory into fixed-length contiguous blocks called memory pages, so the analogy holds pretty well: the authors are like the different applications or processes that need to store data in memory, the manager who decides where an author can write in the book plays the role of a memory manager of sorts, and the person who removes old stories to make room for new ones is the garbage collector. Let's have a quick look at CPython's memory structures. In general there is a layer of abstraction from the physical hardware to the hardware CPython can use. The operating system abstracts the physical memory and creates a virtual memory layer that applications, including Python, can access, so this whole block that you see can be considered the virtual memory layer on top of the actual physical memory. An OS-specific virtual memory manager (it's a very long name, so let's just call it the virtual memory manager) carves out a chunk of memory for the Python process. Whenever there's a new process, it goes to the memory manager saying, hey, I have this process I want to run, I want some space; the manager looks into its contiguous memory space and says, okay, this particular set of memory is for you.
The dark gray boxes in the image below are owned by the Python process, and CPython has an object allocator that is responsible for allocating memory within the object memory area, which is the blue box here. This object allocator is where most of the magic happens. If you look, there are two parts: one is the object-specific memory, where your actual objects live, and the other is Python's non-object memory, which takes care of the machinery needed for processing, not the data itself but the stacks and other things it needs in order to run. Now, this object allocator, the blue one, gets called every time a new object needs some space to be allocated or an existing object is deleted. The question is: how will the memory manager know that a particular object in memory has to be deleted? It is not marked anywhere, so how should it know? That's the question we are going to answer now, so let's have a look at the garbage collection of CPython.
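Before moving on, the object allocator's work can be glimpsed from Python itself (a hedged sketch: the exact sizes, and whether a freed block is reused, are implementation details of CPython and may differ between versions):

```python
import sys

# Every Python object carries a header (type pointer, reference count),
# so even "empty" objects have a nonzero size.
print(sys.getsizeof(0))    # a small int is a full object, not just 4 bytes
print(sys.getsizeof([]))   # an empty list: header plus storage pointer

# When an object is freed, the object allocator can hand the same
# block to the next allocation of a similar size (not guaranteed).
a = [1, 2, 3]
addr = id(a)
del a                      # no references remain; the block is reclaimed
b = [4, 5, 6]
print(id(b) == addr)       # often True in CPython, but not a promise
```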
Now let's revisit the book analogy and assume that some of the stories in the book are getting old: no one is reading or referencing those stories anymore. If no one is reading something, or even referencing it in their own work, you can get rid of it to make room for new writing. That's where garbage collection comes in. There are two aspects of garbage collection: one is reference counting, and the other is generational garbage collection. Let's have a quick look at what a Python object looks like in virtual memory. Assume you have a variable x and you have assigned the value 20 to it. What will that look like? You will have a memory space which stores the actual integer 20, and you will have a label, x, which points to, or references, that memory location. The actual Python object holding the integer value 20 looks like this: it has a type, which tells us it's an integer; it has a value, which holds the actual value; and it has a reference count. Let's see what this reference count means in the next slides. The main garbage collection algorithm used by CPython is reference counting. The basic idea is that CPython counts how many different places hold a reference to an object. If we go back to our example, the reference count here is one, because we have only one reference to it, the variable x; if more variables point to the same object, it gets incremented. Such a place could be another object, a global or static variable, or even a local variable. When an object's reference count becomes zero, the object is marked for garbage collection, or rather it is deallocated; if it contains references to other objects, their reference counts are decremented, and those other objects may be deallocated in turn, so it can have a cascading
effect if the decrement makes their reference counts zero, and so on. The reference count field can be examined using a function called getrefcount; it is part of the sys module, which ships with Python. Notice that the value returned by this function is always one more than expected, as the function itself also holds a reference to the object while it is being called. So when you assign x to an object, the reference count is one; when you call getrefcount, the call itself references the variable x, so it reports two. Now assign a new variable y with x as its value: both x and y point to the same memory location, and the reference count of the Python object residing there is incremented, so getrefcount now reports three. Now delete the variable y: when you delete a variable that references an object, the reference count of that object is decremented, so when we run getrefcount again it comes back from three to two. That's how reference counting works in general. Now, there's a problem with this reference counting scheme. It is fine when you have simple variable assignments, in the simplest scripts everything will work, but the main problem with relying completely on reference counting is that it doesn't handle reference cycles. Let's see what reference cycles are. For instance, consider this code: in this example, x holds a
reference to itself, so even when we remove our reference to it, the variable x, the reference count never falls to zero, because the object still holds a reference to its own internal self. Therefore it will never be cleaned up by simple reference counting alone. For this reason, some additional machinery is needed to clean up these reference cycles between objects once they become unreachable. This is the cyclic garbage collector, usually called just the garbage collector; even reference counting is a form of garbage collection, but the garbage collector comprises both reference counting and the actual mechanism that takes care of objects with reference cycles. Now, one way of thinking is: why do we have reference cycles at all? Maybe we could avoid them, which is fine, you can leave that to your code and everything will work, but sometimes they are simply unavoidable, so let's look at that. The algorithm CPython uses to detect these reference cycles is implemented in the gc module. The gc module is part of the Python core internals and is available just like the os module. The garbage collector only focuses on cleaning container objects, objects that can contain references to one or more other objects: these can be arrays, dictionaries, lists, custom class instances, classes, extension modules, and so on. One could think that cycles are uncommon among these kinds of objects, but the truth is that the many internal references needed by the interpreter create cycles everywhere. Let's have a look at some notable examples. Exceptions contain traceback objects, which contain a list of frames referencing the exception itself; this is a very widely cited example of a reference cycle. Module-level functions reference the module dictionary, which is needed to resolve globals and which in turn contains entries for the module-level functions themselves. Instances have references to their class, which itself references its module, and modules contain
references to everything inside them, or maybe to other modules, and this can lead back to the original instances. So, like it or not, reference cycles are part of our code and our implementations, and we have to learn how to deal with them in the long run. Now let's look at garbage collection's additional machinery, the part that takes care of reference cycles, because by now we have understood that they are part and parcel of our package. In order to limit the time each garbage collection pass takes, the garbage collector in CPython uses a popular optimization called generational collection. Just for the record, this image was taken from a blog post; it is a very clear image, I really loved it, and there's a credit to it at the end in the useful links and credits. If I had created my own image it would have looked the same, so I reused it. The main idea behind generational garbage collection is the assumption that most objects have a very short lifespan and can be collected shortly after their creation. This has proven to be very close to reality for many Python programs, as many temporary objects are created and destroyed very fast; the older an object is, the less likely it is to become unreachable. To take advantage of this fact, all container objects are segregated into three spaces, called the three generations. Each new object starts in the first generation, the generation-zero list. The cycle-detection algorithm is executed only over the objects of a particular generation, and if an object survives a collection of its generation, it is moved to the next generation. So whenever a new object is created it goes into the generation-zero list; garbage collection runs over it from time to time; the objects which have no references are moved to a discard list and discarded, and the objects that still have references are moved to the generation-one list; and when the garbage collection of generation one
runs over the generation-one list, the same mechanism happens, and the surviving objects are moved to the generation-two list. In the generation-two list you have objects which are going to survive until the end of your program. Using this mechanism, things get easier: garbage collection can run less often on generation one than on generation zero, because generation zero is supposed to carry the objects that are created fast and deleted fast; collection happens less often on generation one, and less often again on generation two compared to generation one. This way we don't have to keep running garbage collection over the complete list; we have segregated objects into generations based on how long they have survived. This is one of the optimizations that helps limit the time CPython spends in garbage collection. Now let's have a look at how we can actually inspect this generation data from our Python code. A generation is collected when the number of objects it contains reaches some predefined threshold, which is unique for each generation and is lower for the older generations. These thresholds can be examined via the gc module: if you use get_threshold, you will see the threshold after which garbage collection will run on a particular generation's list. By default Python has a threshold of 700 for the youngest generation and 10 for each of the two older generations. You can check the number of objects in each generation using get_count: here, in generation zero you already have 596 objects, but in generation one you have two and in generation two you have one. As you can see, Python creates a number of objects by default before you even start executing your program. You can trigger a manual garbage collection pass using the gc.collect method. So here you have 596, 2, and 1, which existed just from typing python in the console, and you
will get those counts. Now run the gc.collect method: a garbage collection pass cleans up a huge number of objects. If you run collect and then run get_count again, you will see how many objects remain: here 577 objects were cleaned up from the first generation and three more from the older generations. The best part is that you can alter the thresholds for triggering garbage collection: you cannot change how reference counting works, but you can change how frequently garbage collection runs on these generations. How do you modify the values? There is a function in gc called set_threshold; using it you can modify these values. In the example here we increase each of our thresholds from their defaults. Increasing the thresholds will reduce the frequency at which the garbage collector runs, which makes the program less computationally expensive, but the catch is that dead objects will stay around longer in memory. So, like we said, if the threshold says that after 15 objects garbage collection should run on generation two, and there are objects in generation two that are only kept alive by cycles, they will still lie around until either the threshold of 15 is reached or something explicitly triggers a collection to delete those objects. One thing I have noticed, not a catch exactly, and it happens very rarely, is that people turn off the garbage collector altogether and manage it manually. That is also feasible, but it is advisable not to do it. With this we conclude our talk; I hope it was helpful to you. In summary, garbage collection is implemented in Python in two ways: reference counting and generational. When the reference count of an object reaches zero, the reference-counting
garbage collection algorithm cleans up the object immediately; if it has a cycle and the reference count doesn't reach zero, you wait for the generational garbage collection algorithm to kick in and clean up the object. While as a programmer you don't have to think about garbage collection in Python, it can be useful to understand what happens under the hood, because maybe you need some manual garbage collection to be run in your program. So with this, thank you for your time. Moderator: Thank you very much for your talk. We can still take a few questions if somebody has one. Yes, somebody is making their way to the microphone, so please ask your question. Audience member: Many thanks for your talk; I finally understood what generational garbage collection is, I never exactly understood it before. I have one question: we use Python in control systems, and we tend to run Python processes for very long times, months and sometimes even a year or two. We have observed that whenever we have a high peak of CPU usage on the machine where it is running, it's like the garbage collector stops working: we see a huge increase in memory, and then it never cleans up. Do you have an explanation for that? Pratibha: This is my assumption, based on the data you have provided: whenever you have a spike in CPU, you notice that garbage collection doesn't kick in, am I right? Audience member: Excuse me, what did you say? Pratibha: I said that whenever there is a spike in CPU, meaning the CPU is getting overloaded, garbage collection doesn't happen and the memory doesn't get freed. Audience member: Exactly, yeah, exactly right. Pratibha: So here's the thing: when you run a garbage collection to free up memory, it needs its own CPU cycles. If your CPU cycles are already being used for computation somewhere else, then of course the collection will get queued up, with the operating system saying, hey, wait, we have this computation already
overloading the CPU, and you have to wait. So that's one of the possible reasons; that's what I assume is happening here. As long as CPU cycles are available, the operating system says, hey, garbage collector, go and clean up your garbage so that memory gets freed. But yes, it can become a cyclic deadlock: the CPU is not free, the computation is blocking it, garbage collection is not able to clean up, that fills up the memory, and with memory filling up the computation is not able to finish, and the CPU spikes further. In that case you have to figure out whether there is some kind of memory leak, or you have to restrict how the CPU utilization spikes, because in a production environment we say that if your CPU utilization is more than 70 percent, you should spin up a new instance so that the traffic, or the load, gets redistributed. Audience member: Okay, thank you very much. Moderator: That's all we have time for today, so let's have another round of applause for Pratibha. Thank you.
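The examples walked through in the talk can be reproduced in a Python session roughly like this (a sketch: exact counts and default thresholds can vary across CPython versions):

```python
import gc
import sys

gc.disable()  # pause automatic collection so the steps below are deterministic

# Reference counting: getrefcount reports one extra reference,
# because its own argument temporarily references the object.
x = object()
base = sys.getrefcount(x)            # typically 2: x plus the argument
y = x
assert sys.getrefcount(x) == base + 1
del y
assert sys.getrefcount(x) == base

# A reference cycle: the list references itself, so deleting x
# never drops its reference count to zero...
x = []
x.append(x)
del x
# ...and the cyclic garbage collector has to reclaim it instead.
unreachable = gc.collect()           # manual collection works even when disabled
assert unreachable >= 1

# Generational thresholds, as discussed in the talk.
print(gc.get_threshold())            # defaults are typically (700, 10, 10)
gc.set_threshold(1000, 15, 15)       # collect less often, keep garbage longer
gc.enable()
```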