Good afternoon, ladies and gentlemen. My name is Raymond Long. I'm a software engineer in the Red Hat core kernel group, and my main responsibility is maintaining the cgroup and locking subsystems in the kernel. Today's topic is two major new features in the upcoming RHEL release, which we backported from the upstream kernel. As you may know, there are two main features within the Linux kernel that make containers possible: the control group, which we usually call cgroup, and the namespace. Without these two, we wouldn't have Linux containers at all. The two new features I'm planning to elaborate on today are the new cgroup slab memory controller, introduced in the 5.9 kernel, and the time namespace, which was introduced a bit earlier, in 5.6 or 5.7. Besides these two main features, there are other worthwhile new features that I'll briefly talk about, which will also be included in the next RHEL release. One is the ability to throttle memory allocation when a memory controller is approaching its memory limit. That can avoid the out-of-memory kill that may sometimes happen when an application within a container happens to use more memory than is allocated to the container. Also, we are now able to account the kernel's use of per-CPU memory, in addition to regular slab memory and page cache memory. And we have integrated better writeback control, which helps limit the unexpected stalls that can happen because of the I/O imbalance inherent in some application use cases.

Before that, I'd like to talk about the control group in the Linux kernel. There are actually two different versions: v1 and v2. V1 is what we call the legacy implementation, and v2 is the newer one.
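The memory throttling mentioned above is exposed through the cgroup v2 `memory.high` knob. A minimal sketch of how it is set up, assuming a cgroup v2 hierarchy mounted at `/sys/fs/cgroup`, root privileges, and a made-up cgroup name `demo`:

```shell
# Create a child cgroup under the v2 hierarchy (needs root).
mkdir /sys/fs/cgroup/demo

# memory.high is the throttling limit: allocations above it are slowed
# down and reclaim kicks in, instead of an immediate OOM kill.
echo 200M > /sys/fs/cgroup/demo/memory.high

# memory.max is the hard limit; only beyond this will the OOM killer run.
echo 250M > /sys/fs/cgroup/demo/memory.max

# Move the current shell into the cgroup so its allocations are charged there.
echo $$ > /sys/fs/cgroup/demo/cgroup.procs
```

The gap between `memory.high` and `memory.max` is what gives the workload room to be throttled and reclaimed before the hard limit is ever hit.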
Most of the new features go into cgroup v2, because that is the one under active development. V1 is kind of in maintenance mode: we try to make sure nothing breaks, but we are less likely to add new features to it. The major difference between the two versions is that in cgroup v1, each controller can have its own hierarchy, completely different from the others. On the surface that seems more flexible, but on the other hand it increases complexity, especially when you need coordination between different controllers. That is the main reason we have cgroup v2. In cgroup v2, the major theme is that there is only one unified hierarchy for all the controllers supported in v2 mode. So instead of a different hierarchy for each controller, we have one single hierarchy for all the controllers running in cgroup v2 mode. By the way, a controller can run in either v1 or v2, but not both. When you set up a system, you have to choose whether a given controller is used in v1 or v2. Currently, I think most distributions still use v1 as the default, because that is what people are used to, but the trend is that over time more and more distros will switch to v2 as the default, because that is where the new features are going. So if you want a new feature that is only in v2 and not in v1, the only way to get it is to switch to cgroup v2.

Okay, now I'll talk about the new cgroup slab memory controller. In the kernel, besides the page cache, which is used to cache normal file data, there is another class of kernel memory used by all the internal data structures created in the kernel: slab memory.
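A quick way to tell which mode a system is running in, as a sketch; the boot option shown applies to systemd-based distros such as RHEL and Fedora:

```shell
# "cgroup2fs" means the unified v2 hierarchy is mounted at the top;
# "tmpfs" means the system is still on the v1 layout.
stat -fc %T /sys/fs/cgroup/

# On a v2 system, all controllers share one hierarchy and are listed together:
cat /sys/fs/cgroup/cgroup.controllers

# To boot a systemd distro with v2 as the default, add this to the
# kernel command line and reboot:
#   systemd.unified_cgroup_hierarchy=1
```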
In the old implementation of the slab controller, whenever you create a new memory cgroup, you create a parallel set of slab caches. For instance, if you create a new process, the kernel needs a task structure for each new thread or process, and the memory for that structure comes from a slab cache created when the system boots. There are many different types of structures in the kernel; judging by the number of caches available, it is around 250 or so, more or less. That means that whenever you create a new memory cgroup, and a process within the container needs to allocate one of those structure types, a parallel set of slab caches gets created for that cgroup. As you can see from the diagram, each memory cgroup has an associated set of slab caches. And within each slab cache, for performance reasons, there are per-node as well as per-CPU caches. So for each CPU used within a container, the slab cache keeps a few slabs dedicated to that CPU as a cache. Within those slabs, not all the available objects are allocated; some are free, waiting to be handed out when they are needed. Because of that, each CPU will hold on to a number of pages of memory that are set aside for it but not fully utilized. As a result, if you need to create many containers, or many memory cgroups, a lot of memory can get tied up in those slab caches. It is estimated that on a modern distro like RHEL, about a megabyte of memory is consumed for each CPU used in a given memory cgroup, give or take, depending on the workload you are running.
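You can get a rough count of those slab caches on a running system from `/proc/slabinfo`; reading it usually requires root, and the exact count varies with the kernel configuration:

```shell
# /proc/slabinfo has a two-line header; every line after that is one
# slab cache. Expect a couple of hundred entries on a typical kernel.
sudo tail -n +3 /proc/slabinfo | wc -l

# The cache backing task_struct, the structure behind every thread/process:
sudo grep '^task_struct' /proc/slabinfo
```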
So that can be quite a lot of memory. Now I'm going to talk about the new cgroup slab memory controller. The major difference between the new controller and the old one is that all the memory cgroups share the same set of slab caches. So instead of n sets of slab caches for n memory cgroups, we now have a single set of slab caches shared by all the memory cgroups. But in order to manage the accounting of the memory allocated to each cgroup, we now have a new structure that sits as a bridge between the slab cache and the cgroup. We call it the object cgroup (obj_cgroup) structure. The purpose of that structure is to do the accounting of the memory charged to each cgroup. In the old controller, the memory usage of each cgroup was accounted in pages, the number of pages used by the cgroup. The new slab controller accounts in bytes, and the byte counting is done in the object cgroup. When the number of bytes accumulated in the object cgroup exceeds a page, the resulting full pages are charged to the memory cgroup itself, and the object cgroup is mainly used for keeping track of the remainder of less than a page. This matters because kernel data structures vary widely in size: from as little as 8 bytes to more than a page, 4K, 8K, and so on, depending on what type of object is being created. Besides maintaining the byte accounting, the object cgroup structure also maintains a reference count. For each object in a slab, there is an associated pointer that references the object cgroup structure currently using it. So if a slab holds, say, 10 objects, an array of 10 pointers is allocated to point to the corresponding object cgroup structures.
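This per-cgroup byte accounting surfaces in each cgroup's `memory.stat` file on a v2 system. A sketch, assuming a systemd-managed v2 hierarchy; `system.slice` is just one example cgroup:

```shell
# Slab memory charged to this cgroup, reported in bytes, split into
# reclaimable (e.g. dentries, inodes) and unreclaimable parts, plus a
# combined "slab" total.
grep -E '^slab' /sys/fs/cgroup/system.slice/memory.stat
```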
This new controller does have a little bit of memory overhead: for each object, one pointer is needed to keep track of which cgroup it is associated with. So what are the actual benefits of the new slab memory controller? With the new controller, we tend to use a lot less kernel memory for slabs, and that can save quite a bit of memory when you need to serve a lot of memory cgroups, for instance when you create a lot of containers. The memory reduction is even more prominent on architectures like PowerPC and arm64, which we currently support with 64K pages, while on x86 systems the page size is 4K. The general rule of thumb is that the smaller the page size, the smaller each slab will be, because a slab is usually a multiple number of pages, and usually that number is a small power of two. Smaller page size means smaller slabs, and less memory wasted in the per-CPU and per-node caches. Someone ran a benchmark on a PowerPC system, with one pod containing ten containers executing some kind of workload, on two kernels: one with the old and one with the new cgroup slab memory controller. They measured the amount of kernel memory used by the whole pod as well as by each container, specifically the memory used by the slab caches. It turned out that with the new controller, at the pod level the slab memory consumption went from 3 GB down to about 400 MB, which is more than a 7x improvement. At the container level, that is, within each container, the amount of slab memory used was reduced from 112 MB to 90 MB. The measured execution time also improved a bit.
That is quite noticeable: there is less overhead in creating new slab caches at startup, and because less memory is used, the cache footprint is smaller, which is probably why the performance improves a bit as well.

The other feature I want to talk about is the time namespace. The purpose of the time namespace is to virtualize two system clocks supported by the Linux kernel: CLOCK_MONOTONIC and CLOCK_BOOTTIME. The monotonic clock is a clock that always increments and never goes backward. The boottime clock accounts for the amount of time elapsed since the system booted. There is another clock, CLOCK_REALTIME, which is used to measure actual wall-clock time. It is not virtualized in the time namespace, because of the complexity and performance overhead involved in virtualizing it, and because the use case isn't that compelling; that is why the time namespace doesn't support virtualizing CLOCK_REALTIME. A new time namespace is created by calling the unshare system call with the CLONE_NEWTIME flag. When you do the unshare, the caller itself won't be living in the new time namespace; it is the new children created afterward that go into the new time namespace. The way the kernel maintains the new time namespace is by keeping an offset from the kernel's internal time. That offset is maintained in what we call a vvar page, used by the vDSO (virtual dynamic shared object) that the kernel maintains and user-space applications use. Time namespace support is currently only available for x86 and arm64; support for the other architectures hasn't been submitted upstream yet. So those two are the ones currently supported by the time namespace.
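A quick way to see the offset mechanism from user space is the `unshare` wrapper from util-linux (version 2.36 or later), which needs root; the one-day shift below is an arbitrary example value:

```shell
# Uptime as seen from the initial time namespace.
cat /proc/uptime

# Create a new time namespace with CLOCK_BOOTTIME shifted forward one day.
# --fork is needed because the caller itself stays in the old namespace;
# only its children enter the new one.
sudo unshare --fork --time --boottime 86400 cat /proc/uptime

# The per-namespace offsets (clock name, seconds, nanoseconds) live here;
# all zeros in the initial namespace:
cat /proc/self/timens_offsets
```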
Actually, the major motivation for adding the time namespace is to allow the monotonic and boottime clocks to maintain consistent values during container migration and checkpoint/restore. That is the reason we have this new feature, and it is there to support containers. So you can see that all this upstream work is done to provide better support for the container ecosystem. And that is the end of my presentation, just at the 20-minute mark.

Oh, there is one question from our attendees, from David: is there any significant variation in the implementation from x86 to arm64? Maybe you could give me a bit more context: are you referring to the time namespace? David just replied: yes, the time namespace. Yes, there are some significant differences in the implementation. They are all done within the vDSO layer. What they need to do is create dedicated pages that contain the offsets for the different time namespaces, so when you switch to a new time namespace, you swap in the page corresponding to that namespace. The actual implementation differs a little bit, because the architecture-specific code for time management is a little bit different. Within each architecture directory, there is its own code to manage the mapping, but then a common set of core code handles all the rest. So there is some architecture-specific code for each architecture, and the rest is handled by the common code.