I think it's about time to start. Good morning, everybody. I know today is a quiet morning, I hope you enjoy it. Okay, what I'm going to talk about is the state of NUMA support within the Linux kernel. By NUMA, what I'm talking about is the fact that the number of CPUs, or the number of cores per socket, keeps increasing. Right now the maximum number of cores per socket for Intel is 28, and for AMD it's now up to 32. So as we put more sockets on the system, we are getting more and more cores, and with SMT you have two threads per core, so you double the number of threads. For instance, in the case of HPE, the Superdome X system can support up to 32 sockets, though the biggest one you can buy right now is probably about 16 sockets. But still, if you count the maximum config with 32 sockets, you can have up to 896 cores, which translates to 1,792 threads per system. With such a massive number of cores and threads spread over multiple sockets, the application has to be aware of the NUMA nature of the system in order to fully utilize it. In fact, with such a big system, you may not even be able to boot up the Linux kernel without problems. But that problem, I think, is largely solved with the recent changes in the locking subsystem to make it easier to support such large configurations. In this presentation, I will talk about three major challenges that we have for applications and for the Linux kernel in supporting such a large system. The first one is data locality. Then I will talk a little bit about huge pages. And finally, we'll talk about cache line contention.

So what is NUMA? NUMA is basically a way of designing systems where, instead of the old design with a shared bus, where all the CPUs communicate over the shared bus and the memory is attached to that bus, each socket has its own local memory. This has changed over the last few decades to the NUMA configuration. So different memory, depending on where it is placed, has different pathways to the different CPUs, which means different access times. The reason for that is simply performance. With a shared bus, everything sits on the same shared bus, which limits the frequency and the speed with which you can access the information. With a dedicated bus on each socket, transactions can be done much quicker. But the downside is that if you need to access remote data that is far away on another node, it will take a longer time.

In this talk, I will use some terms that I want to define here to make sure you understand what I am talking about. A socket refers to a physical processor chip sitting in a given processor socket. A node refers to an entity that contains CPUs with local memory attached, and that is connected to all the other nodes via an inter-processor link like HyperTransport or QuickPath. The two terms are not the same, because you can have more than one node within a single socket. Usually each socket is just one node, but there are CPUs where that is not true. An example is the AMD EPYC CPU: it's actually four chips on a single socket. If you take off the lid of the CPU, you will see four pieces of separate silicon on the same substrate, connected by an on-package link, and each of the chips connects to its own memory channels. That's why in the case of the EPYC CPU, one single socket is actually four nodes in one.
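As an aside of my own (not part of the talk), the node layout described above can be inspected programmatically. Below is a minimal sketch using libnuma; it assumes libnuma is installed and the program is linked with -lnuma, and the distance values it prints are the relative access costs (10 means local, larger means further away).

```c
/* Sketch: print the NUMA node layout of the machine using libnuma.
 * Build with:  gcc numa_topo.c -o numa_topo -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int max_node = numa_max_node();
    printf("nodes: %d\n", max_node + 1);

    for (int n = 0; n <= max_node; n++) {
        long long free_mem;
        long long total = numa_node_size64(n, &free_mem);
        printf("node %d: %lld MB total, %lld MB free\n",
               n, total >> 20, free_mem >> 20);

        /* Relative access cost from node n to every other node. */
        for (int m = 0; m <= max_node; m++)
            printf("  distance %d -> %d: %d\n", n, m, numa_distance(n, m));
    }
    return 0;
}
```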
A core refers to a physical CPU with its own L1 and L2 caches. The L3 cache is usually shared among the different cores, but L1 and L2 are dedicated to each core. And a logical CPU refers to an instruction stream that, from the operating system point of view, is one single CPU. We call it a logical CPU because it's not exactly the same as one core. In the case of x86, we support up to two threads per core. Other architectures support four or sometimes even eight; I know that for PowerPC, I think POWER8 can support up to eight threads per core, but it is a setting that you can select, I think, at the firmware level, four threads per core or eight. The more threads you select, the slower each thread will be, basically.

Okay, data locality. For maximum performance, you would like all the data of your application to be in memory on the same node, attached to the CPUs in that node, without going through the inter-processor link to a remote node. The reason is that within the same node, the access times are usually much quicker. For some of the memory-intensive benchmarks that we run, if the data is on the local node, the speed can be up to three times faster than if the data is on a remote node. And if you install some of the enterprise applications like the Oracle database, they have many knobs to control their behavior, and many of those tuning knobs are actually for controlling data locality.

Another thing is that the scheduler in the Linux kernel will try to move tasks around to balance the load. So if most of the tasks are on one node but the other nodes are idle, the scheduler may move some of the tasks from one node to another in order to balance the load. But the side effect is that data that was once local will now become remote. There are two ways you can deal with this kind of migration between nodes. The Linux kernel has a feature called automatic NUMA balancing (AutoNUMA): when you turn it on, the kernel will, on a best-effort basis, try to move the data together with the task as much as possible. So when the scheduler decides to move a task from one node to another, the kernel will also try to migrate the data accessed by that task to the other node, again on a best-effort basis. The other way is manual control. Because of its automatic nature, AutoNUMA may not give the best possible performance; if you manually control where the tasks and the data are placed, the performance is usually better than relying on the AutoNUMA feature. That's why for most enterprise applications, when they have a large developer team to support you, they can tune the application much better than what can be done by the Linux kernel itself. But that requires experience and expertise, so if you don't have a large support group to manage and tune the application, you may not get as good performance as another organization that has a bigger support staff to do this kind of tuning.

For instance, if you have a four-node system, you can run a single database instance across all the nodes, or you can split it into four separate instances and run each instance on one of the nodes exclusively. You will find that the four-instance configuration performs much better than the single-instance one, exactly because of the data locality problem.
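As a hedged illustration of that kind of manual placement (my own sketch, not something shown in the talk), this is roughly how an application could pin itself to one node and allocate its working set from that node's local memory using libnuma. The same effect can be had from the command line, without changing the program, with something like numactl --cpunodebind=0 --membind=0.

```c
/* Sketch: keep both the task and its memory on node 0 using libnuma.
 * Build with -lnuma; error handling kept minimal for brevity. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

#define BUF_SIZE (64UL << 20)   /* 64 MB working set (arbitrary size) */

int main(void)
{
    if (numa_available() < 0)
        return 1;

    /* Restrict this task (and its children) to the CPUs of node 0. */
    if (numa_run_on_node(0) != 0)
        perror("numa_run_on_node");

    /* Allocate the working set from node 0's local memory. */
    char *buf = numa_alloc_onnode(BUF_SIZE, 0);
    if (!buf)
        return 1;

    memset(buf, 0, BUF_SIZE);   /* touch the pages so they are really allocated */
    /* ... do the memory-intensive work with only local accesses ... */

    numa_free(buf, BUF_SIZE);
    return 0;
}
```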
So depending on your use case, you may opt for a multi-instance database instead of a single instance spanning all the nodes.

Another issue that we found when tuning systems for NUMA is huge pages. All modern operating systems use virtual memory to translate virtual addresses to physical addresses, and the translation is done via a page table. In the case of x86, the page table is four levels for the current crop of CPUs. The new ones that are coming, I think next year, Ice Lake, will have a five-level page table and support up to 52 bits of physical address. Right now we only support up to 46 bits of physical address, which translates to 64 terabytes; in the future, with the five-level page table, we will support up to 4 petabytes of memory. For x86, a page can be either 4K, 2MB, or even 1GB. In order to speed up the translation process, there is a kind of cache within the CPU called the TLB, the translation lookaside buffer, which caches these virtual-to-physical translations. If you access an address that is already cached, the TLB can immediately give you back the physical address to go and fetch the data item, instead of going through the translation process again. The translation itself has to access four separate page table levels, so that's four additional memory accesses, which of course slows down everything the computer is doing. But there is a limit on how large the TLB can be, so you can't have too many entries in the TLB. Because of that, if the application has a large data set and accesses it from many different locations, it's possible that not all of the translations can be cached in the TLB. On a TLB miss, you first have to evict a TLB entry, then do the translation and load the new mapping into the TLB, and that slows things down pretty significantly.

So in order to allow the TLB to cover a larger address space, you need to use huge pages. The kernel provides two mechanisms for you to utilize huge pages. The first one is transparent huge pages (THP). With that, the kernel manages the huge pages for you: it looks at the data pages that you touch. It uses 4K pages by default, but if it sees that you are accessing quite a number of 4K pages within a bigger 2MB region, it will combine the smaller 4K pages into a 2MB huge page and use one entry in the TLB instead. But then if you try to free memory within that huge page, the kernel has to break it back up into the individual 4K pages, so there is some overhead involved. The other way to manage huge pages is using what is called the hugetlbfs file system, in which it is the application's job to manage the huge pages. The application tells the operating system which memory addresses should use huge pages, and the kernel will then map those addresses with huge pages directly; it won't break them up and it won't do other things behind your back. On the kernel side it's much easier, because it just does what the application asked, but then the application has to know what it is doing. If it doesn't use huge pages correctly, it can hurt performance, because with huge pages, if you don't use up all the memory within a huge page, you are just wasting memory. In fact, many enterprise applications recommend that you turn off THP, because they can do a better job than THP by using huge pages directly, and having THP enabled just adds some extra overhead that slows things down.
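As a rough sketch of my own (not from the talk) of the two mechanisms just described: the first path hints the kernel that a range should be backed by transparent huge pages, the second maps explicit huge pages from the reserved pool (the mechanism behind hugetlbfs). It assumes a 2MB huge page size and, for the explicit case, that huge pages have been reserved, for example via /proc/sys/vm/nr_hugepages.

```c
/* Sketch: the two ways an application can get huge pages on Linux. */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdlib.h>

#define SZ (16UL << 21)           /* 32 MB region, a multiple of 2 MB */

int main(void)
{
    /* 1) Transparent huge pages: allocate 2MB-aligned memory and hint
     *    the kernel that this range should be backed by huge pages.
     *    The kernel may or may not honor it (best effort). */
    void *thp;
    if (posix_memalign(&thp, 2UL << 20, SZ) == 0)
        madvise(thp, SZ, MADV_HUGEPAGE);

    /* 2) Explicit huge pages: the mapping is built from reserved huge
     *    pages directly, and fails if none have been reserved. */
    void *huge = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (huge == MAP_FAILED)
        return 1;                 /* e.g. /proc/sys/vm/nr_hugepages is 0 */

    /* ... use the memory ... */
    munmap(huge, SZ);
    free(thp);
    return 0;
}
```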
That's why they sometimes recommend turning off THP and letting the application handle the huge pages itself.

Another thing I want to talk about is cache line contention. Before talking about cache line contention, I would like to talk a little bit about the cache coherency protocol. Modern CPUs usually have three levels of caches. There are some exotic CPUs that may have up to four levels; I know that some of the big IBM systems use multi-chip modules, with maybe two chips on a socket and an extra fourth level of cache. But most commercial CPUs have only three levels of cache. On a NUMA system with multiple sockets, we need to make sure that the contents of the caches are consistent: if one cache line is being accessed by multiple sockets, they have to see the same data, otherwise the program will not behave correctly. A cache line in the case of x86 is 64 bytes. I think there are some systems that use 128-byte cache lines, but these days most of them use 64 bytes as the cache line size. Most cache coherency protocols support four or five different states. In the case of Intel CPUs, they use the MESIF protocol: Modified, Exclusive, Shared, Invalid, and Forward. AMD uses a slightly different one, but the principle is about the same. When multiple CPUs are trying to access the same cache line across multiple nodes, the cache coherency protocol makes sure that the content of the cache stays consistent.

One thing I want to make you aware of in the coherency protocol is what happens when you try to write to a cache line. If the cache line is being shared by multiple nodes, when you do a write, the cache coherency protocol has to send an invalidate message to all the other nodes to invalidate their copies of the cache line, so that only one node has exclusive access to it. Only once its state changes to exclusive can you write to the cache line. So if you have multiple nodes trying to write to the same cache line, you create a situation where the ownership of the cache line moves around between the nodes, a situation we call cache line bouncing. That uses quite a lot of bandwidth on the inter-processor links, which limits the bandwidth available for other types of memory access and also increases latency.

Now, talking about cache line contention: on a large NUMA system with multiple nodes, cache line contention can be a serious performance bottleneck. There are two types of cache line contention that we usually talk about. The first type is called false cache line sharing. That's the case where multiple CPUs, or maybe multiple sockets, access different data items that happen to be in the same cache line. In that case you still create a cache line contention problem: CPU 1 accesses data item A, CPU 2 accesses data item B; they are different data items, but they are in the same cache line, and when they try to modify the values it creates a problem. False cache line sharing is usually hard to find, but we have tools available to pinpoint this kind of problem, like perf c2c. It will pinpoint where the cache line contention happens, and you can look at the addresses of the shared data. If the addresses are different, they are actually different data items that just happen to sit at different offsets within the same cache line, so you know this is false cache line sharing.
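As an aside of my own (not from the talk), here is a minimal C sketch of what the typical fix looks like, assuming 64-byte cache lines: once perf c2c shows two unrelated hot fields landing in the same line, they can be split onto separate lines by alignment or padding.

```c
/* Sketch: avoiding false sharing by giving each hot field its own cache line. */
#include <stdalign.h>

#define CACHE_LINE 64   /* typical x86 cache line size */

/* Problematic layout: 'a' and 'b' are independent counters updated by
 * different CPUs, but they share one 64-byte cache line, so every update
 * bounces that line between the writers. */
struct stats_bad {
    unsigned long a;    /* written by CPU 0 */
    unsigned long b;    /* written by CPU 1 */
};

/* Fixed layout: each field is aligned to its own cache line, so the two
 * writers no longer invalidate each other's copies. */
struct stats_fixed {
    alignas(CACHE_LINE) unsigned long a;
    alignas(CACHE_LINE) unsigned long b;
};
```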
What you can do then is move the data items apart so that each is in its own cache line instead of being packed together in the same cache line; that takes some tuning in your application to move them away from each other.

In the Linux kernel, the data items most likely to be contended by multiple CPUs are locks and counters. Many counters in Linux have been switched to per-CPU counters: there are multiple counters, one for each CPU, all counting the same thing. When you want to look at the value of the counter, say by reading a sysfs or procfs file that exposes the counter data, what the kernel does is sum up the counters for each of the CPUs to give you the final result. Reads are usually much less frequent than writes, so the use of per-CPU counters avoids the cache line contention problem while still allowing you to occasionally read the counter value.

The other data type with frequent cache line contention problems is the lock. Lock contention slows things down because, in the case of an exclusive lock, only one CPU can own the lock and the rest have to wait for it, so you basically serialize your application at the lock contention point. The level of lock contention can be reduced by breaking up a coarse-grained lock into multiple finer-grained ones, but in many cases that is easier said than done; depending on the kind of lock, it may not be easy to break it up into smaller ones. On the other side, there are two types of locks: spinning locks and sleeping locks. In the case of a spinning lock, when there is contention, it also means that the CPUs will be spinning on the lock cache line, and that causes the cache line contention problem that I talked about previously.

Lock contention used to be a serious problem. In fact, when I began working on Linux, my job was to improve the performance of applications on large NUMA systems. The first system that I worked on was a large multi-socket machine, and we saw the effect of lock contention, and it was detrimental to performance on that kind of large system. So we worked on the locks themselves to make lock contention less of a serious problem, and I can say that we have largely solved this kind of lock cache line contention problem by avoiding spinning on the lock cache line. If you have multiple waiters waiting on a lock, in the past they would all spin on the lock cache line to see when it became available. Today, only one of the waiters spins on the lock; the rest spin on their own local cache lines, thus avoiding this kind of lock cache line contention problem.

A major building block in the locking infrastructure in recent years is the use of the MCS lock. It's a locking algorithm that allows scalable lock synchronization on SMP systems, and slightly different variants of the MCS lock are used internally by the different locks in the kernel. The qspinlock is a new spinlock algorithm that was merged into the 4.2 kernel. With qspinlock, lock waiters are put into an MCS queue. The MCS lock itself serves as a wait queue, so you have a queue head and a queue tail: any newcomer goes onto the queue tail and waits in the queue. The queue head spins on the lock cache line, but the rest spin on their own local cache lines, so there won't be too much contention on the lock cache line.
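To make the "spin on your own queue node" idea concrete, here is a minimal user-space MCS lock sketch of my own in C11 atomics. It is only an illustration of the principle, not the kernel's qspinlock implementation, which packs the queue state into a 32-bit lock word and has many more optimizations.

```c
/* Sketch: a basic MCS lock where each waiter spins on its own node. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;                /* true while this waiter must spin */
};

struct mcs_lock {
    _Atomic(struct mcs_node *) tail;   /* last waiter in the queue, or NULL */
};

static void mcs_lock_acquire(struct mcs_lock *lock, struct mcs_node *self)
{
    atomic_store_explicit(&self->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&self->locked, true, memory_order_relaxed);

    /* Add ourselves to the tail of the wait queue. */
    struct mcs_node *prev =
        atomic_exchange_explicit(&lock->tail, self, memory_order_acq_rel);
    if (prev == NULL)
        return;                        /* queue was empty: lock acquired */

    /* Link behind the previous waiter and spin on our OWN node,
     * not on the shared lock word, to avoid cache line bouncing. */
    atomic_store_explicit(&prev->next, self, memory_order_release);
    while (atomic_load_explicit(&self->locked, memory_order_acquire))
        ;                              /* cpu_relax() in real code */
}

static void mcs_lock_release(struct mcs_lock *lock, struct mcs_node *self)
{
    struct mcs_node *next =
        atomic_load_explicit(&self->next, memory_order_acquire);

    if (next == NULL) {
        /* No known successor: try to reset the tail to empty. */
        struct mcs_node *expected = self;
        if (atomic_compare_exchange_strong_explicit(
                &lock->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* A new waiter is enqueueing; wait for it to link itself. */
        while ((next = atomic_load_explicit(&self->next,
                                            memory_order_acquire)) == NULL)
            ;
    }
    /* Hand the lock to the successor; it spins on its own 'locked' flag. */
    atomic_store_explicit(&next->locked, false, memory_order_release);
}
```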
Then we have the qrwlock in the kernel, which was merged earlier, in the 3.16 kernel. The qrwlock is actually a fair lock. The previous version of the rwlock code was an unfair lock, and an unfair lock has the problem that if there is a lot of contention, some of the CPUs may never get the lock, or may have to wait a very long time to get it. In fact, one of the main problems of booting up a large system with many CPUs was that some of the CPUs might never get a particular rwlock; the most contended rwlock in the kernel there is the tasklist_lock. During the boot process the tasklist_lock is quite contended, because you have to spawn a lot of kernel threads and a lot of user threads. So it was possible that during the boot process, because some CPUs couldn't get the lock within a certain amount of time, you would get a soft lockup message printed by the kernel, and in some cases even a hard lockup that crashes the system. The qrwlock is a fair lock, and when combined with the qspinlock, it provides a fair rwlock without the lock cache line contention problem.

Then for the mutex, which is a sleeping, mutually exclusive lock: lock waiters are allowed to spin on the lock as long as the lock owner is running. So it actually looks at the state of the lock owner, and if the owner is running, it spins on the lock. But before the 3.10 kernel, everybody waiting on the lock would spin on the lock cache line, so we had the same lock cache line contention problem. In the 3.10 kernel, a patch was merged to make the waiter queue based on the MCS lock. That means only the queue head spins on the lock; the waiters behind it, again, spin on their own cache lines, and this solved the lock cache line contention problem. That queuing code has since been enhanced and extracted into what we call the optimistic spin queue code in the kernel, which is now also shared by the rwsem. Around the 3.14 or 3.15 timeframe, you would find that when some of the mutexes were converted into rwsems, the performance might actually deteriorate, because the rwsem didn't have the optimistic spinning code. So the rwsem also adopted the optimistic spin queue, since the 3.16 kernel, in order to achieve performance parity with the mutex.

Now I just want to show you some data that I have about the effect of lock cache line contention. The following two graphs show the locking rate of a micro-benchmark using both the queued spinlock and the ticket spinlock, running on a 16-socket, 240-core HPE system. The Y-axis shows the number of locking operations per second, and the X-axis is the number of cores, that is, the number of active threads contending on the lock. You can see that with the queued spinlock, the performance more or less stabilizes around the same value, while in the case of the ticket lock, it actually deteriorates because of the increasing lock contention. You see a drop here because it's transitioning from a single node to a second node: there are 15 cores per socket, so up to here everything is within one socket, but once you go to 16 threads, the extra thread comes from the second socket, and that causes a deterioration in performance for both the queued spinlock and the ticket lock, simply because there is a much bigger latency involved in the acquisition of the lock. And this graph extends the X-axis up to all the cores in the system.
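For contrast with the MCS-style queueing shown earlier, here is a minimal ticket spinlock sketch, again my own C11 illustration rather than the kernel code. Every waiter polls the same "owner" word, which is exactly the shared cache line that gets bounced around as contention grows, and which is why the ticket lock curve in the graph keeps dropping.

```c
/* Sketch: a basic ticket spinlock where all waiters share one cache line. */
#include <stdatomic.h>

struct ticket_lock {
    atomic_uint next;    /* next ticket number to hand out */
    atomic_uint owner;   /* ticket currently being served */
};

static void ticket_lock_acquire(struct ticket_lock *lock)
{
    unsigned int ticket =
        atomic_fetch_add_explicit(&lock->next, 1, memory_order_relaxed);

    /* Every waiter spins on the SAME 'owner' word, so each release
     * bounces that cache line to all spinning CPUs. */
    while (atomic_load_explicit(&lock->owner, memory_order_acquire) != ticket)
        ;                /* cpu_relax() in real code */
}

static void ticket_lock_release(struct ticket_lock *lock)
{
    atomic_fetch_add_explicit(&lock->owner, 1, memory_order_release);
}
```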
Coming back to the graph: when all the cores in the system are spinning on the same lock, with the ticket spinlock the locking rate is down here, and for the queued spinlock it's up here. It's about a four to five times difference, I think. So you can see that with lock cache line contention, there is a serious performance impact.

What is the number of locking operations per second? It's just the number of lock and unlock operations per second. The micro-benchmark just acquires the lock, releases it, and acquires it again, repeatedly, and it shows how many such operations can be done by all the threads on the system. Higher is better? Higher is better, yes.

Okay, conclusion. The Linux kernel is doing a pretty good job of managing data locality itself for its own data structures. But applications have to know how to manage it properly, otherwise they can't fully utilize all the resources within a NUMA system. For applications that are not NUMA-aware, we can use tools like numactl or cpusets to limit the application to run on just one node, or on a number of CPUs within a single node; that will largely solve the data locality problem. Work is still ongoing on the transparent huge page side to make it more useful for general users, but it will take quite a while for those improvements to come, so I expect that in the next few years we will still be developing transparent huge pages. And cache line contention is still an ongoing problem: there is a lot of code in the kernel, and depending on the workload, some contention problem may show up which we need to address once it is found. But lock cache line contention, I think, is mostly solved, though we are still doing some additional tuning and performance enhancement in the locking code. Okay, that ends my presentation. Do you have any questions? Yeah, I think there is time. Okay, maybe one question.

So, you mentioned some tools to do some measurements. Are there, for instance, on-chip counters that would give you an idea of how much latency you have for memory accesses across the interconnect? Is that something that is easy to measure today?

Could you repeat the question? The question is about the perf tool I talked about, which diagnoses the application to find out whether there is any cache line contention. The perf tool is the most common one that I use. I know there are other tools out there that can also do this kind of measurement, like Intel VTune.

But my question was, I think the tool you mentioned was mostly about the local caches. My question was whether there is a way to measure the fraction of your memory accesses that go to some other NUMA node, and the latency that you get from that.

Actually, the tool will show you, I think, the latency on the different NUMA nodes. The output of perf c2c is very wide, and it will show you, for each cache line, how many hits you have on each of the nodes, so you will know how many accesses are remote and how many are local. I would highly recommend you try out perf c2c if you are interested in this kind of problem.

And is this the kind of measurement that AutoNUMA uses, or is it something else?

It's just like a regular perf tool: it records for a certain period of time and then generates a report on all the cache lines with memory access patterns that exceed a certain threshold. So you can...
Yeah, I would highly recommend you try it out and see whether you like it or not. Okay, thank you very much for your time.