Hello everyone, my name is André Almeida. I am a kernel engineer at the open source consultancy Collabora, and I'm here today to talk about futex2, a new futex syscall. Today we'll cover, first of all, what futex is and how it works internally on the kernel side, then some limitations that were found, and how we want to fix these limitations with a new syscall, futex2.

Futex means fast userspace mutexes. It is a syscall that gives user space the means to create several synchronization mechanisms, like mutexes, condition variables, barriers, and semaphores. The semantics on the kernel side are very simple; all the logic should be implemented in user space. In the mutex case, for instance, if you didn't find the value that you were expecting, like lock free, you just sleep and wait. On the other side, if you are releasing the lock, you do a futex wake, so you wake some waiter, someone that is waiting for the lock. The syscall aims to be very fast, so no syscall is needed in the uncontended case. This is very important nowadays for some platforms, since after a lot of hardware bug mitigations doing a syscall is very expensive.

Here we can see an example of how to implement a very basic mutex using futex. The first thing we do is create a 32-bit integer that will hold a value, and this value will represent the state of the lock. We initialize it with lock free, and then on the lock side we try to do a compare-exchange operation atomically. We want to replace the lock free value with the lock taken value. If you manage to do that, it means that the lock was free and this thread was able to take it. So we don't need to call futex, we just continue the execution of the code. However, if the compare-exchange fails, that means that the lock was taken. So now we need to call the futex syscall.
The first argument is the address of our value, because this is how we identify futex objects inside the kernel: by this address. Using the address you can uniquely identify any futex for any process. The second argument is the operation that you want to do. In this case it is futex wait, because you want to sleep while waiting. The third argument is the expected value. Here it is lock taken. What does that mean? The value may have changed between your compare-exchange attempt and the moment you issue the syscall. Since this can happen, and we don't want to sleep when the lock is free, the kernel checks before sleeping that you are sleeping on the right condition, that is, that the value is still lock taken.

So this is the lock side. On the unlock side, the first thing you do is atomically set the value to lock free, and then you call futex again, but this time with the futex wake operation. You can see that you are using the same address, because this is how we identify the futex. Now the third argument is no longer the expected value; it is the number of waiters that you want to wake. In the mutex example the only value that makes sense is one. If you wake more than one, they will just fight for the lock: only one of them will get it and continue, and the ones that didn't get the lock will sleep again. So this would just waste some processor time. But for other operations, like implementing a barrier, you can use the maximum number of threads in your system, because you want to wake everyone at the same time. This example can also be optimized to call futex wake only if you know there is really someone waiting.

So now let's look at the interface. This is the complete interface and, as you can see, it's a multiplexed one. That means that all operations happen in the same syscall.
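As a rough sketch, the mutex walked through above could look like this in C. The names `simple_mutex`, `LOCK_FREE`, `LOCK_TAKEN`, and the small `futex()` helper are assumptions made for illustration; only `SYS_futex`, `FUTEX_WAIT`, and `FUTEX_WAKE` come from the kernel interface. It deliberately omits the optimization of skipping the wake when nobody is waiting.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Illustrative state values; the names are assumptions, not kernel constants. */
enum { LOCK_FREE = 0, LOCK_TAKEN = 1 };

typedef struct {
    _Atomic uint32_t state;   /* the 32-bit word the kernel uses to identify the futex */
} simple_mutex;

/* Thin helper around the raw syscall; the unused arguments are passed as NULL/0. */
static long futex(_Atomic uint32_t *uaddr, int op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void simple_mutex_lock(simple_mutex *m)
{
    for (;;) {
        uint32_t expected = LOCK_FREE;
        /* Uncontended case: the compare-exchange succeeds and no syscall is made. */
        if (atomic_compare_exchange_strong(&m->state, &expected, LOCK_TAKEN))
            return;
        /* Contended case: sleep, but only if the word still reads LOCK_TAKEN.
         * If it changed in the meantime the kernel returns immediately
         * and the loop simply retries the compare-exchange. */
        futex(&m->state, FUTEX_WAIT, LOCK_TAKEN);
    }
}

static void simple_mutex_unlock(simple_mutex *m)
{
    atomic_store(&m->state, LOCK_FREE);
    /* Wake at most one waiter; waking more would only make them race for the lock. */
    futex(&m->state, FUTEX_WAKE, 1);
}
```

A caller would declare `simple_mutex m = { LOCK_FREE };` and bracket its critical section with `simple_mutex_lock(&m)` and `simple_mutex_unlock(&m)`. The retry loop in the lock path also covers spurious wakeups.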
The first argument is the user address, the address of the integer that holds the state. The second argument is the operation code, and you can also add some flags here if you want to change the behavior. For instance, if you are writing a multi-threaded application where the threads share the memory space, you can use the FUTEX_PRIVATE_FLAG flag to allow some access optimizations. However, if you are creating a multi-process application you can't do that, because you probably need to create a shared memory variable. The value argument in the wait case is the expected value, but in the wake case it is the number of threads you want to wake. The timeout argument is for when you don't want to wait forever: you can add a timeout to the futex wait. There are also the user address 2 and val3 arguments, which are not covered here because they are for other, more complex operations.

Futex is from 2002. It was created by Rusty Russell when he was working at IBM, but for a long time it has been maintained by Thomas. Also, glibc doesn't expose a nice wrapper, for a few reasons. The first reason is that it's not easy to create a nice wrapper that checks the semantics and the types of the arguments when you are working with a multiplexed interface. The second is that futex wasn't really meant to be used by a lot of developers. Futex was meant to be used only by the developers of core parts of user space, like C libraries. But there are other users of futex as well. For instance, you may have a corner case that is not covered by the pthread implementation, or you may be building a compatibility layer, like emulation tools. So if you want to use futex, you will need to use the syscall function of glibc together with the number of the futex syscall.

So now I want to talk about how futex works on the kernel side.
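Before moving to the kernel side, here is a minimal sketch of calling futex through glibc's `syscall()` with all six arguments spelled out. The helper names `futex`, `wait_private`, and `wake_private` are assumptions for illustration; the constants come from `<linux/futex.h>`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Raw multiplexed call: the meaning of val, timeout, uaddr2 and val3
 * depends on the operation encoded in futex_op. */
static long futex(uint32_t *uaddr, int futex_op, uint32_t val,
                  const struct timespec *timeout, uint32_t *uaddr2, uint32_t val3)
{
    return syscall(SYS_futex, uaddr, futex_op, val, timeout, uaddr2, val3);
}

/* Wait on a futex that is only shared between threads of one process.
 * Returns 0 when woken, or the errno value describing why the call returned:
 * EAGAIN (word did not hold `expected`), ETIMEDOUT, or EINTR (signal). */
static int wait_private(uint32_t *uaddr, uint32_t expected,
                        const struct timespec *timeout)
{
    long r = futex(uaddr, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, expected,
                   timeout, NULL, 0);
    return r == -1 ? errno : 0;
}

/* Wake up to nr waiters; returns how many waiters were actually woken. */
static long wake_private(uint32_t *uaddr, uint32_t nr)
{
    return futex(uaddr, FUTEX_WAKE | FUTEX_PRIVATE_FLAG, nr, NULL, NULL, 0);
}
```

With `uint32_t word = 1;`, calling `wait_private(&word, 0, NULL)` returns `EAGAIN` right away because the word does not hold the expected value, while `wait_private(&word, 1, &ts)` with a short relative timeout in `ts` sleeps and then returns `ETIMEDOUT` if nobody wakes it.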
For the wait, the first thing you do is check whether the value at the user space address matches what the user is expecting. If it does not, you just return immediately with an error. However, if it does, you will sleep. Before doing that, you register yourself in the wait table so the waker thread will know where to find you. Then you are ready to sleep, and eventually you will wake. This can happen for several reasons. For instance, it could be just a spurious wake, because the task scheduler picked you to do some work but you don't have any work to do, so you just sleep again. It could be a timeout. In this case, you need to wake, remove yourself from the table, and exit. It could also be because this thread got a signal, for instance a SIGKILL or a SIGABRT, and then you need to exit the thread, so you remove yourself from the table and get back to user space. Or it could be the normal case, where someone did a futex wake and now you are awake and need to go back to user space.

On the other side, futex wake is very simple. You just go to the wait table and, for each futex that is at the same address as yours, you wake it, until you match the number of wakes that user space has asked for.

Now, this is a simple timeline of how futex works, in this case for a mutex. On the top, you have a thread that doesn't have the lock. It will do a futex wait, go into the kernel, and the kernel will deschedule the thread; basically, we put it to sleep. In the meanwhile, the thread on the bottom, which has the lock, just releases it. So it will call futex wake and go into the kernel, and the kernel will find out which thread to wake and will issue a wake-up operation. Then both threads will just exit from the kernel and continue their work.

And now let's see how the hash table works.
In the current futex implementation, you have a global hash table with a lot of hash buckets. When you ask for a wait, the hash function will assign you a bucket and you will be added to a waiting list there. For the same address, all threads will land in the same bucket, which makes the waker's life easier. However, you can also have hash collisions. In this case, different addresses will be in the same bucket.

So now let's have a look at some problems that the current interface has. The first thing to notice is that we haven't gotten any new feature in futex since 2008. This is because the code is very hard to modify and very tricky to maintain. This was said by the maintainers themselves. Futex is very important for all sorts of systems, and it's important for providing safe multi-threading and safe locking. So if there's a bug in futex, it will cause problems for a lot of people. Also, the current code has some legacy features that no one uses anymore, and it's kind of tricky to modify the code to add a new feature without breaking old features.

Besides that, we also have the problem of NUMA awareness. In the single-socket case, there is no problem with having a global hash table, because this hash table will be somewhere in memory and the CPU can easily access it. However, on a NUMA architecture we will have a lot of sockets, and the global hash table will need to live on some node. For every node that doesn't have the table, it will be very costly to get information from the table, since memory access is non-uniform.

Another problem that we have with the current interface is the lack of determinism for real-time users. As I showed before, we can have hash collisions in the table, and it's not easy, or not possible, for user space to know how many futexes are in the same bucket that we are operating on. That means that it's very hard to predict how much time a futex operation will take.
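To make the bucket behavior concrete, here is a toy user-space model of a wait table. It is only a sketch: the bucket count, the hash function, and every `toy_*` name are assumptions chosen for illustration, and the real kernel table uses a different hash, per-bucket locking, and actual sleeping tasks. The `scanned` counter shows why collisions hurt determinism: a wake has to walk past waiters that belong to other addresses.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NBUCKETS 4   /* tiny on purpose, so collisions are easy to produce */

struct toy_waiter {
    const void *uaddr;            /* futex address this waiter sleeps on */
    struct toy_waiter *next;      /* next waiter in the same bucket */
};

static struct toy_waiter *toy_table[TOY_NBUCKETS];   /* one global table */

/* Hypothetical hash: the same address always maps to the same bucket,
 * but different addresses may collide. */
static unsigned toy_bucket(const void *uaddr)
{
    return (unsigned)(((uintptr_t)uaddr >> 2) % TOY_NBUCKETS);
}

/* "Wait": register the waiter at the tail of its bucket. */
static void toy_wait_enqueue(struct toy_waiter *w, const void *uaddr)
{
    struct toy_waiter **pp = &toy_table[toy_bucket(uaddr)];
    w->uaddr = uaddr;
    w->next = NULL;
    while (*pp)
        pp = &(*pp)->next;
    *pp = w;
}

/* "Wake": unlink up to nr waiters that sleep on uaddr. Returns the number
 * woken and reports how many bucket entries had to be inspected. */
static int toy_wake(const void *uaddr, int nr, int *scanned)
{
    struct toy_waiter **pp = &toy_table[toy_bucket(uaddr)];
    int woken = 0;

    *scanned = 0;
    while (*pp && woken < nr) {
        (*scanned)++;
        if ((*pp)->uaddr == uaddr) {
            *pp = (*pp)->next;    /* remove from the bucket: this waiter is "woken" */
            woken++;
        } else {
            pp = &(*pp)->next;    /* collision: entry belongs to another futex */
        }
    }
    return woken;
}
```

With this toy hash, two `uint32_t` words that are `TOY_NBUCKETS` elements apart in an array land in the same bucket, so a wake on one of them has to step over waiters of the other. Passing `INT_MAX` as `nr` models waking everyone on that address.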
The fact that user space needs to provide a 32-bit integer is a hard requirement. You can't use other sizes of integers. If you could have 8-bit or 16-bit integers, it could help embedded systems that maybe don't have so much memory. Also, for the futex case you will probably use something like three values, so 8 bits is enough for that, and maybe you can fit more things in your cache. And 64-bit futexes could be useful if you want to wait on a pointer.

Here is a list of features that were attempted over the years but weren't merged. The first one is adaptive spinning futexes. We got two takes, one in 2010 and another one in 2016. The idea here is that if the kernel knows that the lock owner is running, maybe it's not worth it for the waiter to sleep. Maybe it is worth just spinning, so you can avoid the overhead of sleeping and the context switch.

The second feature, from 2016 as well, is attached futexes and the per-process hash table. This was made to try to solve the NUMA problem, because here each process will have its own table. That means that when you create a new process, the kernel will allocate some memory for the hash table at creation time. Since the process data is attached to the node where the process is running, you solve the memory locality problem.

Last year we also got variable size futexes. This was an attempt to fix the issue of fixed sizes that I just talked about; it would allow different sizes for futexes. We also got wait on multiple futexes, which I will explain later. And futex swap: we got this patch this year, and it was aimed at some specific producer-consumer workloads.

Futex wait multiple was developed by Collabora and Valve. The semantics are to wait on a lot of futexes at the same time, and as soon as the first of those futexes receives a wake, you will wake.
This operation can be found on other operating systems as well. For us, it's very useful for creating the emulation layer between Windows and Linux. The first step that Valve took was to use the eventfd interface to simulate this behavior. But unfortunately, eventfd doesn't scale so well with a lot of waiters. Also, some games could cause file descriptor exhaustion, since they could create a lot of lock objects. So the futex interface seemed the natural place to implement this, and that is what was done: using futex to implement these semantics. And we got some nice results. For instance, with Tomb Raider running over Proton, which is a compatibility layer that allows you to run Windows games on Linux, we got 4% less CPU utilization and 80% fewer calls to spin locks in the kernel. That means that we could allow the kernel to do some proper work instead of just spinning.

After talking about all those limitations, and about all those features that couldn't be merged into the old interface, this is the solution that was proposed by Thomas, Peter, and Florian on the mailing list: to create a new API from scratch. The first thing that you notice here is that the interface is not multiplexed anymore. This basically will make the life of the kernel developers and glibc developers easier. If you want to know more about the benefits of not creating a multiplexed syscall, I recommend you check out the 2020 Linux Plumbers Conference. There's a talk by Christian and Aleksa where they talk about extensible syscalls and why multiplexing a syscall led to some headaches in the past.

On this interface we will also have flags: one for NUMA, if you want to do a NUMA operation, and one for the size, because on the new interface you can choose the size of the futex. For instance, you can use an 8-bit futex and then tell the kernel, using a futex size 8 flag, that this is the size of your futex.
You can also use shared to tell the kernel that it is a shared futex, and a flag for the clock ID to specify whether you want to operate on the monotonic or the realtime clock.

Here is the interface for wake and wait. You can see that they're very similar to what we had before. For the wake, you have the address, the number of wakes that you want to perform, and the flags. On the wait side, again, the address, the expected value, the flags, and the timeout.

And here is the interface for the wait on multiple. It's called waitv because it is a vectorized wait. The first argument will be a pointer to an array of waiters. Each waiter will have the address, the expected value again, and the flags, because you can have different sizes in this array. Then you will have the number of waiters, the flags, and the timeout.

I didn't cover the compare-requeue operation here because it has somewhat different semantics, but basically it's about requeuing waiters from one address to another one. The important thing here is that this interface already has six arguments. That is the limit for syscalls on some architectures, so we can't add more arguments to this interface.

Now I want to explain how we are solving the NUMA awareness problem. For a non-NUMA-aware call, the user address will just point to an integer, like before. But for a NUMA-aware operation, the user address will point to a struct. As before, this address will be used to uniquely identify a futex. This struct will have two fields. The first one is the expected value and the second one is a hint. With the NUMA flag semantics, this hint can be minus one, which means that you want to operate on the current node where you are running. But you can also specify, from zero to the maximum number of NUMA nodes, the number of the node where you want to operate, because maybe you want to share a futex with another node, or maybe you were migrated.
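The NUMA-aware layout and the per-waiter entry described above could be pictured roughly as follows. This is only an illustration of the description in the talk: every type, field, and function name here is hypothetical, the field widths are guesses, and none of this is taken from an actual kernel header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical NUMA-aware futex: the user address points at this struct
 * instead of at a bare integer. Both members are naturally aligned. */
struct numa_futex_sketch {
    uint32_t value;   /* the futex word, compared against the expected value */
    int32_t  hint;    /* -1: use the node we are running on; otherwise a node number */
};

/* Hypothetical entry of the array passed to the vectorized wait.
 * Flags are per entry because each futex in the array may have its own size. */
struct waitv_entry_sketch {
    void    *uaddr;     /* address identifying the futex */
    uint64_t expected;  /* expected value (width is an assumption) */
    uint32_t flags;     /* per-futex flags such as size or shared */
};

/* Resolve which node's hash table an operation would use, following the
 * hint semantics from the talk. Returns -1 for an out-of-range hint. */
static int resolve_node_sketch(int32_t hint, int current_node, int nr_nodes)
{
    if (hint == -1)
        return current_node;
    if (hint >= 0 && hint < nr_nodes)
        return hint;
    return -1;
}
```

For example, on a four-node machine a thread running on node 2 with `hint == -1` would resolve to node 2, while `hint == 3` would select node 3 regardless of where the thread runs.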
So this is how we specify on which node you want to operate. Also, all the members of this struct will be naturally aligned. Before, we used to have a single global hash table, but now we will have local hash tables, one per socket. This will solve the problem of memory locality for NUMA architectures.

This is the interface that was proposed on the mailing list and also at the Linux Plumbers Conference, and this is the interface that I'm implementing. What I have done so far is the wait and the wake, and also the vectorized wait, and the timeout is working as well. What I'm working on right now is implementing shared futexes. What I have kind of done, but which needs more testing, is the NUMA awareness and the variable sizes. For the future, I need to do the compare-requeue operation. I will send the patches as soon as I have a lot of features together, because this will help us see whether the architecture is working, instead of sending small patches. So if you want to see the patches, watch the real-time mailing list closely because, as proposed by Steven at Linux Plumbers, the real-time tree will be a nice place to test and play with the new interface.

So thanks everyone for listening to me. If you want to get in touch, just send me a mail and we can talk about futexes. Thank you.