So, today, before I start, I will explain what a spinlock is, why it's so hard to implement in user land, and how we can possibly solve this.

We have locking mechanisms because if people want to write multithreaded applications, at some point the application needs to compete for a resource, and we want to make sure that this resource is accessible by only one thread at a time. For that we have a lot of different locking mechanisms; people are very creative. The most basic and infamous ones are mutexes and spinlocks, but we also have semaphores, barriers, read-write locks. On the kernel side we have a lot of different crazy things, like RCU, seqcounts, and per-CPU operations. And writing locks is so hard that we also have people who try to avoid dealing with locks at all and create lockless structures and algorithms.

For instance, if you're writing a multithreaded bank application, you probably want only one thread at a time changing an account balance, because otherwise, if you receive money from two people at the same time, maybe only one of the operations gets committed and we lose some money. This is what a very basic multithreaded application looks like.

I will start by explaining a little bit about mutexes, because I think it's the simplest locking primitive and the one most used in user space. Mutex means mutual exclusion: only one thread at a time has access to the critical section. On Linux, that means the thread that is waiting for the lock sleeps and is preemptible. So if you are waiting for a lock, the thread will sleep and give its time slice back to the kernel, so the kernel can use the CPU to do something else. On Linux we use futexes for that: we do a context switch and give the time back to the kernel.

So let's say we have two CPUs. CPU 0 takes the lock and does the work in the critical section, while CPU 1 does a context switch and goes to sleep, waiting for the lock to be free, so it can finally take the lock, work in the critical section, and release the lock. But there's a problem here, because usually the critical section is very short, while doing a context switch nowadays, after the CPU vulnerabilities, is very, very expensive. Moving from user land to the kernel and back can take a lot of time. So heavily contended cases with small critical sections can suffer with mutexes: instead of doing the work you are supposed to do, you waste a lot of CPU cycles just locking and unlocking and doing context switches.

For instance, let's say we have three threads, one on each CPU, doing the work, and as you can see in the example, the critical section is very small. CPU 0 takes the lock. CPU 1 wants to take the lock, but unfortunately it's too late, and it goes to sleep. Doing the context switch is so expensive that another thread, on CPU 2, can take the lock when it's freed, then CPU 0 can take it again, and so on, and CPU 1 only gets the lock way later. A thread can starve because of unfortunate scheduling.

So, why do we do a context switch at all? Maybe we can just spin instead of sleeping. A spinlock means you burn CPU cycles checking the lock and trying to take it, and this is probably the most basic spinlock ever.
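Something like this minimal sketch (my reconstruction with C11 atomics, not the slide's exact code):

```c
#include <stdatomic.h>

/* A minimal test-and-set spinlock. */
typedef struct {
        atomic_int locked;   /* 0 = free, 1 = held */
} spinlock_t;

static void spin_lock(spinlock_t *l)
{
        for (;;) {
                int expected = 0;
                /* Try to flip the lock from free to held. */
                if (atomic_compare_exchange_weak(&l->locked, &expected, 1))
                        break;   /* got it */
                /* Failed: just loop and try again, burning CPU cycles. */
        }
}

static void spin_unlock(spinlock_t *l)
{
        atomic_store(&l->locked, 0);
}
```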
So basically, you loop forever trying to take the lock. If you get the lock, good, you break. If not, you loop again and once more try to take the lock. You keep spinning and spinning, and this should be very fast: you can get the lock without a context switch, without the need to go to the kernel. Just get the lock and that's it. The ideal picture would be exactly that: as soon as the lock is freed, CPU 1 takes it, and off we go.

But it's not that simple, because user space has no say in what the task scheduler does. That means we can waste a lot of cycles spinning for something that is not even running. Let's say CPU 0 takes the lock, and the task scheduler decides to preempt this thread in the middle of the critical section to do some other work. Meanwhile, CPU 1 is spinning. But this is just a waste of CPU cycles, because you are spinning for nothing, for something that is not going to become available. And then you need the task scheduler to give the thread its time slice again, so it can finish the job and unlock the spinlock.

And there's an even worse scenario: you may be spinning in the way of the lock owner. Let's say we have thread A and thread B, both running on CPU 0. Thread A takes the lock, thread B is scheduled on the same CPU and starts spinning. Now you are spinning for something that is not running on any CPU, something that is not going to be released. You're just wasting CPU cycles, and you're in the way of thread A, so you are making it even less likely that the lock will be freed.

In the kernel, we can implement locking mechanisms in a totally different way, because in the kernel we have all the resources in our hands. We know exactly which threads are running and which threads are sleeping, and we know on which CPU the threads are running. You can disable preemption, so you can make sure your thread will run for the whole critical section, and you can check whether the lock holder is running or not. And for that to work well, we have some rules for using spinlocks in the kernel: you can't sleep while holding a spinlock, which means you need to disable preemption and interrupts, and you need to keep the critical section as small as possible.

So, back to user land. We would like to spin only when the lock holder is running, when we are sure that the lock holder is making progress. But there is currently no mechanism on Linux for user space to ask whether a given thread is actually running or not, and we can't disable preemption. These are the challenges for implementing this.

So we need a way to check whether a given thread is running, but talking to the kernel is really expensive. If we do a syscall, or if we add something in procfs, then as I said before, the context switch is very expensive. If we create a syscall for checking whether a given thread is running or not, that check might cost more than the critical section itself, and we are not really solving the problem. So the question is: is there a cheap way to ask the kernel about the state of a given thread? Let's see.

Now I want to introduce restartable sequences. Restartable sequences were created to solve a somewhat similar problem. The idea was from Paul Turner, but the implementation is from Mathieu Desnoyers. It's a Linux kernel-user API implemented by the rseq() syscall, and a simplified sketch of the structures involved is right below.
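A simplified view of those structures, with field names following the rseq UAPI in linux/rseq.h (the real structures have a few more members and alignment constraints):

```c
#include <linux/types.h>

/* Per-thread area registered with the rseq() syscall (simplified). */
struct rseq {
        __u32 cpu_id_start;
        __u32 cpu_id;    /* kernel keeps this up to date: the CPU we run on */
        __u64 rseq_cs;   /* pointer to the active critical-section descriptor */
        __u32 flags;
};

/* Descriptor of one restartable critical section (simplified). */
struct rseq_cs {
        __u32 version;
        __u32 flags;
        __u64 start_ip;            /* first instruction of the section */
        __u64 post_commit_offset;  /* length: end = start_ip + offset */
        __u64 abort_ip;            /* where the kernel redirects us on abort */
};
```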
This enables user space to do efficient per-CPU operations without locks. As I said before, user threads can't prevent CPU migration or preemption, so rseq creates an artificial way of having atomic contexts in user land.

I simplified the structure here, but basically, when you start a user thread, you call the rseq() syscall passing a structure with some data. The kernel will fill in the CPU ID for you, to tell you on which CPU you are running, but user space needs to provide some information: the start instruction pointer for the start of the critical section, the post-commit instruction pointer that points to the end of the critical section, and the abort instruction pointer that is used if something goes wrong.

The way it works is that you tell the kernel the start and the end of the critical section, and if the kernel preempts you during the critical section, it will check the current instruction pointer and compare it with those boundaries. If the kernel notices, "well, I just preempted or migrated a thread during a critical section", that means the per-CPU operation will not be atomic, so the operation needs to be aborted. Before the kernel resumes the user thread, it moves the instruction pointer directly to the abort IP, and the abort handler can do whatever is needed. Usually that means restarting the sequence from the start IP, to try again to do it atomically. Mathieu did some measurements, and usually the atomic operation succeeds; it's rare that you need to go into the abort handler. But to ensure correctness we need this mechanism anyway.

So this is how rseq manages to create atomic per-CPU operations: if you reach the post-commit IP, you are sure that the kernel did not preempt you and did not migrate your thread to another CPU, so the operation was effectively atomic. In the picture, in orange, you have all your critical-section instructions: the start IP points to the start of the critical section, the post-commit IP to the end, and another pointer goes to the abort handler.

There is also another very interesting usage that came after rseq. As I said before, every time your task is scheduled, the kernel writes into the cpu_id field the number of the CPU your thread is running on. And people went: wait a minute, this is very useful for implementing a fast getcpu. Usually, to know the number of the CPU you are running on, you need a syscall for that. But now that we have this structure shared between user space and kernel space, you can just read the structure to know your CPU ID. And as you can see, benchmarks on ARM and x86 show a huge improvement from just reading the rseq area instead of using the getcpu syscall. So nowadays, if you are using a recent version of glibc, regardless of whether your program uses rseq or not, glibc will register your thread before it runs, so it can do a very efficient getcpu if you ever call it in your code; there's a sketch of the idea just below.

So, as you can see, rseq is a very cheap interface for getting thread information from the kernel. And this is what we need, right? rseq created a structure that can be read and written both by the kernel and by user space, and reading a structure is way cheaper than messing with syscalls. So Jon from LWN suggested using rseq to solve our problem.
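For illustration, a sketch of that fast getcpu, assuming a glibc new enough (2.35+) to export __rseq_offset and a compiler with __builtin_thread_pointer(); this is my reconstruction, not code from the talk:

```c
#include <sys/rseq.h>   /* struct rseq, __rseq_offset (glibc 2.35+) */

/* Read the current CPU number straight from the rseq area that glibc
 * registered for this thread; no syscall involved. */
static inline int fast_getcpu(void)
{
        struct rseq *rs = (struct rseq *)
                ((char *)__builtin_thread_pointer() + __rseq_offset);
        return rs->cpu_id;   /* updated by the kernel on every reschedule */
}
```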
Now, back to spinlocks. As I said before, rseq was created for a different purpose, to do atomic per-CPU operations, but it suits our challenge here really well. To have user-space spinlocks, we need to know whether the lock holder is running or not. And what if we add this information to rseq? Because, as I explained, this code is already integrated into the task scheduler and already has an API.

So we go to struct rseq, and we add a new pointer to a struct rseq_sched_state: the rseq scheduler state. It has a version member, because it's a good idea to have version members in kernel APIs: if you mess up the API, you don't need to create a whole new interface, you just increment the version, and you can extend the struct or even change the members. It has a state member that tells whether the thread is running or not, and a thread ID that user space can register.

Now we just need to update the state every time a thread is preempted or migrated, and this is very simple, because rseq is already integrated with the scheduler: every time the thread is about to be preempted, you clear the state to zero, and every time the thread is migrated or placed back to run on a CPU, you flag it as running.

Now, if we connect everything, let's have a look at what the spinlock would look like (see the sketch at the end of this section). First of all, you try to take the lock. If you manage to, good, you break. If you don't, you check the state of the lock owner. If the state is flagged as running, you continue the loop: you go around and try to take the lock again. Because the owner is running on some CPU, and not the CPU you are running on, you can spin: it's very likely that the lock will be released very soon, so you don't need to sleep. However, if you see that the lock owner's state is different from running, it has been preempted, and there's no point in spinning, so you go to sleep: you just call futex and sleep, as it's done with mutexes.

Does this work? I'm not sure yet. We are working to see if we get good results in user space, whether we can speed up different workloads. We need to investigate cache optimization a little, to see if the size of the struct is right. We need to integrate this with glibc's pthread locks, so we can benchmark whole real programs running and spinning to see if it's good. And of course, do a lot of benchmarks. And I think nowadays, to merge something like this into the Linux kernel, you need to provide not only artificial benchmarks; it's also good to have more realistic benchmarks, because sometimes artificial benchmarks don't really represent what is out there.

Mathieu wrote an RFC extending rseq with the scheduler-state pointer, so you can have a look. He also created a new kernel selftest, an rseq mutex, so you can look at the code and run it to see that it really spins and works. And depending on your glibc version, you need to set this environment variable, because glibc has support for rseq but not for the scheduler state, so things can get a little bit mixed up. If you write zero here, you make sure that you are going to use librseq instead of the glibc support, so it will work.
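To recap, a rough sketch of the adaptive loop described above. The rseq_sched_state layout follows the talk's description of the RFC; my_sched_state() and futex_wait() are hypothetical helpers, and the races around publishing the owner pointer are glossed over:

```c
#include <stdatomic.h>
#include <linux/types.h>

/* Scheduler state shared with the kernel, as described in the talk. */
struct rseq_sched_state {
        __u32 version;
        __u32 state;   /* ON_CPU flag set while the thread is running */
        __u32 tid;
};
#define RSEQ_SCHED_STATE_FLAG_ON_CPU (1U << 0)

struct adaptive_lock {
        atomic_int lock;                        /* 0 = free, 1 = held */
        struct rseq_sched_state *owner_state;   /* kernel-updated owner state */
};

/* Hypothetical helpers: this thread's sched state, and a futex sleep. */
struct rseq_sched_state *my_sched_state(void);
void futex_wait(atomic_int *addr, int expected);

static void adaptive_lock_acquire(struct adaptive_lock *al)
{
        for (;;) {
                int expected = 0;
                if (atomic_compare_exchange_weak(&al->lock, &expected, 1)) {
                        al->owner_state = my_sched_state();
                        return;   /* got the lock */
                }
                struct rseq_sched_state *owner = al->owner_state;
                /* Owner on a CPU?  The lock should be freed soon: spin. */
                if (owner && (owner->state & RSEQ_SCHED_STATE_FLAG_ON_CPU))
                        continue;
                /* Owner preempted: spinning is pointless, sleep like a mutex. */
                futex_wait(&al->lock, 1);
        }
}
```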
So that's it. Thank you very much. Do you have questions?

So the question is why we would use a syscall to get the CPU number at all. Good question. I think no one ever wanted to create an interface just for getting the CPU ID faster; maybe it wasn't bothering anyone enough to create a mechanism for that. But the thing is, after rseq, after someone created a whole infrastructure for per-CPU operations, getcpu came for free, without the need to invent anything. It was just a consequence of rseq, and it was already merged. So I'm not sure why no one ever did it before, but I believe no one ever wanted to dig that deep. Now that we have rseq, we don't need to do much to implement it. Yes, yes, you need to do some work, and it's not trivial to get rseq implemented.

Oh, sorry. You're saying that rseq is the same as a spinlock? Yeah, if you're doing per-CPU operations, you can just use rseq. But if you are contending, if you have more threads than CPUs, then you would need to spin.

Sorry, can you repeat? About virtualization, virtual machines, okay. So on virtual machines, you have the problem of one lock per application, right? And what's the next part? Right, one thread per application, okay. I'm not sure how this problem translates to virtualization. Okay, so the next waiter is not on the CPU. Mm-hmm. Okay. Cool.

Do I have more questions or comments? Okay, so the question is why we need to speed this up, for which problem you would want to use a spinlock, right? This is very useful when the critical section is very short and you have a lot of contention. This is a long-standing problem; people have been trying to solve it for, I don't know, ten years, I think, because people have measured in a lot of different ways that doing the context switch costs way more than just waiting for the lock to be free. But as I showed before, nowadays we can spin, but we can't spin correctly. So I can't point to an exact example, but a lot of mutexes could switch to spinlocks, mostly where the critical section is very short.

Do we have more questions? Yes. Right, so the question is whether the spinlock is maybe not a general solution, right? Because maybe your code has short critical sections, but it depends on the hardware, and they are very unpredictable. So yeah, I think the idea is to have different locking mechanisms, so the user can go ahead and benchmark and choose which one is better. But I think having this spinlock on the table gives a tool to be used, and I believe some cases will definitely benefit.

Okay, I think this is it. Thank you.