I'm Steven Rostedt, I'm an open source engineer at VMware, and I'm also the creator and current maintainer of Ftrace, the official tracer of the Linux kernel. Today I'm going to talk about nested timestamps and where to find them. No, that's not what I'm going to talk about. What I am going to talk about is nested timestamps, how they've plagued the Ftrace ring buffer design for the last 12 years, and how I finally fixed it. So, come on, let's start. Before I talk about nested timestamps, I need to talk a little bit about how the Ftrace ring buffer is designed and how it works. When you create a ring buffer, it actually creates a buffer per CPU; there's one buffer for every CPU on the machine. That's because when a task on a single CPU wants to write to the buffer, it shouldn't have to compete with tasks on other CPUs; it writes only to the buffer designated for that CPU. So when you have multiple writers, they each have their own buffer to write to, associated with their CPU. As you should know, a CPU can only do one thing at a time, so you don't have to worry about parallelism when you're only dealing with a single CPU. This removes a lot of the race conditions involved in writing to the ring buffer, because we only have to worry about things that happen on that one CPU. And remember, the Ftrace ring buffer is completely lockless. Now, if we look at one of these individual per-CPU buffers, we'll notice that it's broken up into sub-buffers, or buffer pages. It just so happens that these buffer pages are the same size as the architecture's page size. The architecture page size is the unit the hardware breaks virtual memory into, so, say, when you read data from the file system, a lot of the time it's pulled into memory in chunks of that page size.
And by making the sub-buffers in Ftrace the same size as the architecture's page size, and because each sub-buffer is self-contained with respect to its timestamps and events, you can pass a sub-buffer straight out of the ring buffer with a zero-copy write, either recording it into the file system or sending it over the network. It all works with zero copies, so it's very efficient to do it that way. So let's look at a single buffer page. Since the buffer page happens to be self-contained and also controls how the timestamps work, we only have to worry about the buffer page for this talk. When you look at a buffer page, you'll notice that it has a header, defined by the structure buffer_data_page. The very first item in that buffer_data_page is a 64-bit timestamp. This 64-bit timestamp records whatever clock is being used to synchronize all the events within the buffer. When you merge all the per-CPU buffers of the Ftrace ring buffer together and want to see the events ordered chronologically, the events need timestamps for the sort, to know which one comes next. That means these timestamps have to be monotonic: they must always increase and can never go backwards. If a timestamp went backwards between two back-to-back events, that is, if the second event had a timestamp earlier in time than the first event, that would break the merge sort when you merge all the buffers together. So it's very important that these timestamps are correct, or at least monotonic. The next part of the buffer page header is the commit, which is basically the size of the data on that buffer page. The commit value tells you how much data has been loaded onto that page, because it might not be full.
It could have just a few events on it when it's transferred off, so the commit tells you how many events there are. Basically, it's the index of the end of the last event on the page, which covers all the data on that page. And the data area is where those events live. Now, if we look at an individual event, every event has its own header, defined by struct ring_buffer_event. The first four bytes of that header are broken into two parts. The first five bits of that first word are the type_len. The type_len tells you what type the event is, and possibly its length. The rest of that word is the time_delta. Remember, the buffer page header holds a full 64-bit timestamp for the whole page. The first event recorded puts its full 64-bit timestamp into the page header, and all the events after that only record a delta. This way we only have to save 27 bits, not a full 64-bit timestamp per event. Of course, you could ask what happens if the delta grows bigger than that; I'll talk about that in a little bit. So, to show this, I have to move myself out of the way. Let me do that. Okay, now if you look at the type_len over here, it tells you how this is set up. Remember, it's five bits, which means the number inside the type_len can be zero to 31. If it's zero, that means the length of the event is stored in the first four bytes of the array: array[0] contains the actual length of the entire event. If the type_len is one to 28, the size of the event is four times that number. So if the event is 112 bytes or less, because 28 times four is 112, the length of the event is stored directly in the type_len. Why? Because then we don't need those four extra bytes in array[0] for the length.
We can actually move the event data into that space. Most events are less than 112 bytes, and we want to keep them as compact as possible, because the smaller the event, the more events you can hold and the less likely you are to lose events. So if the event is 112 bytes or less, the type_len is the size of the event. By the way, all events happen to be multiples of four bytes; the ring buffer forces that. And the data can start at array[0], so we don't waste those four bytes. Now, let me continue with the type_len. If it's 29, that's used for deleted events: the Ftrace ring buffer allows events to be deleted. If you don't like an event, even after you commit it, you can actually say "delete this event." If the event is the last event on the buffer page and you go to delete it, the code just pulls the commit value in the page header back, and that event no longer exists. But say another event came in and landed after it. Now you can't pull the commit on the buffer page back, because you'd lose the event that came in afterwards. So to let readers know that this event is actually deleted, it switches its type_len to 29, array[0] holds its actual length again, and the deleted event just becomes padding. For those first sets of type_len values, zero through 29, you'll notice the time_delta is a 27-bit number: the delta between the previous event and this event. As I said, the buffer page starts off with a full timestamp, and every event after that records only the delta from the previous event, so we don't record a full 64 bits each time. And what happens if the time difference is bigger than that? Well, we need more than 27 bits, because 27 bits only holds about 134 milliseconds. So we need something else. That's where time extends and time stamps come in.
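As a rough sketch, the two layouts just described look like this. The field names are patterned on the kernel's struct buffer_data_page and struct ring_buffer_event, but this is a simplified illustration, not the actual kernel source:

```c
#include <stdint.h>

/* Sketch of the sub-buffer header: a full 64-bit timestamp for the
   page, a commit index telling how much data is on the page, and then
   the event data itself. */
struct buffer_data_page {
    uint64_t time_stamp;   /* full timestamp of the first event */
    uint64_t commit;       /* index of the end of the last event */
    uint8_t  data[1];      /* the events live here */
};

/* Sketch of the per-event header: 5 bits of type_len and 27 bits of
   time_delta packed into the first 32-bit word. */
struct ring_buffer_event {
    uint32_t type_len:5;
    uint32_t time_delta:27;
    uint32_t array[1];     /* length word and/or payload */
};

/* Payload length in bytes for a data event, per the encoding above. */
static uint32_t event_length(const struct ring_buffer_event *ev)
{
    if (ev->type_len == 0)
        return ev->array[0];      /* large event: length in array[0] */
    if (ev->type_len <= 28)
        return ev->type_len * 4;  /* <= 112 bytes: length is 4 * type_len */
    return 0;                     /* 29-31: padding, time extend, timestamp */
}
```

So an event whose type_len is 9 carries 36 bytes of payload, and a type_len of zero sends the reader to array[0] for the real length.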
And I'll talk more about those later, but just to let you know, what the time extend and the timestamp events do is take array[0] and use the full eight-byte header: array[0] becomes part of the timestamp. It takes the 27 bits of the time_delta, attaches array[0] to it, and gives you a 59-bit value, either a delta or an absolute timestamp. I'll explain that in a little bit. So let's look at how we add events to the buffer; this is very important. We start off with a write tail, say at 24, and we want to add an event that's 36 bytes. What we do is call a function named local_add_return, which adds the 36 to the 24 in the write tail and stores the result, 60, back into the tail. Now we know that's our allocation. If we subtract our length of 36 from the returned value of 60, we get 24, which is the beginning of our event, and that's where we're going to write it. Once we've allocated it, we're good: no one else is going to jump in and use that space. We have the allocation; we just have to fill it up. But what happens if something came in right after we did the local_add_return and bumped the write tail further? If we then read the write tail and subtracted 36 from it, we wouldn't get 24, so we wouldn't actually know where our data is. Well, that's why we use local_add_return's return value. local_add_return is an atomic operation. If you were a CS major, you should know all about the read-modify-write race condition when you have multiple accessors. When I have a value of 24 and I read it, I want to add 36 to it and write back 60. What happens if something squeezes in right there? I read the 24, something else comes in and does the exact same thing: it reads the 24, adds 30, and writes back 54, and then I write 60 on top of it.
You're going to have a fight there. Well, local_add_return happens to be an atomic operation: either the CPU architecture gives you the power to do this directly, or the operating system has hooks so that you can do it atomically. It will do the read, modify, write, and if something else comes in, it will be blocked. Basically, it's as if the CPU has its own internal locking system to keep multiple accesses from racing on that read-modify-write. So you can be confident that when you do that read-modify-write, it's done atomically and you don't have to fight anyone else. Now, let me talk about the different contexts. At the lowest level, we have the threaded context, or normal context, which is the only context in which Linux allows the processor to schedule between tasks. That's why I call it the threaded context: if you have one thread and you want to go to another thread, you have to be in the threaded context to do that switch; it's called a schedule switch. But there are other contexts. Remember, when you start a write to the Ftrace ring buffer, it disables preemption; it won't allow that context switch to happen. That means that once you start writing to the ring buffer, you don't have to worry about being scheduled out and something else coming in and writing to the ring buffer too, because it's per CPU. And in user space, if you're per CPU and you don't have to worry about preemption, then what's the problem? Nobody can interfere with you. But in the kernel, it's not that simple. We actually have other contexts to worry about. Namely, we have the software interrupt context, which can interrupt us.
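That atomic add-and-return allocation step can be sketched in user space with C11 atomics standing in for the kernel's local_add_return (which returns the new value of the counter); the names here are illustrative, not the kernel's:

```c
#include <stdatomic.h>

static atomic_uint write_tail;   /* index of the end of the last event */

/* Reserve len bytes: atomically advance the tail, then derive the
   start of our slot by subtracting len from the new tail, exactly as
   described above.  Because the read-modify-write is a single atomic
   step, no other writer can be handed the same region, and the value
   returned to us is unaffected by anyone who jumps in afterwards. */
static unsigned int reserve_event(unsigned int len)
{
    unsigned int new_tail = atomic_fetch_add(&write_tail, len) + len;
    return new_tail - len;  /* start of our reservation */
}
```

With the tail at 24, reserving a 36-byte event hands back 24 as our write position and leaves the tail at 60 for the next writer.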
And to explain the software interrupt context, I have to talk about the hardware interrupt context, which can interrupt the software interrupt context. Basically, a hardware interrupt happens when something like a network packet arrives at your network interface card, over the wire or the wifi or whatever, and the card wants to notify the CPU: hey, we have a packet here that needs to be processed. It sends an interrupt to the CPU, and the CPU stops what it's doing, jumps over to some interrupt handler code, and executes that handler to process the packet. Now, why do we have software interrupts? Well, say you have a lot of packets coming in all the time, or you're accessing the hard drive. When a hardware interrupt is being handled, no other hardware interrupt can happen; the Linux kernel doesn't allow multiple hardware interrupts to nest, so nothing else gets in until the first one finishes. So we have one interrupt processing network packets, and you start typing on your keyboard. Your keyboard won't be able to tell the CPU that you're typing anything until that hardware interrupt is done processing the packet. And since processing packets can take a long time, instead of preventing other hardware interrupts from happening, the handler hands off to a software interrupt, which can be interrupted by other hardware interrupts. So the packet comes in, gets put into a queue, and the handler hands off to the software interrupt that runs the rest of the networking stack. As it processes, your keystrokes can interrupt that processing and let your text show up, so you don't have to wait. Above that, we also have non-maskable interrupts, which, as the name says, you can't stop.
These non-maskable interrupts can interrupt hardware interrupts. They're usually used for profiling, and obviously, if you want to profile hardware interrupts, you need to be able to interrupt them to profile them. So you use NMIs, non-maskable interrupts, for profiling, or for watchdogs. Say you had interrupts disabled and the system is spinning on something you can't get out of; a watchdog NMI can go off, see that the system hasn't moved forward at all for a certain amount of time, and then notify you and let you know that the system's locked up. So that's what NMIs are. There are other contexts higher than that, like SMIs, system management interrupts, or machine check exceptions, but those are out of scope here. The point is that all these contexts are allowed to write to the Ftrace ring buffer; you can trace at the NMI level. So that's why the Ftrace ring buffer has to be lockless and also can't rely on disabling interrupts: for example, you can't disable a non-maskable interrupt. We have to make sure everything works and handle the case where one context interrupts our processing while we are in the middle of writing to the ring buffer. I like to call these interruptions "stack interruptions." What do I mean by stack interruption? We have the threaded context, our lowest-level context that can schedule, and in there, a software interrupt happens. Now, like I said, most of the time software interrupts are kicked off by a hardware interrupt, but for this explanation, just think of a software interrupt as something that can interrupt the processing of a thread, and then a hardware interrupt comes in and can interrupt the processing of the software interrupt.
Now, the reason I call this stack interruption is that each context that interrupts a previous context will finish everything it's doing before it allows the context it interrupted to continue. That's very important for our algorithms: if an interrupting context records into the ring buffer, that recording will finish before execution goes back to the context it interrupted. You'll see later why this matters; for now, just remember that everything in the interrupting context finishes before the interrupted context continues. So let's look at adding events again, this time with timestamps. If we're doing deltas, we always have to keep track of the timestamp of the last event on the buffer, because we calculate the delta of the new event from the previous event; we have to know the previous event's timestamp to calculate our delta. So say the last event on the ring buffer had a timestamp of 10,000. We're going to add a new event, and we read the clock and get 10,010. We do the local_add_return, claim our space, get our event, and now we calculate the time delta: our timestamp of 10,010 minus the last timestamp of 10,000 gives us 10, and that's what we store in the time_delta field of the event header. Then we have to update the last timestamp to our event's timestamp, because when another event comes in, it's going to use that to create its delta. Well, let's look at this with interrupts involved. The last timestamp is 10,000 and we read our clock, getting 10,010. But just before we allocate, something jumped in and allocated instead. And it updated the last timestamp to its own timestamp.
Remember, if we get interrupted, the interrupting context goes through its whole process and comes back; everything it does will be finished before we get to continue. So it updated the last timestamp without us knowing it. We allocate our event and then say, okay, let's take our timestamp minus the last timestamp that's stored there, and we get minus 180. Oh, while I'm doing this, I realize I'm in the way; let me move myself back down. There. So as you see, the delta is minus 180. And that's not the only thing that's wrong. Let me go back here. From this minus 180, you can see the issue: we can't trust that last timestamp. And it gets even worse than that. Another situation could be that we read the last timestamp to calculate with, we allocate, we fill everything in, but something else got in there before we were able to update that last timestamp. We put in our time delta of 10, and then something else snuck in. From here, it does the same operations: its timestamp is 10,200, and it does its delta, its timestamp minus the last timestamp, and comes up with 200. But that's wrong. It's computing a delta from an event two back, not from the one right in front of it, the one we were writing. And if the gap were bigger, it would be off by much more. It really should be 190; I have a typo there, it should be 190. So, how did we handle this before the fix? Twelve years ago, I realized this situation existed. How do I solve it? Well, this situation is actually quite easy to detect, and I'll talk about how. We have a commit index: whenever something is committed, we keep track of the end of the last event that was written, and this commit index can only be updated by the lowest level in the stack.
That means, say, you start writing and you didn't interrupt anything, and something else interrupts you: it won't update the commit index, because it will see that it interrupted you. When it returns and your context continues, then you update the commit. So, let's look at how this is done. The buffer has a "committing" variable that counts nesting. The first thing we do is increment committing when we start the event, and, if you look at the bottom down here, decrement it at the end. The funny part is, if you notice, I don't need local_add_return here; I just do a simple read-modify-write. Why is that? Here's one of the benefits of stack interruption: whatever interrupts you puts everything back the way it was when it interrupted you. Say the buffer's committing count is zero and I want to add one to it. I read it, it's zero, but I get interrupted, and the task that interrupted us writes into the ring buffer: it reads committing and sees zero as well, sets it to one, does all its processing, and just before it leaves, it subtracts one and sets it back to zero. When control gets back to us and we write our one back, the count is actually in the correct state. So when you have stack interruptions that put things back, you don't have to use atomic operations; everything will just work. Then, after that, I record the last timestamp, which is the one I want for the delta; I read my clock; and then I allocate space on my buffer page. I subtract the length of my event from the returned write index to get the start of the event, and if that equals the buffer's commit, the commit that we only update at the end, then I know this event wasn't interrupted, so I can use the delta.
I know the last timestamp I read came from the last event on the buffer, because my allocation starts right at the commit; it matches. But what happens if it doesn't match? What do I do? I set the delta to zero. Why? Well, good question. We set it to zero because we have nothing better to do. We're confused; we punt. That's an American football term: you're at a point where you know you can't succeed where you are, but instead of doing nothing, you kick the ball as far away as possible, giving it to the other team but much further back, so hopefully you end up with a better chance. Basically, it's the best thing to do with the options you have. And if you think about it, we're technically nested anyway: if writing the event was interrupted, and you're comparing that event to whatever interrupted it, well, who happened first? It doesn't really matter; they're two different contexts, and whether the interrupt landed just before the event or just after, nobody cares. The real problem with this approach is that in the interrupting context, all the event deltas are zero, because all of its events have interrupted another event. When you go into a hardware interrupt and you're going to trace it, you don't just trace one event; usually you trace a bunch of events. If you're doing function tracing, maybe you want to see how long that interrupt lasted, so you take a trace when the interrupt entered and when the interrupt exited. But since the deltas are all zeros, the trace buffer says the interrupt was instantaneous, which we know is wrong.
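Putting the interruption check and the punt together, the pre-fix logic amounts to something like the following sketch (the names are mine for illustration, not the kernel code):

```c
#include <stdint.h>

/* If our reservation starts exactly where the last committed event
   ended, nothing interrupted us between reading the last timestamp
   and allocating, so the delta can be trusted.  Otherwise we punt
   and record a delta of zero. */
static uint64_t event_delta(uint64_t event_start, uint64_t commit,
                            uint64_t ts, uint64_t last_ts)
{
    if (event_start == commit)
        return ts - last_ts;  /* uninterrupted: real delta */
    return 0;                 /* interrupted: punt */
}
```

In the earlier example, the uninterrupted writer starting at the commit index of 24 gets its real delta of 10, while a nested writer whose event starts past the commit gets zero.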
So if you want to see how this actually plays out, I run trace-cmd, which is a user interface to Ftrace. trace-cmd record records the data and puts it into a file so I can analyze it. I enable function tracing and all events. I run the hackbench utility, which is basically a benchmark; it runs 50 groups, whatever that means, but it does a lot of things, so I get a lot of events. Then afterwards, I run trace-cmd report, which reads that data file. Now, remember I said the buffer pages are all stand-alone? trace-cmd uses the splice system call, which is a very interesting system call, to move the data directly from the ring buffer into the file system with a zero-copy write. So the buffer page in the trace data file that trace-cmd report reads is exactly what was in the ring buffer itself, and I can use that. The -l option gives me this column, which tells me when interrupts are happening or not; -t gives me the full timestamp. Without -t, it truncates to microseconds, but I want to see the full nanosecond output; the truncation is probably there because nanosecond resolution is usually too much info. The --ts-diff option shows the difference between consecutive events; the time difference from the event before gets recorded right here with the plus. I also add --debug, which is an undocumented option of trace-cmd. You can use it, it's fine; there's no standard to --debug, I just add whatever I need to debug trace-cmd's processing, and it produces this extra stuff over here that gives me some data. And --cpu 0 means I don't want to see multiple buffers interleaved; I only care about the one buffer, to show you this effect and give you an idea. So --debug gives you this extra information, and that one number there is 24.
That's the size of the event that was recorded: 24 bytes. This is a hex integer of the index into the buffer page, so it tells you where you are in the buffer page; this one is near the end. And here is the time delta recorded by this event, the delta from the event before it, which is usually the same as what you see with --ts-diff, but not always. Why is that? Well, look down here. run_local_timers has a time delta of zero, but its ts-diff is about seven microseconds. Why? That's because, remember, the first event's timestamp is really recorded in the header of the buffer page. Here --debug shows that we've entered a new buffer page, and that buffer page has the full timestamp, so now we actually have an idea of when this event happened. We see zeros all the way up across those seven microseconds, and we lose all that information: according to the trace, those events are instantaneous, but we know that's not really true. And if we look further, we see that the sysvec_apic_timer_interrupt, which shows up as a hardware interrupt over here, interrupted tlb_remove_page_size while that event was being written; we can tell because the timer interrupt's first event had a delta of zero, which means it interrupted something. Here we see the handoff from the hardware interrupt to the software interrupts. And finally, when we return back to normal processing, we get to the next event that actually calculates a real delta again: 13, almost 14 microseconds later. Add the seven microseconds we saw from the buffer page on top of that, and that's around 20 microseconds over which the interrupt happened, and we've lost where the time actually went.
All we see is zeros; we know 20 microseconds passed, but we don't know exactly how that time was split up. So let's go back and talk about the time extends and timestamps I mentioned at the beginning of the talk. Remember, the time delta in the event is only 27 bits, and two to the 27th nanoseconds is approximately 134 milliseconds. But what happens when we record an event, and then a while later record another event, more than 134 milliseconds later? By the way, just a note: I'm using nanoseconds and milliseconds here because that's the default clock, but the ring buffer doesn't care what unit you use. It could be counters, it could be milliseconds, fortnights if you wanted; it doesn't really matter. As long as it's monotonic, that's all the ring buffer cares about. So anyway, back to nanoseconds. Say the last timestamp of the previous event was 10,000, and my new one is 140,010,000. When I do the subtraction, the delta is 140 million, which is greater than the roughly 134 million nanoseconds that fit in those 27 bits. What I do in this case is detect that the delta is greater than the max delta, and then extend the allocation size of the event by eight bytes, which is enough to store a time-extend event. So when we do the local_add_return, we actually allocate for two events: one for the time extend and one for our event itself, and we put the full delta into the time extend. Timestamps work much the same way. The reason timestamp events evolved is that Tom Zanussi was adding histogram logic to Ftrace, where you can create synthetic events that take the timestamps from two different events and do a subtraction between the two to create a third event.
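Backing up to that overflow check for a moment, the decision just described (does the delta fit in the 27-bit field, and if not, reserve an extra eight bytes for a time-extend event in front of the data event) can be sketched like this; the constant and function names are mine, not the kernel's:

```c
#include <stdint.h>

#define TS_DELTA_BITS 27
#define TS_DELTA_MAX  ((1ULL << TS_DELTA_BITS) - 1)  /* ~134 ms in ns */

/* Return the number of bytes to reserve for an event of length len:
   if the delta does not fit in the 27-bit time_delta field, add 8
   bytes so a time-extend event can be placed in front of the data
   event, holding the full delta. */
static uint64_t reserve_size(uint64_t len, uint64_t delta, int *need_extend)
{
    *need_extend = delta > TS_DELTA_MAX;
    return len + (*need_extend ? 8 : 0);
}
```

For the 140-millisecond gap in the example, a 36-byte event would be reserved as 44 bytes: the time extend plus the event, claimed in a single local_add_return.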
And to do that, since an event only gave you a delta, he couldn't know its absolute timestamp. So when you ran with histograms, Ftrace would actually put a full absolute timestamp in every event. Yes, it does add more data, because now you're adding a full eight bytes ahead of every single event to make those subtractions possible. And it would actually suffer every so often: when you hit one of these races with that enabled, time would go backwards. But in most cases it was just a little glitch and didn't break anything per se. So the timestamp events are the same size as time extends, and for this algorithm, we actually use those absolute timestamps. I'm glad he added them, because that was one of the key ingredients we needed for this fix. So, the 12-year-old puzzle, solved. Remember, for 12 years we've had zero deltas on interruption, and this was a pain. I tried various solutions, using stacks of per-context state and so on; I talk about this in a Linux Weekly News article that I wrote, and I'll post the link at the end. Everything I came up with always failed: it got too complex, too many states, I couldn't handle it. And that's because I was always trying to make it go as fast as possible. Well, then Julia Lawall reported a bug: in normal tracing, not with histograms, just normal tracing, time actually went backwards once. She gave me the trace.dat file, and I was able to run trace-cmd with --debug on it. I found that it was a bug dating from when the absolute timestamps were introduced: there was a case where a time extend was used and a zero delta wasn't put in, which allowed time to go backwards. And as I was debugging this, I started thinking, because I realized that this interruption didn't happen often.
It was very, very rare and infrequent. So as I was thinking about this, I thought: compare-and-exchange. I always try to stay away from it, because, and I'll talk a little bit about it, it's an expensive operation. It's a hardware operation, but it's kind of expensive. If you're doing it for every single event, it can cause trouble: if you're recording every function in the kernel, you don't want to do compare-exchanges, even if the processing is on a per-CPU basis. So I try to avoid it at all costs. But I realized that in the slow path, I don't really have to. I mean, I already use compare-exchange when I cross over to a new page, so it's already there. And if I added it to a slow path that rarely happens, no one's going to notice; it's going to be within the noise. So I said, let me introduce compare-exchange to the solution. If I only use compare-exchange in the slow path, that means I can bring the number of states down to something much more comprehensible. So let me talk about compare-exchange. What is compare-exchange? Well, in a lot of cases, you want to update some value that other things could be updating at the same time. This is how locking is usually implemented: to grab a spin lock, you need to update a value that says who holds the lock, so the implementation of locking usually uses compare-exchange. The way it works is, say you have a little loop, and you read some value that you want to update. First you check: is this value something I need to update? If it is, I'm going to write my new value to it. But if something comes in and changes it to something you don't want to overwrite, you have to be careful about that. So you do a compare-exchange, and compare-exchange is this, and this is kind of a C representation.
Compare-and-exchange is usually a single instruction in the CPU's instruction set, but if I were to write it in C, this is what it would look like. You take a pointer to the value in memory, and you give it an expected value. Only if those two match do you write your new value to that location. And at the end, you return the value that was read to begin with. If they don't match, you don't write anything, but you still return the value you read. That way, when you go back, you can check: does the value the compare-and-exchange returned match the old value I read? If they match, the update happened. If they don't match, I've got to go back and try again. Now remember, this is all done atomically, and that means it's pretty expensive. The normal compare-and-exchange works across all CPUs. So if you have a 64-CPU machine and they all do compare-and-exchange, you have to lock the bus to make sure each one does this, and it serializes the whole thing. That's why it's really expensive. For our purposes, for this case, the first thing we do is use a local compare-and-exchange, if the architecture supports it. A lot of architectures have a compare-and-exchange that doesn't affect all CPUs; it only affects the current CPU, and that helps you synchronize against things that stack up on you, like interrupts. Since the per-CPU buffers only race against interrupts on the same CPU, the local compare-and-exchange is all we need. Another thing is, if something comes in and updates the value, it'll usually put everything back to what we wanted it to be anyway. So we don't really need to do a loop: if we get interrupted, the result is what we wanted. Whatever interrupts us will put everything back to normal, so we don't care about looping.
And in some cases, because of that, we don't even have to check the return value — just do a local compare-and-exchange and continue. Although sometimes that's not true. Okay, so let's look at the players — the two players from the previous solution that we bring over to the new solution. That is the write tail, which I mentioned earlier: the index of the end of the last event on the buffer. And the write stamp: the timestamp of that event, which we use to calculate the deltas. But we bring in a new player called the before stamp. The reason I call it the before stamp is that every event writes to it before it tries to allocate space on the buffer. And here's something really important: when this whole process is done, at the end of recording any event, this is always true — the before stamp will equal the write stamp, and the write stamp will be the timestamp of the last event on the buffer. That's very important for this algorithm. So let's look at the states. We start off — this is important, remember this — we read the write tail and put it into save_write, a local variable, so we know what that write tail was; we're going to use it later. Remember: save_write is the write tail before we start anything. Then we read the before stamp, and we read the write stamp into a variable called "after." That's because the write stamp isn't updated until after you allocate, and since the before stamp is written before you allocate and the write stamp after, saving it into a variable called "after" makes sense. And then I read the timestamp counter from the clock. Now, if the before stamp equals the after stamp, I know we didn't interrupt anything.
Because remember, at the end of this whole process, the before stamp will equal the write stamp, so if they match, I'm good. But if they don't, that means we interrupted something — we interrupted an event being written. This can only happen in a context that can interrupt another context. In that case, what we do is ignore the before stamp and the write stamp and just say: screw it, we're going to put a full absolute timestamp in there. So we add to our length enough room for that full timestamp that Tom Zanussi created. And then we rejoin the fast path. Either way, we now have to allocate space on the ring buffer. Before we do that, we update the before stamp — remember, before allocating, everyone updates the before stamp with their timestamp. Then we allocate our space, and we subtract our length from the result of the allocation, which gives us the beginning of our event, so we know where the allocation happened. Yes, I call it "tail" — there's a reason for that inside the code, where other things use it; for now, just think of the tail as the starting location of our allocated memory in the ring buffer. Then we look: does the tail equal save_write? Remember what save_write was? It was the value of the write tail before we started all this. So if the beginning of our event equals that old write tail, nothing interrupted us from the time we read it to the time we allocated — we are the next event on the ring buffer. But if they don't match, something snuck in between those two times; I'll come back to that later. So we're good — let's update the write stamp with our timestamp. Are you sure that's okay?
What about the read-modify-write issue? What happens if something snuck in and updated it? Remember, if something interrupts us right after we allocated, it would have updated the write stamp too, to make the before stamp equal the write stamp. And we just updated the write stamp to go backwards — it now has our timestamp, not the timestamp of the event that interrupted us after we allocated. How do we handle that? Well, to detect that case, after we update the write stamp, we read the before stamp again. Remember, if something comes in, it updates the before stamp as well. So if the before stamp still equals our timestamp, we know nothing interrupted us — we're good to go, everything's fine. But if they don't match, something snuck in, and we just updated the write stamp without knowing whether that interruption came before or after our update. We might have just corrupted the write stamp, and we'd have to fix it. But I'll come back to that later. Anyway, this is the point where we go back to the before-equals-after check from the start — remember, the pinks are the slow paths, the greens are the fast path. Now we ask: were we in the fast path up there? So we again check whether before equaled after. If the two were equal, we were on the fast path all the way down — this is the most common case. And all we have to do is take the after value — remember, we saved the write stamp to a variable, and since nothing interrupted us, that was the timestamp of the event before us — subtract the two, and we have our time delta. Done, finished. But if the two were not equal back in the beginning, remember, we reserved space for an absolute timestamp. So we just fill in that absolute value with our timestamp and we're done. Very easy.
Rolling back to state eight, where we had just updated the write stamp and then something came in — and we don't know if it came in before or after our update. What we do is read the before stamp again to see what the value is now. Remember, if something interrupted us, the before stamp will hold the timestamp of the last event that interrupted us. Now, if that before stamp is less than or equal to our saved after value, we know the interruption's update happened after we wrote the write stamp. So we're good — we don't have to do anything; the thing that interrupted us fixed everything up. But if the before stamp is greater than the after stamp, time went backwards and the two don't match, so we need to fix it ourselves — remember, the before stamp has to end up matching the write stamp. So we update the write stamp, but only if it still contains the value we know we put there. If it doesn't match our saved value, something else came in and fixed everything up for us, so we're good to go. This is just showing you the various things that can happen. Anyway, let's roll back to the very beginning, where something squeezed in before we allocated. We started the process, something jumped in, and then we allocated. We detect this because our tail — really the beginning of our event — does not match the end of the buffer that we read in the beginning. That means something slipped in. What do we do here? Well, this gets a little more complicated.
So right now we can't trust any timestamps — even the timestamp we read, because we don't know whether it was taken before or after the interruption, and we don't want time to go backwards. So what do we do? We reread the write stamp and we reread our clock. Now, here's the thing we have to look at: if the write tail still equals the end of our event — remember, our end was the value returned by the local_add_return, which is the tail and our length added together — that means nothing interrupted us from that point to this point. And if that's the case, we know the write stamp we just reread belongs to whatever interrupted us in the beginning. Remember, the write stamp holds the timestamp of the last event. So if something interrupted us and control came back to us, that write stamp is the timestamp of the last event on the buffer — which is exactly what we want to take the delta against, because it interrupted us and nothing else did. So all we do is take the delta of those two, and then we have to re-update the write stamp. For that: if it still matches the after value, we update it with our own timestamp and we're done. But what happens if something came in and it doesn't match? That means something jumped in before us, we allocated our event, and then something came in again — because the write tail no longer matches the end of our event. We're kind of confused now. The before stamp now matches that later event, which is after us; we don't know the timestamp of the event before us, and our delta needs to be against that. So what do we do? We have no idea what the delta is, so we punt. Really? Why is that? Well, the reason is simple.
Remember I said in the beginning that when an event interrupts another event, it doesn't really matter where it falls between those two events — they're two different contexts. The event happens and control comes back to you. So for that one event, you really don't care whether it's stamped before or after: zero is okay. And another thing: it only happens to one event. If another event happens after ours, it will have the correct deltas and everything will work fine. These delta-zeros only happen to a single instance, and only on a context change. So in reality, you could say those two events kind of happened at the same time — if you think about it that way, it's okay. So anyway, note that there is a small, slight race condition in the algorithm I showed above. See if you can find it. I mention it in the Linux Weekly News article I wrote, which is right here — if you go there, you might be able to find that same small race condition. If you figure out what it is, you can send me a patch; I'll take it. It's a trivial fix, but I'm holding off because it's really not that big a deal if it happens — it can cause a little time hiccup, but it's a very, very seldom race. If you can find it and want to get your name in the kernel and your claim to fame, send it to me; I'll be happy to take it. So I'm going to leave that as an exercise for the reader — or the viewer — to fix. There's also an issue with all these atomic operations we're doing on the timestamps, which are 64 bits: atomic operations on 64-bit numbers don't work so well on 32-bit machines. So I had to come up with a whole new way of handling that, and that's described in the LWN supplement you can find there. I hope you enjoyed the show. By the way, this is a VMware-sponsored talk. Thank you, and let's see if there's time for questions.