So, thank you, everyone, for coming to my talk after lunch. I know you're probably all exhausted from eating, so you may end up sleeping through this timekeeping presentation. When I told people I was doing this presentation, I usually got asked: why are you doing this presentation? I think what happened is that I had some commits in this area of the kernel, so people found my name when there was a bug there, tracked me down, and said, hey, what's wrong with this code? And I did not work on any of this code. So I got a tag-you're-it kind of situation. Then they said, hey, why don't you make a presentation about this, and you can tell us how everything works. About a year went by where I said, I'm not going to do that, that sounds awful. And then I decided maybe I should, because it might be interesting. So this is the culmination of a few months of piecing things together to give a high-level overview of how all this timekeeping code actually works together. With that, let's get started. Does this work?

In the beginning, there was a counter. And the counter counted, and that's what the hardware is: it basically just keeps counting. The counter can tick at any rate, and it just keeps incrementing. (It can also decrement, by the way, but in this example it's incrementing.) This counter is great. Sure, I have something that counts, and it might be 32 bits, it might be 64 bits, it might be 16 bits, who knows how many bits it actually is. But this counter counts, so how do we actually calculate time based on it? To calculate time, the math is basically how many cycles you have divided by the frequency of the counter, and that tells you how many seconds it's been, right?
Because frequency is in hertz. And if you want to translate that into nanoseconds, you just follow the same equation scaled by 10^9, right? Now, what's wrong with that? Well, in the kernel there are a few problems with this. Division is slow, and we're doing this all the time; we're reading the counter quite constantly, whenever we're measuring elapsed time and things like that. So division is going to take a long time, and that's time you're not going to get back. Plus, done naively, this requires floating point, and you might actually have underflow or overflow problems with all this math. Of course, the math itself is pretty simple.

So what do we do to actually do it better? We have this function in the clocksource code, cyc2ns, which translates the cycle count you're given, using a multiplier and a shift factor, much more efficiently than doing a division, right? And where do these mult and shift values come from? Well, there's a simple function. I don't know if it's simple. There's a function that exists to calculate this for you based on your frequency, called clocks_calc_mult_shift. It basically takes from and to rates and a maxsec value, the maximum number of seconds you'll need to convert, and translates that into these multiplication and shift factors. And if you see here, when you have a clocksource, you're translating from a frequency to nanoseconds, so you want to convert from that frequency to nanoseconds. This function is specially tuned so that you don't overflow with large cycle values when you do this multiplication, and it also tries to favor a larger shift value so that we have a finer-grained way to adjust the frequency of the clock later on when we do NTP adjustments.
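To make the mult/shift idea concrete, here's a minimal sketch in C. The kernel's real conversion lives in clocksource_cyc2ns(); the mult and shift values used below are illustrative ones derived for a hypothetical 19.2 MHz counter, not values from any real driver.

```c
#include <stdint.h>

/* ns = (cycles * mult) >> shift -- the multiply-and-shift replacement
 * for "cycles / freq * 1e9". A sketch of what clocksource_cyc2ns() does. */
static uint64_t cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (cycles * mult) >> shift;
}

/* Derive mult for a chosen shift, in the spirit of what
 * clocks_calc_mult_shift() computes: mult = (1e9 << shift) / freq,
 * i.e. nanoseconds-per-cycle scaled up by 2^shift. */
static uint32_t calc_mult(uint32_t freq_hz, uint32_t shift)
{
	return (uint32_t)(((uint64_t)1000000000 << shift) / freq_hz);
}
```

For a 19.2 MHz counter with shift = 24, calc_mult() gives 873813333, and one second's worth of cycles converts to just under 10^9 ns; that tiny rounding error is exactly the kind of thing the NTP adjustment code later steers out.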
I think the timekeeping maintainers would say: you should use this function. If you're not using this function, you're probably doing something wrong, or you're trying to do something custom.

Once we have this mult/shift thing, we need to start abstracting the hardware. To abstract the hardware, we make a struct called clocksource. That's the thing that basically reads the hardware counter that you have, and it stores this mult and shift. And as you see in this time-diff example here, that's basically how you calculate how much time has passed. Now, you might have just this one hardware counter in your system, right? But you need to implement all these POSIX clocks, which is not the same as just one counter; these are multiple different timelines. You have CLOCK_BOOTTIME. You have CLOCK_MONOTONIC. These are different in the sense that they're not exactly the same time when you read one versus the other. Monotonic is this always-increasing time; it doesn't count the time you're in suspend, but otherwise it's just an always-increasing count. Boot time includes the monotonic time, plus the time you spent in suspend. And the real time clock is what we usually consider the wall clock, and it can be changed arbitrarily at runtime based on whatever the user decided to set the wall clock to. And then there's International Atomic Time, or TAI; I don't know how you pronounce that, actually. One more thing to say about the POSIX clocks is that there are these raw and coarse suffixes you see appended to some of them. Raw means there are no NTP adjustments, and you should pretty much never be using it unless you're doing some kind of NTP code, whereas plain monotonic is adjusted for NTP.
Raw just doesn't have those adjustments in it. And coarse is pretty similar to monotonic, but coarse basically gives you the time as of the last time we accumulated, without paying the cost of reading the hardware counter again, so it's cheaper but only tick-granularity. I can probably get to an example of that later, but that's just a different kind of clock that you can look into. Of course, you should probably just read your man pages on these kinds of topics.

So I tried to make a little comparison to show how these differ. You see here, a suspend happened, and monotonic stopped counting. Then the real time clock gets adjusted when a settimeofday() happens, and it goes back because I just set it back. But boot time just continues forever, all the way to the end. So you have these three different timelines, but again, we still only have one physical clock source, one piece of hardware that's counting up and continuing to count. So how do we accumulate this time, and how do we track it?

I came up with this really awful acronym called RAT: read, accumulate, track. You read the clock source, you accumulate the time that's passed since the last read, and you track the timestamp, the counter value, that you read from the clock source. If you look in the code, there's code that does the reading: timekeeping_get_ns() goes through the tkr's read hook, which is basically just reading the clock source. And you see here, it's also doing the mult/shift itself, with its own copy of the values, not the clocksource's copy; there are multiple layers involved in reading. Then we accumulate and track. We track into tkr_mono.cycle_last, which keeps the last cycle counter value we read from the counter.
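As a rough sketch of that read/accumulate/track loop, here's a toy model in C. The field names echo the kernel's (cycle_last, xtime_nsec, mult, shift, mask), but this is an illustration of the scheme, not the real struct timekeeper.

```c
#include <stdint.h>

/* Toy timekeeper: track the last counter value we consumed, and
 * accumulate elapsed time in "shifted nanoseconds" (ns << shift),
 * which is how the kernel keeps sub-ns precision in xtime_nsec. */
struct toy_timekeeper {
	uint64_t cycle_last;  /* last raw counter value accumulated */
	uint64_t xtime_nsec;  /* accumulated time, in ns << shift units */
	uint32_t mult, shift;
	uint64_t mask;        /* handles counters narrower than 64 bits */
};

static void toy_accumulate(struct toy_timekeeper *tk, uint64_t cycle_now)
{
	/* Masked subtraction makes counter wraparound come out right. */
	uint64_t delta = (cycle_now - tk->cycle_last) & tk->mask;

	tk->xtime_nsec += delta * tk->mult;  /* stays in shifted units */
	tk->cycle_last = cycle_now;          /* track what we consumed */
}

static uint64_t toy_get_ns(const struct toy_timekeeper *tk)
{
	return tk->xtime_nsec >> tk->shift;
}
```

The mask is why a 16-bit or 32-bit counter can wrap between two accumulations without time going backwards, as long as you accumulate at least once per wrap period.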
Then we're accumulating into xtime_nsec: we accumulate how much time has passed since the last read into this accumulator. And that's keeping track of one timeline. xtime, I guess, is kind of like the timeline that we adjust things against; that's not really true anymore, since we now have multiple timelines that we're accumulating and tracking, but xtime is one of the classic ones.

Once we have this, we can start supporting multiple timelines, because now we're accumulating into this accumulator. As you see here, if you want to support these other clocks, CLOCK_REALTIME, CLOCK_BOOTTIME, CLOCK_TAI, what we basically do is take the time we accumulated into the base and add an offset, an addition that adjusts the time we're maintaining onto that other timeline. Then we add in the time from the last cycle we read up to the cycle counter value we see now. We add all of these together and come up with CLOCK_REALTIME, or CLOCK_BOOTTIME, or CLOCK_TAI. Conceptually, maybe simplified, everything is an offset from this one true timeline that we're maintaining; you can think of all the POSIX clocks basically as offsets.

Now, what do we do if we actually have to handle clock drift? The hardware is never perfect, and it's not actually ticking at exactly the frequency you think it is. Software has to deal with real-world environments where hardware counters drift and things get out of sync. So what do we do? If you see here in this example, if the frequency is off by 8 hertz, we're steadily losing time: after 100,000 cycles, we've lost two nanoseconds. This is bad. We need to adjust for the drift we have.
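The everything-is-an-offset idea can be sketched like this. The offset names (offs_real, offs_boot, offs_tai) mirror the kernel's timekeeper fields, but the code is an illustration, assuming a mono_ns value that already includes the partially-elapsed delta described above.

```c
#include <stdint.h>

/* The POSIX clock ids, modeled as offsets from one monotonic timeline. */
struct toy_offsets {
	int64_t offs_real;  /* mono -> wall clock (moved by settimeofday) */
	int64_t offs_boot;  /* mono -> boot time (grows across suspend)   */
	int64_t offs_tai;   /* wall clock -> TAI (leap-second offset)     */
};

static int64_t clock_realtime(int64_t mono_ns, const struct toy_offsets *o)
{
	return mono_ns + o->offs_real;
}

static int64_t clock_boottime(int64_t mono_ns, const struct toy_offsets *o)
{
	return mono_ns + o->offs_boot;
}

static int64_t clock_tai(int64_t mono_ns, const struct toy_offsets *o)
{
	return mono_ns + o->offs_real + o->offs_tai;
}
```

A settimeofday() call then only has to change offs_real; the monotonic accumulator itself never jumps.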
The way to adjust for drift is basically to change the frequency factor that you multiply the cycle counter with, so you come up with the right number. The super-simplified way to think about it is that the NTP code is just going in and adjusting this mult factor a tiny bit each time, so it can make the clock go faster or slower, in a sense, but without ever making the clock jump forward or backward in time. When you read it again, you'll see time has sped up or slowed down slightly, and you don't really notice, but we've adjusted for the fact that the hardware is not telling you the truth.

The next thing we have to do is make things really fast and efficient. People want to read time from anywhere; basically, we want to read time from NMI context, from the deepest bowels of the kernel. So we can't take any locks, and we can't do anything too crazy; we need some poor man's RCU design. We track the time and the last counter value we read in this tk_read_base structure, and we make it NMI-safe by having a sequence counter and by duplicating the base, basically keeping two copies. When we're doing an update, we switch readers between one copy and the other, so in NMI context you'll always see something that's consistent. The catch is that it's not exactly going to be consistent if you compare reads taken inside an NMI with reads taken before and after it. In the code there's actually an example, which I copied and made into a graph to describe this. If you have a reader that reads right before we adjust the multiplier, then an NMI happens and reads while still on the old slope, because the frequency hasn't been adjusted yet, and then a reader comes after the NMI and reads the new adjusted slope, those readers are going to be out of sync and see different times.
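Here's a minimal single-writer sketch of that duplicated-base scheme, in the spirit of the kernel's seqcount latch (the real thing lives around the fast-timekeeper structures in kernel/time/timekeeping.c and uses proper memory barriers, which this toy version deliberately omits). The writer always updates the copy readers are not currently using.

```c
#include <stdint.h>

struct fast_base {
	uint64_t cycle_last;
	uint32_t mult, shift;
};

struct tk_fast {
	unsigned int seq;          /* readers pick base[seq & 1] */
	struct fast_base base[2];  /* two copies of the read base */
};

/* Writer: bump seq to steer readers to the other copy, update the
 * now-idle copy, then repeat so both copies end up current. */
static void fast_update(struct tk_fast *tkf, const struct fast_base *b)
{
	tkf->seq++;
	tkf->base[(tkf->seq & 1) ^ 1] = *b;
	tkf->seq++;
	tkf->base[(tkf->seq & 1) ^ 1] = *b;
}

/* Reader: lock-free and wait-free, so it would be safe even from the
 * equivalent of NMI context. Mid-update it sees the previous base,
 * which is exactly the "old slope" situation described above. */
static const struct fast_base *fast_read(const struct tk_fast *tkf)
{
	return &tkf->base[tkf->seq & 1];
}
```

The price of never blocking the reader is precisely the slope inconsistency from the slide: a reader latched onto the old copy computes time with the old mult until the update completes.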
They may actually overlap, or one may look further in the future than the other. People who read time in NMI context just need to know about this fact and keep track that it may happen. This is what we accepted, and it's pretty much the best we can do for letting people read time from anywhere and have it be highly accurate.

So with that, this is basically a block diagram of where we are with clocksources and maintaining timelines. CLOCK_MONOTONIC, CLOCK_BOOTTIME, all these things go into this timekeeper core structure, which has these monotonic bases inside, and that uses the clocksource, which actually goes and reads the hardware counter. And that finally tells you what time it is. So there are multiple layers involved just to figure out what time it is.

The next question that usually comes up is: what if my system doesn't actually have a counter? This is not really common, but you might not have one. And if you don't, we can still support it with a clocksource that's based on jiffies. Now, jiffies are not nanosecond resolution, so it's not going to be great, but it'll be OK. So let's talk about how jiffies work. A jiffy is just 1/CONFIG_HZ seconds, and that can be pretty large; it's configurable. Typically on an ARM system, I think it's 100, or 1000, depending on whether you're on a server or desktop or something. And it's updated during the tick. The tick, and that's not The Tick comic book, is this periodic event that's going to update jiffies, do your process accounting, do global load accounting, do all these things. Basically, as most people know, it's doing the time slicing for the system. Not everything is done every jiffy; it depends on what's going on. For example, the hrtimers, as I listed here, don't actually happen every jiffy, and irq_work might not actually happen.
It depends on whether it's implemented there or not. But this is pretty much a periodic event that happens quite often. If we need to implement the tick on hardware, we're going to have something like this, where the hardware is some kind of timer value that may be incrementing or decrementing; it could be counting down to zero, but in this example I just have some incrementing value and then a match value. So this hardware basically lets us program some value to match and then raises an interrupt for us. This is some specific feature of some chip or some hardware, right? And we need to abstract that. So what we do is make a clock_event_device. This is how we encapsulate something that can raise an interrupt, or create an event, at some requested time.

Now, the thing that's important to see here is that the event handler is where we basically call a function to run the tick, in a sense. And among these other things you see, the unsigned int features field, there are different types of features: there's a periodic feature, there's a one-shot feature, and ktime. Periodic is something that says, hey, I can raise an event every 1/CONFIG_HZ seconds, and I've configured my hardware to do that. One-shot says: I can raise an interrupt so many cycles from now, where that cycle count is converted from nanoseconds down to whatever the cycle value would be for that clock event. And then ktime is this kind of one-off feature that we use for broadcast hrtimers, which I don't really talk about here, but we'll talk about the broadcast system a little later. That's just a clock event feature that signals this thing only deals with ktime: it doesn't need any translation between cycles and time or time and cycles, it deals purely in nanoseconds.
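One-shot programming can be sketched like this: convert the nanosecond delta into counter cycles with a mult/shift pair going the opposite direction from the clocksource path. This mirrors the spirit of what the kernel does when programming a clock event, but the struct and the "match register" here are hypothetical, and the kernel's conversion also handles rounding and min/max clamping that this sketch ignores.

```c
#include <stdint.h>

struct toy_clock_event {
	uint32_t mult, shift;  /* ns -> cycles this time, not cycles -> ns */
	uint64_t match;        /* hypothetical match register: the IRQ
	                        * fires when the counter reaches this */
};

/* mult = (freq << shift) / 1e9, so that cycles = (ns * mult) >> shift. */
static void toy_calc_mult(struct toy_clock_event *ce, uint32_t freq_hz,
			  uint32_t shift)
{
	ce->shift = shift;
	ce->mult = (uint32_t)(((uint64_t)freq_hz << shift) / 1000000000);
}

/* Program a one-shot event delta_ns from now. */
static void toy_program_event(struct toy_clock_event *ce,
			      uint64_t now_cycles, uint64_t delta_ns)
{
	uint64_t cycles = (delta_ns * ce->mult) >> ce->shift;

	ce->match = now_cycles + cycles;
}
```

Note the truncation: for a 1 MHz timer, a 1 ms delta comes out to 999 cycles rather than 1000, which is the kind of edge the real code has to be careful about.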
So there are three different event handlers to think about. Well, there are essentially more than three, but there are three main event handlers. There's tick_handle_periodic, which is the default one, which I'll talk about in a little bit. Then there's the tick NOHZ handler, which is the low-res mode. And there's hrtimer_interrupt, which is the high-res mode. I have a few examples to show the difference between these three.

tick_handle_periodic is the default function: everybody who registers a clock event, if nothing else happens, gets this event handler. It runs the tick every tick period, and usually after one tick it's pretty much gone, because hrtimer high-res mode is going to take over, or low-res mode, the NOHZ handler, is going to take over. This handler can also emulate periodic mode even if you have a clock event that only does one-shot: the event handler can reprogram the clock event for the next period each time, based on how your system is configured. But overall, the tick always happens on a regular HZ basis, and you can't stop it during idle. You can't turn off the tick; you have to keep ticking every period, because it's a periodic interrupt, and that's all we've got.

Now, compare that to the NOHZ handler, which is supposed to remove the tick during idle. When we're in NOHZ, we can reprogram the clock event to not fire an interrupt every tick, and instead just tell it to raise the event when we're going to wake up again, for the time the next task, or whatever else, has to run. And the NOHZ handler requires that you have the one-shot feature on the clock event. Otherwise, we can't do this, and we can't actually turn off the tick, because all we can do is periodic.
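Emulating a periodic tick on one-shot hardware, as described above, boils down to re-arming the event inside the handler each time it fires. A toy version (the names and the assumed HZ=100 are mine, not the kernel's):

```c
#include <stdint.h>

#define TICK_PERIOD_NS 10000000ULL  /* 10 ms, i.e. an assumed HZ of 100 */

struct toy_tick {
	uint64_t next_event_ns;  /* absolute time of the next tick */
	unsigned long ticks;     /* how many ticks have run so far */
};

/* One-shot event handler: do the tick work, then re-arm for the next
 * period -- this is how periodic mode gets emulated on one-shot-only
 * hardware. */
static void toy_tick_handler(struct toy_tick *t)
{
	t->ticks++;                          /* jiffies++, accounting, ... */
	t->next_event_ns += TICK_PERIOD_NS;  /* program the next one-shot */
}
```

Skipping the re-arm, or re-arming for a much later deadline, is essentially what stopping the tick in NOHZ mode amounts to.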
But this is a low-res mode, because unlike the hrtimer_interrupt mode, we can't raise interrupts in between two ticks; we can't program the clock event to fire in the middle of a tick period. When you have hrtimer_interrupt set up as your event handler, you're running in the high-resolution mode of NOHZ. It's really similar to low-res, but you can set up an event to fire at any point on the nanosecond timeline. And this just shows the difference, right? There are a couple of hrtimers I threw into the middle of idle that may actually fire, but no task runs, and there's also an hrtimer in the middle of tasks, for example.

So we need to make an abstraction for these tick devices, because we want a tick for every CPU in the system, and the clock events aren't actually owned by the core framework; the clock events are registered by individual architectures and individual clock event drivers. Basically, we have this struct tick_device as just the storage for every CPU's current tick device. It also has its own mode, tick_device_mode, which is very similar to the clock event modes, but you'll notice there's only periodic and one-shot. Periodic basically means the device has been configured to call the periodic tick handler each time a jiffy, each time a tick, is supposed to happen. One-shot means we've configured it to call either the NOHZ handler or the hrtimer handler, so we're in either low-res or high-res NOHZ mode, and we always use one-shot mode for these tick devices if we're going tickless.

So how do we run the tick? Basically, the hardware raises an IRQ, or something happens that raises an interrupt into the clock event code, and the clock event then calls the event handler that's been configured.
That either goes into hrtimer_interrupt for high-res, into the NOHZ handler for low-res, or into the periodic handler. The event handler in the middle is either going to call the hrtimer-running code, so it runs the hrtimers (it's not a wheel, hrtimers aren't a wheel), and that runs this tick_sched_timer, which is the actual tick, implemented as an hrtimer inside this tick_sched structure. Or we're going to run the NOHZ handler, which runs the tick directly without going through that hrtimer. Or we have the periodic handler, which also just calls the tick-running code directly. So the only time we actually run the tick as an hrtimer is when we're in high-resolution mode; otherwise, we're just calling the tick code directly. We do still reuse some of the hrtimer code even in low-res mode, to reprogram the next tick, but we don't actually run the hrtimer function; I guess it's a minor difference. And when we're running this per CPU, we actually duplicate it: if you think about it, if you have four CPUs, you're doing this four times, once on every CPU, calling all the same paths. That's why the tick_sched structure is a per-CPU structure.

So what do we do when we want to stop the tick? It's not always as simple as just cancelling this hrtimer, because we might not actually be using the hrtimer; we may be in low-res mode. And it might be that we have to restart the timer for some point far in the future, so it might have to do an hrtimer start. There are actually quite a few things we need to consider when we're stopping the tick, and the code is actually quite complicated. We need to consider things like pending timers and hrtimers, and maybe we need to let RCU know.
Maybe we need to figure out if we're the CPU that has to handle the jiffies update, or maybe some other CPU is responsible for updating jiffies; it depends on what's going on. We also need to make sure we don't go idle for so long that, when we wake up again and read the clock source, the elapsed time is so large that accumulating it overflows our mult/shift calculation, and the math tells us we were idle for two seconds when we were really idle for 20 minutes or something. We need to try to avoid these problems. All of this lands in a function called tick_nohz_stop_sched_tick, which has all these different cases handled and covered, and it's actually fairly well commented. So I encourage you to read the code if you're interested in seeing how bad it can get.

So what do we do when your clock event doesn't actually work while you're idle, maybe because your CPU has the clock event inside of it, and when you power down the CPU, the clock event can't raise the interrupt? For that, we have another mechanism called tick broadcast. The tick broadcast is basically made for these CPUs, these systems, where your clock event dies in some low-power mode. There's a flag for this called CLOCK_EVT_FEAT_C3STOP, which is an x86-ism, from the C3 idle state where you lose your clock event. That flag name still exists even though it's not literally about C3 outside of x86, I think. So what we have to do is emulate the per-CPU tick devices across the entire system with a global tick device that always works. The global tick device is this tick broadcast device, backed by a hardware event that can always raise an interrupt even when the system is in low power. If a CPU is in low power, it somehow figures out how to wake up that CPU and then calls the clock event for that CPU, calls the clock event's broadcast function.
As you see here, the IRQ goes into a clock event, and that calls one of two event handlers that are specific to the broadcast clock event. That calls into this tick_do_broadcast function, which then calls the broadcast member of the clock event structure, which then finally calls the actual event handler that you really wanted to call on that particular CPU. So this gets fairly complicated, and there's actually quite a lot of code involved in doing this broadcast mechanism. But it's almost always required on ARM systems nowadays, because the clock event practically always dies: usually it's embedded inside the CPU, and it never works when you're in low-power idle. So this feature is used quite extensively.

The next thing that happens is maybe we need to implement some timers on top of this. So the event handler runs, it runs the hrtimers, and that runs the tick. Eventually, when we run the tick, we raise this softirq, and that softirq is actually going to run the timers for the timer wheel. So after this long chain of events happens, we finally run the timers. And that's OK, because the timers are basically jiffies granularity; they don't really care about anything besides making sure that this event handler runs, the softirq is raised, and jiffies get updated when the tick runs, so the timer wheel can move forward. So timers are, I guess, I don't know if they're simpler or harder than hrtimers, but for hrtimers it's fairly similar, right? We do the same thing: we call the clock event, we call the event handler, we call the hrtimer-running code. But the difference is that in low-res mode, the tick runs first. That code then calls into the hrtimer code to run the hrtimers, which calls into the timekeeping framework to figure out what time it is, and that finally runs the tick_sched_timer.
And potentially however many other hrtimers expired at that point also run, and then everything just continues on again; the system just keeps running. It's important because in low-res mode we can still keep using hrtimers, but the hrtimers aren't actually high resolution, they're just jiffies granularity. Everything still works, nobody has to change their code to use hrtimers versus timers, and everybody's happy.

So in summary, we've covered some of the clocksource code, some timekeeping, how the clock events work, how jiffies get updated, and some of how NOHZ works. Of course, we didn't cover NOHZ full, but we covered NOHZ idle. We covered broadcast and timers. The thing I wanted you to take away is that timekeeping has lots of abstraction layers, split across, I don't know, six or seven different files, and it's kind of hard to keep track of where you are in the code; hopefully this presentation helps you figure out where you are. And NOHZ actually makes things really complicated: when you want to start and stop the tick, it's not tons of code, but there are lots of different cases involved, and you have to keep all of those cases in mind. And when things go wrong in this code, things go really wrong: time gets really messed up, and the system just really starts failing. You may see things like RCU failures, you may see time just jump forward or backward, it depends, but it can get pretty nasty. But overall, the code is very solid, and it's usually not a problem in here; it's usually a hardware problem, actually. And then the broadcast stuff makes things really complicated too, because we're doing lots of cross-calling and synchronizing between CPUs, and it's not really that pretty, but it's a good-enough solution for now.
So with that, thanks to John Stultz for reviewing my slides, and thanks to remark.js, the fancy JavaScript library I used to make these slides. With that, I think I'm done. If anybody has any questions or anything you want to talk about, feel free to ask.

[Audience] Is everything you presented upstream?

Yeah, everything I'm talking about here is upstream. Nothing is specific to my employer; it's just me talking about the code that people complain about. Okay, there's one over there.

[Audience] How do you detect when the clock drifts?

I didn't really go into the details, but I think that's all done in user space. User space figures out the drift and tells us the NTP adjustment; as far as I understand, that's all outside of the kernel. It's not the kernel detecting this. We have to be told what the real time is supposed to be, and then some complicated code figures out that our time is slightly different and adjusts it to match. I'm no expert in that NTP adjustment stuff, you can ask other people, sorry. You in the back?

[Audience asks whether you'd ever want the raw elapsed time rather than the adjusted time.]

So the question is basically: when you want to know how much time has passed, do you want the raw cycle count, or what your adjusted clock thinks the difference is? I think the thing there is that even with the hardware drifting, you usually want the adjusted values, because you want to know the real elapsed time; you don't want to know what your drifting hardware counter thinks the difference was. That's why ktime_get always goes through monotonic, and doesn't try to do this raw or coarse stuff.
It just lets the adjustment happen so your reading doesn't get way out of sync with the real world. Monotonic includes the drift accounting; it's only raw that leaves it out. I think we do have some TSC clocksource reading interfaces; I don't know if John wants to talk about that. Is there another question, or is that good? Okay, thanks for coming.