All right, hi everybody. This presentation is about timekeeping, timers, the tick, and turning off the tick, which is to say tickless kernels. I'm Joel Fernandes. I work at Google, and I've been working on kernels for about 13 or 14 years now. This is a very interesting area to me, so I hope you'll find this presentation useful. With that, I'll go over the agenda. One thing I'd say right off the bat is that there's a lot of content and a lot of detail, so don't put too much pressure on yourself to understand everything I discuss; just see what you can pick up. Roughly, here's how I've split it: first I'll talk about timekeeping, that is, reading the time and knowing how much time has passed. I'll briefly go over the userspace APIs, and then I'll go into the kernel implementation to describe how it works. Then I'll do the same for timers. The focus of this talk is mostly the kernel, so I'm trying to focus on that more than on user space; the other reason is that most of the userspace material is documented in the man pages. But I still want to mention the userspace side. At the end, I'll go deep into the internals of how timers and the scheduling-clock interrupt are implemented, and if time permits we'll also go over the vDSO.

All right. First, let's talk about how you get the current time. There's an API called clock_gettime, also called the POSIX clock API. The idea with these POSIX clocks is that each has a clock ID associated with it. You call clock_gettime with the clock ID and pass it a struct timespec, which has a seconds field and a nanoseconds field, and the kernel fills it in.
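As a quick illustration of the call just described, here is a minimal sketch using Python's thin wrappers over the same POSIX clocks (Linux-specific; Python folds the struct timespec fields into a single float of seconds rather than returning the raw struct):

```python
import time

# clock_gettime(clockid): the underlying syscall fills a struct timespec
# (tv_sec, tv_nsec); Python returns both fields combined as one float.
mono = time.clock_gettime(time.CLOCK_MONOTONIC)   # time since boot (no suspend)
real = time.clock_gettime(time.CLOCK_REALTIME)    # time since the Unix epoch

print(f"CLOCK_MONOTONIC: {mono:.9f} s")
print(f"CLOCK_REALTIME:  {real:.9f} s")

# Successive monotonic reads never go backwards.
assert time.clock_gettime(time.CLOCK_MONOTONIC) >= mono
```

The same two clocks are what the rest of the talk keeps coming back to: monotonic is what the kernel actually maintains, and realtime is derived from it.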
The clock IDs are similar, but they differ slightly in how they track how much time has passed, and we'll go over what the different clock IDs mean. Another API to get the time is gettimeofday, which works directly on the real-time clock and cannot be passed a clock ID; that API has mostly been replaced by clock_gettime in the recent past.

So let's go over the clock IDs and what they mean. First we have the CLOCK_REALTIME clock ID, which is affected by changes to the time made by the user: the user can set the time using clock_settime, and this clock can be changed that way. It's also adjusted by NTP. If NTP detects that the clock is drifting, counting too fast or too slow, it issues the adjtime system call, which changes the rate at which the clock ticks, because these clocks are not perfect: they're driven by crystals and the like, so there can be some error in their progression. The next clock is CLOCK_MONOTONIC. CLOCK_MONOTONIC is not affected by changes to the time made by the user; however, it is affected by NTP clock-rate adjustments. The other important thing is that CLOCK_MONOTONIC does not account for suspend time: if you suspend the system, CLOCK_MONOTONIC stops counting. Next is CLOCK_BOOTTIME. This is exactly like CLOCK_MONOTONIC in that it cannot be set by the user but its tick rate can be adjusted by NTP, except that it does account for suspend time. That's the difference between monotonic and boot time. There's also CLOCK_MONOTONIC_RAW, which I'll describe briefly: it cannot be adjusted even by NTP; it just ticks.

The other important thing is that these clocks differ in their starting point. CLOCK_REALTIME counts from the epoch, which is decades back, whereas monotonic and boot time count the time since the system booted. That's a big difference between monotonic and real time. The rest of the columns on this slide compare all the clocks. I've described all of this already, but the slide will be useful for anyone who wants to go back and look at what the differences between the clocks are.

Okay, so how do you set the time? I won't go over this in much detail, because it's all in the man pages, but you can set the time using the same kinds of APIs: clock_settime, adjtime (which I already talked about), and settimeofday as well. You can also query the resolution of a clock; these clocks can have different resolutions because of the capabilities of the hardware, and you get the resolution with clock_getres.

All right, that's all userspace, and I went over it quickly because it's all documented and pretty straightforward. Now let me talk about how timekeeping is supported in the kernel. We have a struct timekeeper data structure, and its tkr_mono and tkr_raw fields keep track of the monotonic time. The interesting part is that all the other clocks, like CLOCK_REALTIME and CLOCK_BOOTTIME, are derived: they're just offsets from that clock. The kernel only has to keep track of the monotonic time, and the other clocks are offsets from it. That way, every time the kernel updates the time, it doesn't need to redo the update for every clock ID; it can just apply an offset to the monotonic value. Something else I want to mention is that several of these timekeeping APIs are implemented in the vDSO.
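To make the clock-ID differences above concrete, here is a small sketch using Python's wrappers over the same POSIX calls (Linux-only; the helper name describe is my own, not an API):

```python
import time

def describe(name, clk):
    # clock_gettime reads the clock; clock_getres reports its resolution.
    val = time.clock_gettime(clk)
    res = time.clock_getres(clk)
    print(f"{name:22s} value={val:18.9f}  res={res:.9f}")

describe("CLOCK_REALTIME", time.CLOCK_REALTIME)            # epoch-based, settable, NTP-adjusted
describe("CLOCK_MONOTONIC", time.CLOCK_MONOTONIC)          # boot-based, stops during suspend
describe("CLOCK_BOOTTIME", time.CLOCK_BOOTTIME)            # boot-based, includes suspend time
describe("CLOCK_MONOTONIC_RAW", time.CLOCK_MONOTONIC_RAW)  # no NTP rate adjustment at all
```

On a machine that has been suspended, you can see CLOCK_BOOTTIME running ahead of CLOCK_MONOTONIC by exactly the accumulated suspend time.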
That's for performance reasons: the kernel maps this page, a small ELF object, into user space, and user space can then call functions in that object directly, doing things like gettimeofday without having to make a system call. This is a huge performance improvement. I'll go over the vDSO mechanics if we have time, but the thing I want to mention is that most of the time when you read the time, you don't actually go into the kernel; the kernel provides all of the information to user space through the vDSO.

All right, now let's go a little deeper and look at the hardware as well. We have an abstraction in the kernel called the clocksource, which is basically an abstraction over a simple counter that can be read. Ultimately, the time has to come from this hardware counter, which is just counting at a certain rate; we'll go over all that. The counter can be read directly by the CPU whenever it needs to know how much time has passed or what the time is right now. It finally boils down to this. To give an example, on x86 that counter is called the TSC. It's a 64-bit counter, and it's an MSR, so it's really fast to read. It still has overhead, though; it's slower than a cache access, so you can't just keep reading the TSC and hope it's free. It's high resolution because it's driven by the CPU clock, and you read the TSC with the RDTSC instruction; there's also an RDTSCP instruction, which gives you the CPU number as well, not just the counter value.

Coming back to the clocksource, this is what the clocksource structure looks like in the kernel. There's a read callback: the hardware driver that registers the clocksource tells the kernel, "you can call this function to read the counter." That's the read pointer. When you register the clocksource you also provide a frequency, which is used to calculate the multiplier and shift, because remember, this is just a counter. It cannot directly tell you what time it is; you need to know how fast it ticks and then scale the counter value into time, using a multiplication and a shift. That's what the mult and shift fields are for. You can even call the read handler of the clocksource structure directly; I have an example there where you do two reads and figure out the time difference between them.

One thing to note is that for this to work, the clocksource values, that is, the TSC values, must be synchronized across all the CPUs, because what happens if you read the time on one CPU and then read it again on another? You're reading two different TSCs, so they have to be in sync. This is something the kernel absolutely requires, and it goes to some effort to make sure the clocksources are synchronized, in the case of x86 the TSC, with protocols it uses to make sure everything is working properly and that it can trust the TSC.

So what do we use the clocksource for? Before that, a couple of questions. One question: is this x86-specific? Yes, RDTSC is x86-specific. Another question: when we expose a clocksource or some kind of counter, does user space go through the file system to read the value? No, there's no involvement of the file system here. There's a system call that reads the time, and if the vDSO is available then user space reads the hardware directly. Nobody uses the file system for getting the time; you use the clock_gettime system call or gettimeofday.
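The mult/shift scaling just described can be illustrated outside the kernel. This is a standalone sketch of the idea with made-up example numbers; it mirrors what the kernel's clocks_calc_mult_shift() computes for a clocksource, but it is not that code:

```python
# Convert raw counter cycles to nanoseconds without a division in the
# hot path: ns = (cycles * mult) >> shift.
NSEC_PER_SEC = 1_000_000_000

def calc_mult(freq_hz, shift):
    # Choose mult so that (cycles * mult) >> shift ~= cycles * 1e9 / freq_hz.
    return (NSEC_PER_SEC << shift) // freq_hz

def cycles_to_ns(cycles, mult, shift):
    return (cycles * mult) >> shift

# Assumed example hardware: a 1 MHz counter, i.e. 1000 ns per cycle.
SHIFT = 10
MULT = calc_mult(1_000_000, SHIFT)

t0, t1 = 100, 105                          # two hypothetical counter reads
print(cycles_to_ns(t1 - t0, MULT, SHIFT))  # prints 5000 (5 cycles * 1000 ns)
```

The shift trades precision for headroom: a larger shift makes mult more precise, but limits how large a cycle delta can be converted before the multiplication overflows, which is one reason the counter must be consulted periodically.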
When you have a vDSO, that path becomes quicker, because the value is exposed to the C library. And the constant-TSC question, I'm about to cover that. On hardware other than x86 you'll have an equivalent counter, but usually it's something like memory-mapped I/O; it depends on what the clocksource's read handler does. You can look at what the read handler does, but typically on other architectures what I've seen is that it reads some memory-mapped I/O register and gets the counter value from that.

Okay, so that's all great, but what do we use the clocksource for? Two things. One is timekeeping: we have to keep moving the system clock forward. The other is reading the time at any given instant. Let's go over both. This is the algorithm for moving the system clock forward. We cannot just wake up at some arbitrary point and ask for the time; we have to keep accumulating time, because the clocksource counter cannot keep track of time forever on its own (it can wrap), so it has to be consulted and read periodically. This is the algorithm for that, and it runs every jiffy, that is, on every scheduling-clock interrupt (I'll go over what the scheduling-clock interrupt is as well). Every jiffy, when the interrupt goes off, we read the clocksource, look at the clocksource value we read last time, and compute the delta between the two. That's how long it has been since we updated the system time. Then we scale that, because remember, it's just a counter value, not a time, so we scale it with the multiplier, and we accumulate it into the timekeeper structure, in a field called xtime_nsec.

So we keep accumulating nanoseconds every time we update the time. If the nanosecond count overflows, that is, goes over one second, we normalize it: we increment the seconds field and reduce the nanoseconds by the same amount, because we don't want the nanosecond count to grow too large. We do this normalization, but essentially the flow chart I'm showing here runs every jiffy.

To summarize the previous chart: the clocksource is read and accumulated into the timekeeper structure. The structure has two components for tracking time: a nanoseconds field and a seconds field, and both are combined to get the total time. The clocksource value that was read is also saved, because the next time you do a timekeeping update you need to compute the delta against it.

Next: that was only for updating the time, but what about reading the time at any given instant? The timekeeper is only updated once a jiffy, so in between updates it's stale. For that, we combine the time accrued in the timekeeper every jiffy with the delta between the latest clocksource value and the value that was used at the last timekeeper update, and then we make further adjustments. This is what it all looks like. We have the monotonic time, which, as I said, is the only thing updated when we update the system time. We combine that monotonic time with per-clock offsets to get the final value for each clock: for real time there's an offset from the monotonic time, and for boot time there's an offset because the suspend time has to be added to the monotonic time.
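The per-jiffy accumulate-and-normalize loop above can be sketched as a toy model. The field names here (cycle_last, xtime_sec, xtime_nsec) mirror the idea conceptually, but this is an illustration driven by a fake counter, not the kernel's actual struct timekeeper:

```python
NSEC_PER_SEC = 1_000_000_000

class Timekeeper:
    def __init__(self, read_counter, mult, shift):
        self.read = read_counter          # the clocksource read callback
        self.mult, self.shift = mult, shift
        self.cycle_last = read_counter()  # counter value at the last update
        self.xtime_sec = 0
        self.xtime_nsec = 0

    def update(self):                     # runs every jiffy (every tick)
        now = self.read()
        delta = now - self.cycle_last     # cycles since the last update
        self.cycle_last = now
        self.xtime_nsec += (delta * self.mult) >> self.shift
        while self.xtime_nsec >= NSEC_PER_SEC:   # normalize the overflow
            self.xtime_nsec -= NSEC_PER_SEC
            self.xtime_sec += 1

# Fake 1 GHz counter: mult/shift chosen so 1 cycle == 1 ns, advancing
# 400,000,000 cycles (0.4 s) between reads.
ticks = iter(range(0, 10**10, 400_000_000))
tk = Timekeeper(lambda: next(ticks), mult=1 << 8, shift=8)
for _ in range(5):
    tk.update()
print(tk.xtime_sec, tk.xtime_nsec)        # prints 2 0 (5 * 0.4 s accumulated)
```

Reading the clock at an arbitrary instant then just adds one more scaled delta, read() minus cycle_last, on top of xtime_sec/xtime_nsec, which is exactly the second use of the clocksource described next.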
So the monotonic time, which is updated every jiffy, is combined with these offsets to get the final value of the clock. But that's not good enough, because we also need to know how much time has passed since the last update: we read the clocksource, calculate the delta, add that delta to what we have so far, and then we have the latest clock value. That's what happens when you call clock_gettime, and it's how we know the time at exactly the instant of the call.

For completeness, I want to show you the functions in the kernel that are executed when the wall-clock time is updated. It's not critical, but if you're a kernel developer looking at the sources, now you know the call stack: essentially an interrupt goes off, we update jiffies, and while updating jiffies we also note that we need to update the wall time.

Now let's jump into an x86 clocksource, the TSC, in more detail; mostly I want to go through the issues the TSC has. There are a couple of questions first. One question: can you say something about constant TSC, migration, and the effect of P-states, am I planning to cover that? I will cover constant TSC, yes. And there is an effect of P-states when you change the frequency; I'll go over that a little later on. Another question: when user mode and kernel mode write and read the vDSO data, must that be done with an atomic instruction? I did paste the vDSO man page link there. The vDSO is not a single instruction; it's code, so you run a piece of code to read the time. It short-circuits some of the system call overhead, so it's lower overhead than a system call, but you still execute some code; it's not one single atomic instruction. The concurrency between the kernel writer and userspace readers is handled by the vDSO implementation, so if that's your concern, it's taken care of. Go ahead and read the man page; it gives you more information about vDSOs. And yes, it's a shared page between kernel and user mode.

Let's see, there are lots of questions coming up; stop me if you want me to continue. There's a question about per-CPU behavior: the timekeeping update is not per-CPU, because the time is global. There's one CPU designated as the tick_do_timer CPU, and it's the one that runs the tick_sched_do_timer function in this call stack; that designated CPU updates the wall time whenever its jiffy interrupt goes off. So it's not per-CPU, it's definitely global. Okay, we'll move on and take questions after.

All right, back to the TSC. The TSC is a 64-bit counter, it's an MSR, so it's fast, as I said; it runs at gigahertz rates off the CPU clock, and it has some issues. One of the issues the TSC has, depending
on the hardware, is that it might not be frequency-invariant. Remember, we depend on the TSC ticking at a constant rate, because we need to know how much time has passed; we cannot use the value if it keeps ticking at different speeds, because then we cannot convert it to time reliably. This is mostly on older Intel CPU models. More recent CPUs have the constant_tsc flag, which somebody asked about, and which tells you that the TSC rate is constant and will not change across frequency changes. So what you'll see is that if the CPU does not have the constant-TSC feature, the kernel monitors for cpufreq changes: the cpufreq subsystem specifically checks whether the CPU lacks constant TSC, and if it does, it marks the TSC as unstable and the kernel switches away from the TSC to a more reliable clocksource, which I'll get to in a bit.

We also have another issue with the TSC: it can stop in deep idle. The TSC can stop counting in an idle state, because it depends on the CPU clock, so on some hardware, again mostly older parts, the TSC could stop if the CPU goes into deep idle. To know whether your TSC is plagued by this, check the nonstop_tsc flag: if the CPU has nonstop_tsc, then going into idle does not stop the TSC. And just like cpufreq, the cpuidle subsystem marks the TSC as unstable if it's possible that the system will enter an idle state deep enough to turn off the TSC. The idle subsystem doesn't say "I won't enter deep idle because I want the TSC to keep working"; it says "the TSC is not reliable, don't trust it, because I'm going to do whatever I want, I'm going to enter deep idle anyway." That's the behavior, and similarly to cpufreq, there's a reselection of the clocksource because the TSC is no longer reliable once it's marked as such.

Now, even if the TSC doesn't have any of these issues, the kernel still doesn't fully trust it. There's a flag called CLOCK_SOURCE_MUST_VERIFY that is passed along when the TSC is registered, and what this flag does is trigger a mechanism called the clocksource watchdog, which keeps checking whether the TSC is really behaving properly. What it does is take another clocksource, outside the CPU, and compare it against the TSC. If the progression of time looks different between the two clocksources, it marks the TSC as unstable as well, and again switches to another clocksource.

Okay, that's all for time and timekeeping. I think we're right on time, no pun intended; I was going to keep the timer jokes to a minimum, but I can't help it. Shall we move on, or do we have any burning questions? We have some questions, so it might be better to handle them now. From the questions coming in, it seems a little more clarity might be needed on the system-call side versus the vDSO; I covered the kernel side of things, the kernel code handling the calls, but there are more questions about the vDSO, so I'd recommend people educate themselves on it. The vDSO is definitely very well documented, and there are full-blown presentations on it. As far as I understand, the vDSO doesn't directly make any system calls; I think there are some cases where it does
though: if it figures out that it's trying to read the time but isn't able to, it will fall back to making a system call from within the vDSO. But that's not the common case; the whole point of the vDSO is to directly do what the kernel would have done. So go ahead and look at that, those of you with vDSO questions, and then reach out to me; the scope today is more the kernel side of things.

Another question: is there one instance of the timekeeper used in the kernel, while there are different clocksources you can select from? That's correct: there's only one clocksource active at any given time. We might switch to a different clocksource, for example if the clocksource watchdog triggers. There's a rating field in each clocksource which tells you how good the clocksource is, and the kernel chooses the best clocksource of all of them. Some of it is hardware-dependent, and you can look at sysfs and see all the available clocksources and so on; it's all there.

There's a question about preventing clock anomalies, and another, I can't see the name, about migration and dealing with monotonic time going backwards. That's a little more involved. We definitely don't want time to go backward; if time goes backward, that's called warping, and the kernel actually rejects the TSC if it figures out that it's doing that. It's absolutely something that cannot happen, because it would mess up the kernel's internals. You don't want that happening, because everything depends on it: file-system journaling, everything goes sideways. In fact, for the first CPU that comes up in a socket, the kernel actually runs a test where it checks whether time is going backward, by comparing that CPU's TSC with a CPU on another socket, and if it finds time going backward and such, it marks the TSC as unstable. It's very complicated; the kernel does a lot of work to make sure the TSC is working and synchronized, but the end result is that it has to be synchronized for things to work.

Another thing I want to mention: the tracing subsystem reads the TSC directly, for performance reasons. It doesn't go through timekeeping; it doesn't do what clock_gettime does; it reads the TSC directly. So if you look at an ftrace log, you definitely want the events to appear in some kind of order. They might interleave differently depending on luck, but you don't want events going back and forth; that would be horrible for ftrace, because the trace wouldn't make sense. So we definitely want the TSC moving forward and in sync.

And then there's a question on KVM. I haven't gone through timers yet, but there's definitely support for timers in KVM; I'm not going to cover it, it's a topic of its own. So shall we move on? All right.

Now let's go over timers and see how timer events are handled. In the same spirit, let's first look at user space. I'll go over it quickly again, because the focus of this talk is mostly the kernel and all of the userspace material is documented in the man pages, but I want to briefly cover these things. We have POSIX timers, we have timerfd, and sleeping arms a timer too. We have system calls that program the timer hardware for timeouts, and there are users in the kernel as well that require
timer events.

POSIX timers have the timer_create API, which you call with a clock ID; that clock ID determines how the time progresses, based on the clock definitions we just went over. This is a per-process interval timer: you get a unique timer ID, and that timer is common to all the threads in the process. There are some additional clocks beyond the ones we discussed that you can pass as the clock ID to timer_create: CLOCK_PROCESS_CPUTIME_ID measures the CPU time consumed by all the threads in the process, and CLOCK_THREAD_CPUTIME_ID counts only the CPU time of the calling thread. All of this is very well documented in the man pages; feel free to go over it. There's also a struct sigevent that you pass to timer_create, which tells the kernel how the caller should be notified when the timer expires, for example, which signal to deliver. That's how timer handling happens for POSIX timers created with timer_create: you use signals to run your timer callback on expiry. I don't want to go through this in too much depth, but this is how you would arm it: there's a timer_settime API to arm and also disarm it, you can use timer_gettime to figure out how much time is left until the next expiration, and you can program the timer to be one-shot, so it fires once and doesn't repeat, or tell it to expire at an interval. There are all kinds of things you can do with POSIX timers.

Now, to tell you how this is implemented under the hood: there's an infrastructure called high-resolution timers (hrtimers), and these POSIX timers use hrtimers inside the kernel. In the kernel you'll see a structure called k_clock: for each clock ID there's a k_clock like this, with a bunch of fields that are pointers to the functions that run when the POSIX timer needs to do something, and you'll notice a lot of them have the word "hrtimer" in their names. I want people to notice what these POSIX timers are using inside: it's actually an hrtimer in the kernel doing the work. That's also historically important: when hrtimers came out, one of the use cases was supporting POSIX timers, because you have these high-resolution clocks and you want to use them to run timers, and before the hrtimer infrastructure that was not possible in the kernel. So keep in mind that POSIX timers use high-resolution timers under the hood.

There are also two additional clock IDs, CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM. They're similar to real time and boot time, but they can also wake the system up during suspend, and you'll see a separate k_clock for those extra clock IDs as well. Using such a POSIX timer you can even wake the system from suspend at a certain point in time; I believe Android uses this feature. That obviously depends on some hardware that is still ticking while the system is suspended; typically that's the RTC, and it can wake the system from suspend.

Similar to POSIX timers, we have another way of programming timers called timerfd, which is based on file descriptors. Unlike the timer_create POSIX timers, which hand you a number, a timer ID, these use file descriptors, and the advantage is that you don't need to use signals with timerfd: you can just use select or poll, so programming is a lot easier. This also uses hrtimers under the hood. Feel free to look at the docs to see how to use timerfd, but
the main difference between timerfd and POSIX timers is that timerfd deals with file descriptors. Here's a table I came up with that compares POSIX timers and timerfd. You create them with timerfd_create or timer_create; in timerfd's case it returns a file descriptor, and in the POSIX timer case it returns a timer ID. To delete the timer you use the close system call with timerfd, because it's a file descriptor, so the usual file-descriptor system calls apply, whereas with POSIX timers you call timer_delete. To arm it you use the settime variant of each API. Portability is something to think about: timerfd is Linux-specific, whereas POSIX timers are in the POSIX standard, so they're more portable. And synchronization is easier with timerfd, because you don't have to deal with the concurrency issues of signals; you can fit timerfds easily into your event loops and use poll, select, and so forth, whereas with POSIX timers it's a little harder because you have to use signals and worry about concurrency.

Okay, so now let's jump into the kernel implementation, at the lowest level. These timers are per-CPU, and there's something called a clock event. Just like the clocksource, there's a clock event structure, and a clock event is basically an abstraction over a device which generates an interrupt at a programmed time in the future, which is a timer: a timer does exactly that, and that's what the clock event abstraction represents. You have two types of these: the per-CPU ones, where every CPU has a timer that does what we've been talking about, and a global timer that is independent of the CPUs and can also generate timer events. There are clock events for both; both are represented by a clock event. In the per-CPU case on x86, that's the local APIC timer, which I'll go over, and the global one is the HPET, the High Precision Event Timer, which is off the chip, external to the CPU, and counts not only for the CPU but for peripherals and so forth; I'll go over all of that in a bit.

The clock event abstraction has these two callbacks. The low-level drivers register the clock event with the kernel at boot time and tell the kernel, "these are the functions you need to call to program the next event"; the kernel just calls that function, and then the hardware does whatever it needs to do. You can program either a relative or an absolute time, and you also tell the kernel to call this event handler. The beauty of it is that you might have thousands of timers or timer callbacks, but in the end it all comes down to calling that clock event handler, which starts the whole process of executing everything that needs to run.

Clock events have feature flags as well, because the hardware may support different capabilities. There's a periodic feature, which I believe all clock events support: the clock event is capable of going off periodically once it's programmed. There's another called one-shot, where you can say "go off at this time and then stop." We need that one-shot capability if we want to turn off the scheduling-clock interrupt, for example; I'll go over that in a little bit as well.

So, as I was saying, everything finally comes down to that clock event handler running, and that's
That handler is what runs all of the other timers. It runs the hrtimers, which run the POSIX timers queued from user space, for instance. The timekeeping updates I mentioned, which happen from the periodic tick, are also run by that handler; I mean, that handler starts the chain reaction which leads to the timekeeping updates. And then there are the low-resolution timer wheel timers as well. So everything comes down to this clock event structure; it's very important.

To go into some specific examples of clock events: first, the local APIC timer. This is a per-CPU interrupt controller with a timer in it, and it's tightly coupled with a CPU core. By default its clock is actually driven by the external bus, but it has a TSC-deadline mode, which gives gigahertz precision. In TSC-deadline mode, whenever the TSC crosses a certain value programmed by the programmer, it generates an interrupt. So this lets you use the local APIC timer in a high-resolution, gigahertz-precision mode. This diagram shows at a high level how it is organized. The only takeaway I wanted here is that the local APIC is connected to the processor system bus and is driven, by default, at a frequency lower than the CPU's frequency, and you have to use the deadline mode if you want higher precision. Now let's go over the HPET, which is the timer that sits outside the CPU die. It is lower resolution than the local APIC, but it's independent of the CPU; it's outside, so if the CPU turns off, the HPET is still counting. The main advantage of the HPET is that even if the CPU enters aggressive power-management states, it's still ticking, like I said.
That said, the HPET is definitely not something that should be used by default, because it's slow: programming the HPET is slow, and it's also not high precision. So the HPET is not preferred as a clock event unless it is really needed. This is a diagram I created to show how the whole thing is organized. You have a socket with CPU cores in it, each CPU with its own local APIC timer, and you have the HPET sitting outside the socket, in a separate chip. Again, this is Intel-specific; that chip is called the Platform Controller Hub. It used to be called the southbridge, but these days it's the Platform Controller Hub, and it does a lot of other things; one of the things it contains is the HPET timer. And just to show you visually what happens: when the CPU enters a deep idle state, the APIC is no longer able to do its duties. So if there's a timer event queued, we would lose it, because the CPU is never going to be woken up. The HPET is absolutely necessary here if we want to enter these deep idle states where the APIC is no longer working. So that's exactly what I said: the HPET comes to the rescue, and it keeps servicing those timers for all the CPUs that no longer have their own local timers working. This concept is also called the broadcast timer. Different hardware has a different timer that is external to the CPU and keeps track of time, and you can see in sysfs which broadcast timer you have. If you're on Arm or something, I would encourage you to do this after the talk and see what broadcast timer you have, because even those devices have the same problem, where CPU power management makes an external timer necessary.
CPU power management makes us require this external timer to serve the purposes of the one that's internal to the CPU. So one question is: obviously you have one HPET, and multiple CPUs can be in deep idle at the same time, so how can that possibly work? How can one HPET help out multiple CPUs that are in deep idle? This diagram I came up with tells you exactly how that is done. I don't know how much time we have to go over it, but the main takeaway is that there's a CPU mask keeping track of all the CPUs that are in deep idle and need the services of the broadcast timer, the HPET, and the broadcast timer fires repeatedly, as many times as needed, until no CPU needs it. If you think about it, that's the only way to make it work: there's typically only one broadcast timer, and you can have multiple CPUs entering idle. To quickly go over it: when a CPU enters idle, it marks itself in this global variable, the broadcast mask, and then the HPET is armed if it isn't already. At some point the HPET broadcast timer fires on some CPU, and in the HPET's handler, it checks whether there are any CPUs in the mask other than the CPU the handler is running on. If there are CPUs in the mask, it sends IPIs, inter-processor interrupts, to all of them to wake them up. That's kind of where the broadcast name comes from: you're broadcasting these IPIs to fire on all the CPUs that are in deep idle. Finally, it also checks whether the local CPU that was woken up by the HPET is itself in the mask and has events; if it does, the HPET handler directly calls the clock event handler of that local CPU.
Then finally we check whether there's still anything in the mask with events that have not yet expired; basically, are there any future users of the broadcast mechanism? If there are, it reprograms the broadcast timer, the HPET, and the whole process repeats. Hopefully that makes sense; that's the internal algorithm it uses. Some more things to note about the HPET: it's actually not only a clock event, it can also be used as a clock source. As I mentioned, there's this clocksource watchdog that compares the TSC with something, and that something is typically the HPET. It can be used as a stable reference for the TSC, because it doesn't suffer from the frequency-scaling and CPU-idle issues that the TSC has on some hardware. But it is slower than the TSC, as I mentioned, and you have to be careful with how often you access it. In fact, there have been performance issues reported on the mailing list where the clocksource watchdog was too aggressive and switched from TSC to HPET, and then the system had performance problems, because the clock source is really important for performance: you update the time every jiffy and you're always reading it, so you want a fast access. It has to be stable, but it also has to be fast.

Okay, so with that, maybe we can take a few questions. Sure; there is one question I think might be good to answer: how do you make sure the timestamp for each core is the same? So there is a register called TSC_ADJUST, and for every CPU you can program that register to a certain offset, and that offset makes sure the TSCs are all synced. My understanding is that on some hardware that's not even needed: the CPU cores are synchronized in the hardware itself, and I don't know the hardware magic that does that.
On such hardware the TSCs are already in sync and you don't even need to do that, but if it is needed, the kernel does have mechanisms to synchronize them as well, using the TSC_ADJUST register. So maybe one of the CPUs is made the master for managing the time, and if there is a need to sync, I guess it could be done that way? Okay, so the hardware generally keeps track of it. What you said is TSC_ADJUST; I'm just repeating it for people. Yeah, TSC underscore ADJUST. Go ahead and do a git grep on TSC in the kernel repo; I'll put that in the chat, and it will give you all the information. Okay, I put the command in. Yeah, and look at the tsc_sync.c file as well; it does the synchronization there, and you'll see in that file it's actually writing to TSC_ADJUST. The reason for TSC_ADJUST is that if you want to synchronize multiple TSCs at once, you cannot do it atomically, because you have to do it one at a time, and if you do it one at a time it's going to get weird, right? The value you end up with on each might be different, depending on how much time the instructions take to do the write. So TSC_ADJUST is a way to not touch the TSC directly: you just modify the adjust offset and have them all land on the same final TSC value. Somebody is asking: the TSC_ADJUST MSR isn't present on older CPUs; how is that handled? I haven't come across hardware where it's not available; if you look at the kernel commits... I guess the question is how old are we talking about, Roman? I'm not sure about that. At some point it definitely wasn't there; I don't know what generation of CPUs that would be. Okay. So the other question is about the timer tick acting as an entry point into the Linux kernel scheduler; will you be covering that?
Yeah, I will be covering that. Okay, so that will be covered later; I will clear that question for now. On multi-NUMA systems, how will the kernel decide which CPU will be the master for the timekeeping? I guess that's the question. So typically it is the boot CPU that does it. I'm not fully sure how that selection works; there's an algorithm around how it selects which CPU does the timekeeping. It depends on whether the CPU wants to turn off its periodic tick; there are certain configurations where timekeeping is not needed at all, because there's nobody reading the time, like when all the CPUs are idle, so nobody needs to update the time either. There are situations like that, so it's a little complicated; I don't have a straightforward answer on that for the NUMA case, sorry. But it's a good question. Right, so there is another question regarding the HPET broadcast timer when CPUs are in deep idle, what you explained so far: where is the interrupt handled? Does it mean that there must be at least one CPU where the kernel is running? No. When the HPET fires, it might be handled by a CPU that is in deep idle, and the interrupt handling mechanism will wake that CPU up from its idle state. A CPU needs to be running to service not only the HPET but any interrupt, typically. So hopefully that answers the question. Yeah, I think so. Okay, Mayna, if you still have a question, please ask. There is another question; this might not be in the scope of this webinar: I have hardware that exposes some counter functionality, and I have a driver for it, and in that we call devm_counter_add(); will user space use the vDSO for it to be accessed? No, the vDSO is totally different. I did put a link to the vDSO material in there, so please read up on that,
Anish; that might answer your question. Drivers don't get involved; the vDSO doesn't handle the driver part. Correct me if I'm wrong, Bill. Yeah, it's for system calls. It might do some things that a driver does; a TSC read, for example, is done directly by the vDSO. But it is important to remember that the vDSO is not compulsory: the C library actually checks if the vDSO is available, and if it's not, it directly does the system call. Correct, and not all architectures implement all of the system calls in the vDSO; they pick and choose. The commonly implemented ones are the timer ones, but even those are not implemented everywhere. Looks like there is one more question; we are almost at the end of the questions, so that's good news. The question is: is deadline mode automatically used if the TSC is available? Yes, it's automatically used, and that should be the default on all systems. I believe you can change it by boot parameters, but I have not come across even a single system where it's not already in deadline mode. So there is one question asking for clarification; I think you said something must be used to switch down... to the kernel knob in sysfs? I don't know the context of this question, maybe TSC. Hey Gunpal, I can't see the question in the chat, but maybe we can go over it later. Yes, we can do that; come back with that question. Perfect.

So now let's jump into the timer wheel. Before hrtimers were available in the kernel, there was just the timer wheel implementing timer events. The timer wheel basically runs whenever the scheduling clock interrupt runs; the timer wheel gets a chance to run every time you take one of those scheduling clock interrupts. So it runs at the HZ rate, one tick every 1/HZ seconds.
If your HZ value is 1000, then every one millisecond you get to check whether there are any timers that need to run, and so forth. So let's go to the timer wheel design. Now, if you were to design your own timer subsystem, how would you do it? You basically need some sorted list of timers, because timers expire at different points in time, and you have to quickly know which timer is the earliest; you want to go through them in order, earliest first, when you're expiring them. Another requirement is fast insertion and deletion: adding a timer, removing a timer, and expiring it should all be very fast. So this is really a data-structures problem. Now, say you were to use a linked list to keep all the timers together. Can you have O(1) insertion, removal, and expiry with a sorted linked list? The answer is no, you cannot do that with a linked list. If you insert in O(1), that is, you put the timer at the end of the list, then the list is not sorted, so to find the earliest timer you now have to walk the whole list, and that's going to be slow. Conversely, if you want O(1) removal or expiry, meaning you can find the earliest timer in O(1), then the list has to be sorted, so insertion is now going to be expensive. Either insertion or removal is going to be expensive if you just use a linked list. That's the key concept of the timer wheel: we can get both insertion and removal in O(1) by using arrays, trading off space. The other important thing about the timer wheel is that most timers are cancelled before they expire, at least the ones that use the timer wheel. They are mostly timeouts, and a timeout is an error condition most of the time.
If a timeout actually happens, that's when the timer expires, but hopefully a timeout mostly never happens and the timer is removed instead. The timer wheel optimizes for that situation. So, as I mentioned, we want to make these trade-offs, and this is how the timer wheel looks. As I said, you use an array. So how do we get O(1) insertion? Every level of the timer wheel is basically an array. In the first level of the timer wheel, we're at jiffy granularity, the lowest granularity at which timer wheel timers can run, and it's pretty simple: you just find the correct one-millisecond bucket and put the timer into that bucket's list. That's O(1). Expiry is also O(1), because you just go to the bucket that's next in line to expire and expire all the timers in it. But we cannot do this forever: if you have one-millisecond buckets forever, you're going to run out of memory; it would be huge to store all timers in their respective one-millisecond buckets. So at some point we have to do something, and that's where the second level of the wheel comes in, where the buckets are not one millisecond but 64 milliseconds. The timer wheel is basically a partial sorting algorithm: you put things into a 64-millisecond bucket, combining a lot of timers, some of which might expire at 65, some at 70, whatever; they all go into the same bucket. You've kind of sorted them, because you've put them at the next level of the wheel, further away from the first level, but within each bucket they're not sorted. To solve that, what we used to do is, at some point, take all the timers in one of those 64-millisecond buckets and roll them over into the first level.
When we do that rollover into the first wheel, which is also called cascading, we put the timers into their correct one-millisecond buckets, because they're about to expire within the next 64 milliseconds. So it's a partial sort: insertion is still O(1), but we have to do this cascading operation to fully sort it. The second level is partially sorted, the first level is fully sorted. It's a trade-off: you have to do the cascading, but you don't need a lot of memory. You partially sort, so you don't need a lot of memory, and then you rely on the cascading to finally put those timers into their correct one-millisecond buckets. And, as usual with the kernel community, everything I said changed: in 2016 the cascading functionality was removed. The reason is an observation by Thomas Gleixner, who is actually one of the people who wrote most of the timer implementation in the kernel; a lot of people wrote a lot of code, but he's one of the top contributors. He found that cascading is not worth it, because most timers are removed before they expire. If you did all that cascading from higher levels to lower levels, and in the end the timer is removed anyway, then why did you do all that cascading? You never expired the timer, so why move it from one wheel level to another? And that also turned out to be expensive because of dirty cache lines: when you move a timer from one level to another, you have to touch the cache line that contains that timer structure. So the only difference between what I showed you and what it is now is that there's no cascading anymore: timers are put into their respective bucket depending on when they expire, and they're expired in place. So you might wonder, okay, what's the trade-off?
The trade-off is accuracy. With timer wheel timers, the further away the timer is, the less accurate it is, because the higher levels are not fully sorted down to the lowest granularity, and this problem gets worse the further out you go. If you queue a timer wheel timer that's four hours away, you can actually be off by about 30 minutes, which is huge. This is something I'm also currently looking into; there are users of timer wheel timers that actually expire after a long period of time. With that, I was going to go into the scheduling clock interrupt. We're about nine minutes past; what kind of time are we looking at, how long do we have left? We have about 20 minutes left for the presentation, and then we can do 15 minutes of questions after that. Okay, that sounds good; I think we're good. There is one question in the Q&A: what happens if the periodic timekeeping update is missed for many jiffies while the hardware counter keeps advancing, for example if a CPU is stuck in stop-machine for a long time? Nothing happens; it just means that the delta is longer. It's not like it has to update on schedule; there can be delays and things like that. Obviously, if somebody reads the time in the meanwhile... well, I think even that shouldn't be a problem. I don't immediately see an issue unless it's a really, really long time; then maybe there's a problem, but as such, the timekeeping update is not that critical to run exactly on time. Okay, that sounds great. So there is one question, I think related to the cascading, which isn't even there now, about using a min-heap with log n insertions. When you talked about inserting timers, I think it might still be applicable. Yeah, even though we're not doing cascading anymore, the wheel is just one implementation; it's certainly possible to do that. But insertion and deletion have to be really fast.
With a min-heap, when you remove the earliest timer, you have to do a heap rebalance; what do you call it, sift-up, sift-down, you have to push nodes up and down the heap, and that's logarithmic. But yeah, this is just one design, and it's certainly possible that there's a better way to do it. Sounds good; okay, that's all we have, questions-wise.

Okay, so let's jump into the scheduling clock interrupt. I've been talking all around this, so now it's time to go deeper. The primary function of the scheduling clock interrupt is basically preemptive multitasking: you want the scheduling clock interrupt to multiplex CPU time between the multiple tasks that need to run. And you can configure this HZ rate, which is basically a trade-off between overhead and responsiveness: if your HZ is really high, you have to interrupt a lot more, but you also get more opportunities to check whether something else needs to run. If there are tasks waiting for CPU, having a higher HZ means you will more often check whether something else needs to run and make a decision. So basically this interrupt is used to make that preemption decision. The scheduling clock interrupt also updates the global variable called jiffies, which is the coarse-grained way of tracking how much time has passed since boot. But it's coarse-grained; it's not like a clock source, it's just a global variable that's incremented every time the scheduling clock interrupt goes off. And it's not updated by every CPU's scheduling clock interrupt; it's only updated by the CPU that does the timekeeping update. So that's another nuance: jiffies is global. I wanted to skip these slides, but I think we're doing well on time, so let me go over this. This is a quick diagram of how the scheduling clock interrupt is stopped.
If the scheduling clock interrupt keeps going off while we're idle, that wastes power, so there is a mechanism to turn it off: it's called NOHZ. When a CPU enters idle and the cpuidle governor decides it wants to stop the scheduling clock interrupt, the NOHZ code first checks whether there is any pending timer event that will go off before the next periodic tick would. If there is such an event, there is no point in turning off the scheduling clock interrupt, because the next event isn't the scheduling clock interrupt anyway; it's that timer that's about to go off. In that case we don't turn it off. However, if that event is after the next scheduling clock interrupt, we program that event into the clock event device instead of the tick, so essentially we've turned the tick off. I drew that as a separate block, "turn off the periodic tick", but that's what happens. So this is how the kernel decides whether it can, and should, turn off the tick. And that leads me to my next topic, which is deferrable timers. Certain timers don't want to be involved in that decision; they don't want to stand in the way of the scheduling clock interrupt being turned off, because they don't need to run immediately; they can run at a much later time. So in the step where NOHZ looks for the next timer event to run, deferrable timers are completely skipped; that's the point I was trying to make there. These deferrable timers live in their own timer wheel, a separate wheel that holds only the deferrable timers. When you set up your timer wheel timer, you actually pass the deferrable flag to tell the kernel: don't worry about running this timer event right away if you have to turn off the tick.
If the kernel needs to turn off the scheduling clock interrupt, it's free to do that. So that's basically what I wanted to say here: when the next non-deferrable timer event goes off, we run the deferrable ones as well at that time. So basically we run them later, and we let the CPU go to sleep, undisturbed by these timers if they're the only things that need to run. And this is the softirq code that... Joel, are there any examples of deferrable timers? Yeah, I was going to get to that; I can go over it first, since it was the next slide. An example of a deferrable timer is the idle timer of worker threads, or worker pools, in the kernel. There's this idle_worker_timeout function that runs to delete idle worker threads, and it's not an urgent thing: it's programmed to go off, but it doesn't need to run soon. There are only about two or three examples; there's another one in memory management, and I forget the third. But this is an example. I also just wanted to show you the code where the deferrable timers run. This is the softirq handler that executes the timer wheel timers, and here you can see that we run not only the regular timers but also the deferrable ones, and we only do so if NOHZ is enabled, because if NOHZ is not configured on your system, there are no deferrable timers; there's only one timer wheel. I wanted to show that distinction, but these are just details; the concept is what's important: there are two timer wheels, one for deferrable timers and one for the regular ones, and the deferrable timers are not serviced when the scheduling clock interrupt needs to be turned off, because that is prioritized over these timers. So is there any mechanism that ensures that deferrable timers won't starve? Well, there's nothing like that.
Yeah, there's no mechanism like that. If it's a deferrable timer... for example, I just did a git grep and found one of them, a logging-related timeout, which is a deferrable timer. So I guess in this particular case it doesn't matter if it doesn't run? Yeah, it's basically a housekeeping operation where we can just not run it for long enough. That's why you don't see too many of them, because, yeah, it might never happen. Yeah, exactly. And I'm planning to convert some RCU ones into deferrable as well; I don't know how feasible it is, but I'm planning to take a look and see if we can do that and save some power.

Okay, so now let's move on to high-resolution timers. So far we've only covered the timer wheel; high-resolution timers are a completely different beast. In the hrtimer world, all the timers have nanosecond granularity, and they're organized in an RB-tree; somebody mentioned heaps earlier, and these are balanced trees ordered by their expiry. There is a separate RB-tree per CPU and per clock ID, so for every clock ID that you're using, say for your POSIX timer, the timer ends up in one of these RB-trees depending on what clock ID you used. And hrtimers have this concept called slack. Slack is similar in spirit to deferrable timers, but it's kind of different. The idea of slack is that you can tell hrtimers that you want something to expire at a certain nanosecond, but to actually give it some more slack: if the kernel needs to delay running it, it can do so up to a slack amount of time later. That initial time is the soft expiry, and the soft expiry plus that slack delta is the hard expiry. These are also called range timers: you can give hrtimers a sort of time range.
This lets them reduce wakeups and save power. I want to show this with an example. Here's a normal hrtimer without slack: at time 10 seconds (the scale here is seconds), we queue a hard-expiring hrtimer to run at 30, and at 30 seconds we run that hrtimer; everything is simple. Now let's see what happens when we queue a slack-based timer. At 10 seconds we queue a timer that soft-expires at 25 and hard-expires at 50. At 25, no interrupt happens; the hrtimer just isn't run. Then at 50 we absolutely have to run it, because it hard-expires at 50. Now let's see what happens when we mix them. At 10 seconds we queue t1, soft-expiring at 30 and hard-expiring at 50, and at 20 seconds we queue t2, which has no slack: it hard-expires at 40. At time 30, as on the previous slide, the soft-expiring timer doesn't actually run. Then at 40 the second timer runs, because it hard-expires at 40, but we also check whether there are any slack timers that have already soft-expired, and if we see any, we run those as well. So essentially we didn't delay t1's execution all the way to 50; we delayed it to 40, because t2 expired at 40. It's a bit like procrastination: we procrastinated running t1 until t2 ran, essentially letting the CPU stay idle and saving power. To go a little more into the RB-tree for hrtimers: it's ordered only by hard expiry. Again, for every clock ID and every CPU you have a separate red-black tree, which is a balanced binary tree, ordered by hard expiry, and the earliest hard expiry is obviously the leftmost node of the tree.

And this is the algorithm for expiring hrtimers. We find the earliest one, the leftmost node of the tree, and check its soft expiry time against the current time. If it has soft-expired, we remove it from the tree and run the callback, and then we again check the new leftmost node. So interestingly, we use both hard expiry and soft expiry in this algorithm: the tree is ordered by hard expiry, but the expiry check uses soft expiry, and that has some side effects. Just note that for normal hrtimers, hard expiry and soft expiry are the same. Now let me show you an example of why a soft-expired timer may not always execute when a hard-expired timer runs. It's basically because of the ordering: we order them by hard expiry, which means the soft expiries are not ordered. Let me go through the example. Here the third and fifth timers have slack: the third soft-expires at 9 and hard-expires at 20, and the fifth is similar. Say the second timer is currently expiring and the time is now 10 seconds. Since the third timer's soft expiry is 9, we expire it as well, because it has already soft-expired. But timer five has also soft-expired, and it is not considered, because we break out of the loop from the previous page due to timer four. So even though there is a soft-expired timer hidden in the right part of the RB-tree, we don't expire it; we never even look at it, because we break out of the loop on that condition. That's just something to keep in mind: just because an hrtimer interrupt went off and the CPU woke up doesn't mean that all the hrtimers with slack will run; it's not guaranteed. So the main takeaways with hrtimers: it's high resolution, versus the timer wheel's O(1) insertion and removal, with obviously higher overhead.
So the main takeaways with hrtimers: it's high resolution versus the timer wheel, but insertion and removal obviously have higher overhead because it's using an RB tree. This is needed for real-time workloads, because real-time workloads have microsecond-granularity needs, and the 1/HZ jiffies granularity of the timer wheel is not going to cut it. As I mentioned, hrtimers have different RB trees for the different POSIX clocks, and further, all the RB trees are duplicated for each CPU, so it's per-CPU. Timer slack is a feature of hrtimers that can be used to save power by reducing the number of interruptions of the CPU, essentially by coalescing the timers. And then lastly, soft-expired timers may not always run even if they could; that's by design. Also, hrtimers are not deferrable, unlike the timer-wheel ones: you have slack, but you don't have anything like deferrable timers with hrtimers.

For time reasons I'll skip this slide. Or, let's see, there are a few questions. The first one is: does the mod_timer API use the timer-wheel timers? mod_timer is modifying a timer, so yes, it does use the timer wheel; the timer_list timers are what's used there. Then there is another one, I think more of a comment: timer slack is a per-task value, not a per-timer value, so setting the slack to 100 milliseconds for the whole process is often not usable, though probably there are other use cases. Yeah, that is true; that is one of the weaknesses of it. I've brought it up in the past as well: you have to set it per process and not per timer, and it's not as useful. And there could be better ways to do this coalescing. I would certainly encourage everybody to look at the code and see if you can improve it, and use this information as a starting point for diving into it.
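The per-task slack discussed above is set with prctl(2). Here is a quick Linux-only sketch using ctypes; the constants are the upstream `<linux/prctl.h>` values (PR_SET_TIMERSLACK is 29, PR_GET_TIMERSLACK is 30), and the slack is given in nanoseconds.

```python
# Set the calling task's timer slack via prctl(2) and read it back.
# Linux-only; this affects select()/poll()/nanosleep-style waits for
# the whole task, not any individual timer.
import ctypes
import sys

PR_SET_TIMERSLACK = 29   # from <linux/prctl.h>
PR_GET_TIMERSLACK = 30

def set_timer_slack_ns(ns):
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.prctl(PR_SET_TIMERSLACK, ns, 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_TIMERSLACK) failed")
    return libc.prctl(PR_GET_TIMERSLACK, 0, 0, 0, 0)

if sys.platform == "linux":
    print(set_timer_slack_ns(100_000))  # ask for 100 us of slack
```

Note how this illustrates the weakness raised in the question: the call takes no timer argument at all, only a task-wide value.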
Another thing we're looking into, where I work, is actually turning high-res timers off completely, and we actually find that we save more power by doing coalescing with the timer wheel. If you turn off high-res timers, the hrtimer API still works, but the hrtimer RB trees are all expired from the same interrupt that handles the timer wheel. So it gives hrtimers 1/HZ granularity if you turn off high-res timers in the kernel. So it depends on the use case, say if power saving is a higher priority for a use case.

Okay, so another question: for new drivers, is the mod_timer mechanism preferable or recommended over the hrtimer mechanism? Should one mechanism be preferred over the other, and if so, when and why? It depends on the use case. If you need high resolution, then yes, you want to use hrtimers, but where possible you want to use the low-resolution timers, because those are friendlier for NO_HZ as well, where you don't need to keep interrupting the system and waking it up. High-res timers are high granularity, so they are less coalesced; they can interrupt the system at very fine-grained points in time. If your application doesn't need that, you probably shouldn't use them. In fact, in the RCU subsystem we use the low-resolution timers quite a lot, because we usually just need to do some kind of housekeeping operation and don't need to wake up at a precise point in time. You might actually suffer in performance if you are using high-res timers a lot; it's definitely higher overhead.

There is a question about that: any problems with a very large number of hrtimers in a system, any ways to mitigate it? That question isn't
very clear; if you can type it again, that would be great. Maybe, Joel, you can understand it: any problems with a very large number of hrtimers in a system, any ways to mitigate it? Well, there's no real way to mitigate having a large number; it's what you have. But there is a default slack. I believe you cannot even set a slack of zero; there's a minimum slack that all hrtimers are programmed with, though I need to look into whether the default is nonzero. That essentially means you can't have an hrtimer go off every nanosecond; it will still be coalesced through that, so that's one mitigation I can think of. I think the question is: what happens if you queue a ton of hrtimers that go off very often? And I haven't come across anybody reporting a problem like that. That would be a good test; maybe a kselftest. Sure, we'd be very happy to take a kselftest. We do have a bunch, but this could be another one.

There is another question here: is there any actual measurement or kernel accounting of the effective timer precision (not resolution) in the regular, non-RT kernel? Does it boil down to jiffies? So I have done measurements of this. Let me understand: this is timer precision, right. The kernel doesn't measure it, but there are applications that measure the precision. If you run the cyclictest application, for example, it will measure the latency of timers: it will tell you how small a granularity your timer has, because it expects to go off at a certain time and there's always some kind of delay. If you don't have high-res timers, cyclictest will show you that the latency of your timer is so-and-so. So there are applications like that that measure it.
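A toy version of the measurement cyclictest does can be written in a few lines: request a short sleep and see how late the wakeup actually is. This is only a sketch of the idea; real cyclictest pins threads, uses clock_nanosleep with absolute deadlines, and reports full latency histograms.

```python
# Measure wakeup overshoot, cyclictest-style: how much later than
# requested does a 1 ms sleep actually return? With high-res timers
# off, or on a loaded non-RT kernel, this overshoot grows noticeably.
import time

def worst_wakeup_latency_ns(sleep_ns=1_000_000, iters=50):
    worst = 0
    for _ in range(iters):
        t0 = time.monotonic_ns()
        time.sleep(sleep_ns / 1e9)                 # request a 1 ms sleep
        late = time.monotonic_ns() - t0 - sleep_ns  # overshoot in ns
        worst = max(worst, late)
    return worst

print("worst wakeup latency:", worst_wakeup_latency_ns(), "ns")
```

The exact numbers depend entirely on the kernel config and load, which is the point of the answer above: precision is something you observe from user space, not something the kernel accounts for.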
But the kernel doesn't really care; it doesn't measure it on its own or anything like that. I think that's all the questions we have.

Okay, so just to look at the time, I want to see how many slides I have left. Yeah, I'm almost toward the end. I was going to go into the periodic tick, but I think at this point people might be too overloaded with information; I certainly have no shortage of material. I was going to go over the internals of the scheduling-clock interrupt and how NO_HZ works. There's probably another 30 minutes of content if I wanted to go over it. So maybe we can open it up for questions, and see if people have more questions on the material covered so far, because I know we have already answered some. Okay, let's do that. Just to say where I'm stopping: we have covered hrtimers, and that's where we're stopping right now.

Joel, do you plan to give us the slides too? Yes, they will be uploaded to our webinar site. Everything's going to be there, and you can feel free to reach me anytime for questions; I'm very happy to discuss this stuff with anyone who wants to talk about it. So, do we have any new questions? Somebody wants you to finish up the remaining slides. We only have about 10 more minutes, though, so you won't be able to finish all of the content. There's only one person that said that; that one person could be like, "how about you do 10 more push-ups?" Yes, let's do it. Thank you. Nice comments are flowing in; somebody is saying these are great sessions, a lot of information. Joel, this is a lot of good information; everybody kept up, a lot of questions. So if there are no questions, maybe, if you want, you could go over VDSO
quickly; that might be a nicely scoped short section. Yeah, sure. So, VDSO. As I mentioned, some timekeeping syscalls are available via the VDSO, like clock_gettime. This is a huge performance benefit, and there are public benchmarks that show it. Basically, the VDSO is an ELF object, similar to a dynamic library. The difference is that the kernel loads it: when it loads the ELF of the program that is about to run, it also loads this ELF binary object into memory and maps it into the user address space of that program. Now, this ELF object also has a symbol table in it. The symbol table is used for locating the addresses of the different functions in the text section of the VDSO ELF object, because we need to know where those functions are; the dynamic symbol table has that information, and the kernel has populated the addresses of the VDSO functions into that symbol table.

One thing to mention is that a VDSO mapping not only has text (code), it also has a data page, and in the case of clock_gettime, that data page has all of the information that the VDSO version of clock_gettime needs to calculate the time. The formula for that is basically: it needs to know what the time was at the last timekeeping update, and it adds the delta between that update and now, scaled by a slope. Long story short, all that information is in the data page, and the functions in the text section of the VDSO refer to this data page and pull out all the information they need to run.
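The "last update plus scaled delta" formula just described is, in the kernel's fixed-point form, ns = base_ns + ((cycles_now - cycle_last) * mult) >> shift, where mult and shift together encode the slope (nanoseconds per clocksource cycle) and NTP slews the clock by nudging mult. A worked sketch, with made-up numbers:

```python
# The clock_gettime() fast path computes, in fixed point:
#   ns = base_ns + ((cycles_now - cycle_last) * mult) >> shift
# All values below are invented for illustration.
def vdso_read_ns(base_ns, cycle_last, cycles_now, mult, shift):
    return base_ns + (((cycles_now - cycle_last) * mult) >> shift)

# e.g. a 1 GHz counter: 1 cycle = 1 ns, so mult = 1 << shift exactly
shift = 24
mult = 1 << shift
t = vdso_read_ns(base_ns=1_000_000, cycle_last=0, cycles_now=500,
                 mult=mult, shift=shift)
print(t)   # -> 1000500: 1 ms base plus 500 cycles of 1 ns each
```

The fixed-point trick (multiply then right-shift) avoids any division on the read path, which is part of why the VDSO read is so cheap.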
And I believe that data page is constantly modified by the kernel, because the timekeeping update is being done by the kernel; the VDSO is only reading the time. So every time the timekeeper updates, the kernel will also update the data page, so that whoever is using the VDSO to read the current time will be able to do its job.

And here's a diagram I pulled out of the ARM documentation, which goes into a little more detail about how the VDSO is loaded. First, the kernel's ELF loader loads the ELF of the program that is about to run, but while it does that, it also maps the VDSO pages into the address space of that program, and there's this thing called the auxiliary vector where it stores a pointer to that VDSO mapping. Then we switch to user space, and the dynamic linker looks at the auxiliary vector, gets the pointer to the VDSO ELF mapping, makes a note of its location, and hands off control to the C library. Now, the dynamic linker also provides certain helpers that the C library can call to look up the VDSO ELF image's dynamic symbol table. So when the C library initializes, it looks up the locations of all those symbols and sets some global variables to those values. For gettimeofday, it gets the address of the __vdso_gettimeofday symbol and stores it in a global function pointer, and from there on, anybody who calls gettimeofday from user space ends up calling that VDSO function.

So essentially the kernel has made it possible to execute, in user space, custom code that the kernel decided on: the kernel decided "this is what you're going to run if you want to do gettimeofday", and then user space runs it. So it's very kernel-centric, kernel-controlled, and user space doesn't need to do anything. It's fully transparent to the user as well, because they just use the C library.
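You can actually see the VDSO mapping described above in any Linux process. On many architectures the data page shows up as a separate `[vvar]` mapping next to the `[vdso]` text mapping; a quick Linux-only sketch:

```python
# List the VDSO-related mappings of the current process. On Linux,
# /proc/self/maps names the VDSO text mapping "[vdso]" and (on many
# architectures) its data page "[vvar]".
import sys

def vdso_mappings():
    with open("/proc/self/maps") as maps:
        return [line.split()[-1] for line in maps
                if "[vdso]" in line or "[vvar]" in line]

if sys.platform == "linux":
    print(vdso_mappings())   # e.g. ['[vvar]', '[vdso]']
```

Every process gets these mappings without asking for them, which is the "fully transparent" part: the kernel put them there at exec time via the auxiliary vector.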
The user is without knowing that, under the hood, the C library is actually using the VDSO dynamically for performance. The user doesn't need to care about the VDSO at all; they just use the C library, and the C library will decide whether it's going to make the syscall or call into the VDSO. Does that make sense? It makes the switch. By the way, if I remember correctly, there is a flag with which you can turn the VDSO off and on, so you could run a process with and without the VDSO if you want to play with it; it's a lot of fun playing with it that way too. Oh, is it a compiler flag? No, I think it's a kernel flag. That's kind of fun to play with if you're curious: you could run benchmarks with it. Pick a process that does lots of time calls back to back in a loop, then start one run with the VDSO and one without, and see what kind of improvement you get. Yeah, that's interesting; I might do that too. It's a fun experiment to do when you want to figure out which system calls do better direct versus through the VDSO.

Sure. Are there any... I was planning to maybe add some tests to the kernel sources for timers, especially for some of this timer-wheel stuff I mentioned, like some of the issues with the accuracy of the timer wheel. Yeah, we do have a timers section in the selftests, so you can just go ahead and add that, and if you want somebody to help you, we can talk offline. Okay, we'll do that. Cool. So, any other questions? No? Okay, thanks for the nice comments. You know, I hope more people start contributing to the kernel. When I look at all this stuff, I see a lot of
opportunity to make changes, to explore the design of a lot of these things, and to make improvements. So I would certainly encourage everybody to do that, if it is something people are interested in. Right, and we're kicking off another mentoring session in March, so I might reach out and ask you if you have any ideas, like tests you want written for the mentees. Perfect, yeah, let me know. Perfect, okay, cool. Thank you, Joel. All right, thank you. We'll kick it back to Candice. Perfect, thank you, Joel and Shuah, for your time today, and thank you everyone for joining us. As a reminder, this recording will be on the Linux Foundation's YouTube page later today, and a copy of the presentation slides will be added to the Linux Foundation website. We hope you join us for future mentorship sessions. Have a wonderful day.