Today we're going to talk about reworking the Zephyr clock control subsystem. My name is Moritz Fischer. I work for Google, where I lead a team that works on firmware for in-house ASICs. I've been around in open source for a while: I've worked on the Linux kernel and maintained parts of it, and I've worked on U-Boot and a bunch of other projects. I started working on Zephyr about two years ago and built a team around that at Google. I hate reinventing the wheel, which is why I'm here. Also, we're hiring, so if you're looking for an interesting job making Zephyr run in the data center, grab me in the hallway later.

Why do I care about Zephyr, and why do I care about clocks? Making ASICs is super hard, super expensive, and takes forever. You try to reuse IP wherever possible, because once something is silicon proven it's really difficult to get your hardware people to change it. You make something that works and suddenly everyone wants to integrate it: you have a good idea or a subsystem that you built, and all the other groups say the next ASIC should also have this. You end up with a sort of Lego system of building blocks. ASIC X might have UART A, and ASIC Y might have UART B but a different SPI controller. ASIC X might have CPU A, and ASIC Y might have two CPU Bs. You compose your ASIC from building blocks, and I really don't have time to rewrite all my firmware every time. I think that's not where the engineering effort should go.

If you look at the typical integration differences you get when you put together an ASIC, you have things like bus width, and we saw a lot of good examples of that in the previous talk about system devices and how you stitch things together. You have different bus types, and you have different IP revisions, which are almost the same but not quite.
You have different resets, so the reset sequencing of the IP might be different. But the thing I'm really going to talk about today is clocking.

The example I put together here is just to illustrate the concept of producer and consumer, or provider, or source and sink. There are different names you could use for it. Essentially every IP or piece of hardware needs a clock to run; it's what makes your flip-flops toggle. Usually you have something that produces a clock: an oscillator of some form that might be outside of your chip, or inside of your chip you might have a PLL. It might be a fractional PLL, which divides and multiplies, or a fixed PLL, which just has a fixed frequency. The thing to take away from this slide is that you have something that produces a clock and something that consumes a clock. In this case the UART is the consumer. The driver might need to know the clock rate that comes out of the PLL to calculate a divider, for example, or there might be a case where the consumer driver needs to turn on the clock first before it can start using a different block in your chip.

Ultimately a clock API should be very simple; it really doesn't need a lot of different operations. I'm going to skip over the dependency handling, because that quickly gets too complicated for this talk. I'm going to focus on the simple operations, like on, off, set rate, get status, and get rate, the things a consumer driver cares about: turn on the clock, turn it off, are you running, what's your status, do you maybe have an error, and what's the clock frequency you're running at?
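As a rough sketch of the operations just listed, the consumer-facing surface could be as small as the following. This is not Zephyr's actual clock control API: every name here (`clk_ops`, `clk_producer`, `uart_compute_divisor`, the toy PLL) is hypothetical, and the 16x oversampling divisor is only an assumed example of a consumer that needs the rate.

```c
#include <stdint.h>

/* All names below are hypothetical; this is not Zephyr's clock_control API. */

enum clk_status { CLK_STATUS_OFF, CLK_STATUS_ON };

struct clk_producer;

/* The handful of operations a consumer driver cares about. */
struct clk_ops {
	int (*on)(struct clk_producer *p);
	int (*off)(struct clk_producer *p);
	enum clk_status (*get_status)(const struct clk_producer *p);
	int (*get_rate)(const struct clk_producer *p, uint32_t *rate_hz);
	int (*set_rate)(struct clk_producer *p, uint32_t rate_hz);
};

struct clk_producer {
	const struct clk_ops *ops;
	enum clk_status status;
	uint32_t rate_hz;
};

/* A toy PLL-like producer that keeps its state in the struct itself. */
static int pll_on(struct clk_producer *p) { p->status = CLK_STATUS_ON; return 0; }
static int pll_off(struct clk_producer *p) { p->status = CLK_STATUS_OFF; return 0; }
static enum clk_status pll_get_status(const struct clk_producer *p) { return p->status; }

static int pll_get_rate(const struct clk_producer *p, uint32_t *rate_hz)
{
	*rate_hz = p->rate_hz;
	return 0;
}

static int pll_set_rate(struct clk_producer *p, uint32_t rate_hz)
{
	if (rate_hz == 0U) {
		return -1; /* a zero rate makes no sense for this toy PLL */
	}
	p->rate_hz = rate_hz;
	return 0;
}

static const struct clk_ops pll_ops = {
	.on = pll_on,
	.off = pll_off,
	.get_status = pll_get_status,
	.get_rate = pll_get_rate,
	.set_rate = pll_set_rate,
};

/* Consumer: turn the clock on, read its rate, derive a 16x oversampling divisor. */
int uart_compute_divisor(struct clk_producer *clk, uint32_t baud, uint32_t *divisor)
{
	uint32_t rate_hz;
	int ret;

	if (baud == 0U) {
		return -1;
	}
	ret = clk->ops->on(clk);
	if (ret != 0) {
		return ret;
	}
	ret = clk->ops->get_rate(clk, &rate_hz);
	if (ret != 0) {
		return ret;
	}
	*divisor = rate_hz / (16U * baud);
	return 0;
}
```

The consumer only touches the ops table, so it does not need to know whether the producer is a PLL, an oscillator, or something else.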
So if we look at how things are today in Zephyr, we have roughly three types, and it's a bit entangled. There are fixed clocks, which are described in devicetree. For those we don't use the clock API at all; we just grab values directly from the DT using macrobatics. I'll have slides that go into a bit more detail after this one. Then there are dynamic clocks, which I call type A. It's a name I made up, but I need to distinguish them. For type A we pass opaque hard-coded data from the consumer to the producer, which tightly couples the consumer and the producer, and that's generally not what you want from an API. Type B is similar, but in that case we assemble the opaque data in the consumer driver, which tightly couples the producer's devicetree binding and the consumer driver. Again it's this tight coupling that makes reusing things across ASICs really difficult, because in both cases the consumer driver, for example the UART, needs to know where its clock comes from, and that doesn't scale.

For fixed clocks this is roughly how it works. Top right you have a devicetree; I hope it works with the colors. Essentially your producer, which is clock zero in this case, has a clock frequency, and we use a macro to pull that clock frequency out of the devicetree and put it into the config structure we have for our IP. This is nice because it's all compile time, so there's really no overhead, and you can to some extent swap producers as long as they define a clock-frequency property in the devicetree. You can see how this is limited if you have things that change at runtime, which is often the case for us building ASICs. We have other hardware blocks, like a sequencer, or different fuse settings that result in different clock rates, and we actually need runtime information, so that one doesn't really work for us.
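The compile-time pattern for fixed clocks can be sketched roughly like this. Real Zephyr drivers use the generated devicetree macros; the `DT_CLK0_CLOCK_FREQUENCY` define below is only a stand-in for what the generator would emit, and the struct, macro, and address are made up for illustration.

```c
#include <stdint.h>

/*
 * Stand-in for a generated devicetree value: the clock-frequency
 * property of the producer node. The macro name is hypothetical.
 */
#define DT_CLK0_CLOCK_FREQUENCY 24000000U

/* Per-instance, read-only configuration of a hypothetical UART IP. */
struct uart_config {
	uintptr_t base;
	uint32_t clock_hz;
};

/* The frequency is baked in at compile time; no clock API call is involved. */
#define UART_CONFIG_INIT(base_addr)                       \
	{                                                 \
		.base = (base_addr),                      \
		.clock_hz = DT_CLK0_CLOCK_FREQUENCY,      \
	}

static const struct uart_config uart0_cfg = UART_CONFIG_INIT(0x40001000U);

uint32_t uart_get_clock_hz(const struct uart_config *cfg)
{
	return cfg->clock_hz;
}
```

The value never changes after the build, which is exactly why this approach has no overhead and also why it cannot reflect a rate that a sequencer or fuse setting selects at runtime.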
Then we have the type A clocks. Here in the devicetree I just have the device, which is the consumer, and it references another node, which is type A, via clocks. Those use the API with hard-coded SoC-specific data. This example is made up, but examples like it do exist: you call clock_control_on() with a value you cast to void *, which is what clock_control_subsys_t is behind the scenes. In the typical place where you initialize your config struct, you get the clock controller reference. The second part, the SoC-specific data, is again coupled to the producer, so you can't just reuse that driver with a different producer, and that makes reuse difficult.

Type B is similar. Again you have your device, it has a reference to the clock producer, and it might have some other data encoded in devicetree, in this case the 10. You pull it out, still at compile time, using the cell name, which is encoded in the binding. The next slide shows how it all fits together. You have three pieces. Marty had a nice picture of how the devicetree gets generated as part of the config, but mine is less nice. On the top left you have a devicetree binding for the producer, which says "I have that many clock cells", meaning that next to the reference to the clock producer there's that much extra data you need to pass. The maximum I found in the tree is 3, which I think is a bit excessive. Then you have the cells, which give a name to each of those extra pieces of info you pass to your producer. In your devicetree, top is again the producer and bottom is your UART, for example. You can reference the clock producer with the phandle, and then you have the extra data, which is 10, which then together with the binding, using the generation scripts, gets translated into a
header. We don't have to go into all the details, and I don't know if you can actually see it here, but essentially those are the cells that show up if you have those properties, and that depends on the number of clock cells defined. So that's how it currently works, and it's not very reusable.

About a year ago I took a look at how we could fix that, and I put up a pull request. The first thing I thought needed to change is that the fixed clock needs an actual driver, so that the fixed clock behaves like every other clock and you have the typical clock control APIs you can call. clock_control_on() would just be ignored. clock_control_off() you'd either ignore or return an error; I think ignoring is the correct way to deal with it. You'd have a get status that always reports on, because it's a fixed clock that's always on and always running, and a get rate that just returns the clock frequency from the devicetree. Once you do this you've already won something, because now you can use fixed clocks with drivers that expect clocks that support clock control. But you're not completely there yet.

I think what also needs to happen is that we rethink the API a bit. This is what I copied out of the tree; that's roughly the API we currently have. You have a struct device, which is your clock producer, and then you have the metadata, the extra data that's opaque. clock_control_subsys_t, again, is just a void *, and you just pass something. The producer driver needs to know what that is, and the consumer driver also needs to know what that is to actually create that data. When you think about what you want to do with a clock, going back to the slide with the operations, you really just want to turn it on and off. You don't want to have to care about what's on the other side of the API. If you look at other projects like U-Boot or Linux, they have a concept of a struct clock, and you try to adapt that to what
we're trying to do here. The API would change from taking your producer plus something opaque to just dealing with a struct clock: you say turn on this clock, turn off this clock, and the driver can just use that without caring about the details. This struct clock would encapsulate all the info for a given clock, which is the producer plus the extra info, which I hid in the struct clock DT spec. Now you're saying that's cheating, because I'm not telling you what's in there. We're going to look at that next.

One way to do that is to just define cell 0, cell 1, cell 2. Again, I looked at the tree and 3 is currently the max, so for the sake of the argument let's assume we put three values into the clock DT spec. Our struct clock now becomes a pair: the reference to the producer plus one of those DT specs. Then we add a new macro, and again we can refine that to some extent, that pulls the phandles and assembles a struct clock from them. That looks great. Now you're asking, what's that? That doesn't exist, you're making stuff up, right? So I did modify things to make that work. I played around a bit with the gen_defines script, and essentially my idea was that we could have aliases for each of those. You can see down here that for each cell that exists in the binding we generate a generic clock cell 0 or generic clock cell 1 alias, and then you can use the helper we have up there. That way you can assemble a struct clock with all the data you need for a clock to be used, and you can do all of that at compile time.

There's room for improvement, obviously. Otherwise, for every clock we now basically waste three uint32s, or four, depending on the largest number of clock cells. So one idea is that when generating this we could keep track of the maximum number of clock cells in our system and generate the structure accordingly, so we don't have to deal with or store extra cells.
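To make the shape of that proposal a bit more concrete, here is a minimal sketch of a struct clock that pairs a producer reference with a fixed-size cell array, plus a fixed-clock driver and a gated producer behind the same calls. It is an illustration under assumptions, not the code from the pull request: `struct clock`, `struct clock_dt_spec`, `clock_on()`, `CLOCK_MAX_CELLS`, and both toy drivers are hypothetical names, and using cell 0 as a gate bit index is an invented convention.

```c
#include <stdint.h>

/* Hypothetical names throughout; a sketch of the idea, not the actual patch. */
#define CLOCK_MAX_CELLS 3

/* Extra per-clock data from devicetree, generic instead of SoC specific. */
struct clock_dt_spec {
	uint32_t cells[CLOCK_MAX_CELLS];
};

struct clock;

struct clock_api {
	int (*on)(const struct clock *clk);
	int (*off)(const struct clock *clk);
	int (*get_rate)(const struct clock *clk, uint32_t *rate_hz);
};

/* Stand-in for the producer device. */
struct clock_dev {
	const struct clock_api *api;
	const void *config;
};

/* Producer reference plus the cells: all a consumer has to hold. */
struct clock {
	const struct clock_dev *dev;
	struct clock_dt_spec spec;
};

static inline int clock_on(const struct clock *clk) { return clk->dev->api->on(clk); }
static inline int clock_off(const struct clock *clk) { return clk->dev->api->off(clk); }
static inline int clock_get_rate(const struct clock *clk, uint32_t *rate_hz)
{
	return clk->dev->api->get_rate(clk, rate_hz);
}

/* Fixed clock: on/off are ignored, the rate comes from configuration. */
struct fixed_clock_config { uint32_t rate_hz; };

static int fixed_on(const struct clock *clk) { (void)clk; return 0; }
static int fixed_off(const struct clock *clk) { (void)clk; return 0; }
static int fixed_get_rate(const struct clock *clk, uint32_t *rate_hz)
{
	const struct fixed_clock_config *cfg = clk->dev->config;

	*rate_hz = cfg->rate_hz;
	return 0;
}

static const struct clock_api fixed_clock_api = {
	.on = fixed_on, .off = fixed_off, .get_rate = fixed_get_rate,
};

/* Gated producer: cell 0 is interpreted as a bit index in an enable register. */
struct gate_config { uint32_t rate_hz; uint32_t *enable_reg; };

static int gate_on(const struct clock *clk)
{
	const struct gate_config *cfg = clk->dev->config;

	*cfg->enable_reg |= (1U << clk->spec.cells[0]);
	return 0;
}

static int gate_off(const struct clock *clk)
{
	const struct gate_config *cfg = clk->dev->config;

	*cfg->enable_reg &= ~(1U << clk->spec.cells[0]);
	return 0;
}

static int gate_get_rate(const struct clock *clk, uint32_t *rate_hz)
{
	const struct gate_config *cfg = clk->dev->config;

	*rate_hz = cfg->rate_hz;
	return 0;
}

static const struct clock_api gate_clock_api = {
	.on = gate_on, .off = gate_off, .get_rate = gate_get_rate,
};

/* Consumer code is identical no matter which producer backs the clock. */
int uart_init(const struct clock *clk, uint32_t *rate_hz)
{
	int ret = clock_on(clk);

	if (ret != 0) {
		return ret;
	}
	return clock_get_rate(clk, rate_hz);
}
```

With three cells the spec is 12 bytes per clock regardless of how many cells a given producer actually uses, which is the overhead under discussion.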
Another alternative is that we could probably have some clever Kconfig magic to say the maximum number of clock cells for this build is 4 or 3, and parametrize it that way.

What does this buy us? Well, if we do that, then in your driver down here you just use a generic DT macro that gets the clocks by index rather than anything specific, and then I can call clock control on with just my clock, which decouples the consumer driver from the producer driver. I can now use that with a fixed clock, if I have the other change that turns the fixed clock into a clock control driver, and I can do that with other clock control providers.

It's not a super long talk; I want this to be a discussion. I know people have opinions, and the original pull request that I put up already had comments about being wasteful, so I wanted to get feedback from people who think that a couple of extra bytes per clock is too much in their system, for example.

Then, if we decide to go that route, how do we even go about it? This touches a lot of different things. Two options I can think of: either we introduce a new API, migrate everything, and then deprecate the old one, or we create one big pull request with sub pull requests on GitHub, collect all the changes, and merge that. I found some SoCs that don't completely encode all the relations for their clocks in DT yet, so we'd probably have to fix that if we want to go that route. They have info encoded in the HAL that would need some cleaning up.

And then the overhead is something to discuss. For the basic fixed clock, my math was that you have the struct device plus data plus the pointer that you currently have, and in a new fixed clock you have the same. For bigger clocks, if you have multiple cells, you go from your struct pointer plus the data pointer plus the data to a fixed-size 12 bytes extra plus the struct pointer. I'd be curious to hear from people who have systems with many clocks how many they actually have
and whether that actually matters and warrants having an API that's really hard to use. I'd also like to chat about whether people have ideas for how to do dependencies going forward. Once we make that change, it opens things up to think about power management. What happens if we want to turn off clocks but you have two things sharing a clock producer? Should we have some ref counting in the struct clock to deal with that? Ideas welcome. So I want this session to be half slides, half discussion.

So, I'm summarizing the question: can macrobatics fix the overhead I was talking about? I hope that's a correct summary. I did spend quite some time trying to come up with something clever, but I don't have all the answers, so I'd like to hear it if people have clever ideas for how to do it more dynamically. My best idea for dealing with that was not to use macrobatics but to use the Python, which has whole-system visibility, and generate something from that rather than trying to macrobat it. But I'm open to better ideas. I didn't write any of the code that does this right now. I have my proof of concept that I threw up on GitHub, and it works, but it's probably not the best implementation. That's what I'm here for.

I believe, you know, it's a question of the trade-off there. With five or six clocks, a few bytes is a reasonable way of doing it. We've wanted for some time to have some of this explored in the Python, like in this case what's the maximum number of clock cells, and similarly what's the highest number for other things, because that's another one where we've just had hard coding. I've come to think that we should figure out how to do some things like that, improving the Python to report the number of X, Y, Z.

OK, I'm going to summarize that for the microphone. Basically Jacob suggested looking at what pinmux is doing as a starting point for how to do that, and the second part that he suggested
was to use the Python as the way to get the maximum number of clocks, or clock cells in this case, that we really care about. Maybe we can generalize that so it applies to other pieces of the generation code.

So the statement was around DVFS and making sure we take that into account with the design. Do you have any specific things that you feel need addressing in the context of DVFS? I mean, generally with DVFS you care about performance, right?

Yes, sorry, can you repeat? So the question was whether I had considered the case where a consumer needs multiple producers. I think you'd model them as separate phandle-plus-data entries in your clocks array. I actually tried that for one of our systems, and it seemed to work with our current logic and my hacked patch to the generation script.

I'm rather new to the whole clocking in Zephyr, but is there a way that a device can say "I need a specific clock frequency"? Because a clock can be on or off but can also switch in frequency, and multiple devices can have multiple requirements on that. How does the system deal with that?

The question was whether the API supports setting clock rates, and how we deal with multiple consumers wanting different things. That falls squarely in the dependency bucket, I would say. The simple case of saying "hey, I want you to be 5 MHz" we support right now, and this would still support it, just in a less SoC-specific way. The dependency one is an interesting one, but I think nobody has solved the problem of two devices asking for different clocks and deciding who gets it. You need some sort of policy; someone needs to decide who gets what they ask for. Even when you do that on Linux there's no correct answer. It's a policy question.

I want to mention something here; it's not quite a question. I'm not an expert on clocks, but have you looked at the Linux side? Because on Linux we have a clock
controller framework, which has a clock structure and a set of operation callbacks where you do exactly that: power on, power off the clock, set frequency. Wouldn't that be useful for Zephyr also?

That's sort of what I'm trying to get to. Whether we give the functions exactly the same names or not is up for discussion, but the goal would be to have behavior similar to what Linux has. This is a work in progress toward being like Linux; not quite the same, but something similar. The difference with Linux is that there I can do runtime inspection of my devicetree, which I can't do here. That's why we need to do the code generation at build time; the struct clock has to be assembled at build time.

Have we considered doing the same, parsing the devicetree at runtime instead of doing it at compile time?

So there's always this question in Zephyr about a dynamic devicetree. There's no issue in general with supporting that; obviously that work would have to be done by somebody. But we have to maintain the ability to have a build-time database, because footprint in these systems is significant. If you look at just the blob size of a DTB, and then add in all the runtime code just to parse that blob, that's a very large footprint just for this purpose. Again, there are systems where that overhead is acceptable. We obviously support Cortex-A class systems, where you'd have megabytes if not gigabytes of memory, so on those systems it's OK to have that overhead. But for the vast majority of devices we're talking kilobytes of flash and memory, so those overheads aren't acceptable.

One of the other things, I think, as we look at improving or updating the clock interface, is how we encode, and then how the generator deals with
the SoC-specific clock descriptions and the static information. There are a lot of cases where people aren't necessarily changing the clocks, but you need to figure out what frequency this thing is at and so on, and everyone encoded it a little differently. One of the things I wanted to see whenever we updated the clock interface is how we could encode that and generate it in a standardized way, so that the information becomes uniform across these devices. And then I think dependencies are something we need to support as well with whatever we do going forward.

I don't have an answer.

It's fine, I wasn't expecting one. It was more a statement of something we need to think about as part of an updated design for the clock support.

Yeah, and it's interesting you mentioned pinmux, because pinmux sort of does this: it squashes info that would typically be multiple cells into one cell by having SoC-specific bit shifts and so on. That would definitely be an optimization we could look at if we were to give up on standard bindings, that is, on the idea that we want to use the same bindings as Linux or devicetrees that look similar. Then we could probably do something like that. That was one of the things when I looked at the ones that had three clock cells: do you really need three 32-bit values to encode clock information? Is it that much?

Yeah, I think obviously it's been left to the SoCs how that information is encoded, and it probably could be more compact for various ones, because a lot of times it's just specifying a bit in a register that changes for the specific device. So I think we could encode that better. And I don't know, just off the top of my head,
whether we'd have maybe SoC-specific plugins to the generator or something that could understand how to take that and give back frequency data or something, so that it would be consistent across everybody.

Yeah. Other questions? Yes.

Does it work? OK. Did you think of providing some kind of primitives for upper-layer clock usage, like synchronization between clocks or discovering relative drift or something like that? Think of what a counter needs, for example, or even the newly introduced RTC stuff. So it's more like a service layer for those. Of course I'm aware that clocks are not only about counting.

Yeah, I haven't looked at that. My goal as a first step was really to sanitize the API so we can do a couple of things. One example for me was really the UART: you can't just #ifdef "if I'm on this SoC do this, otherwise do that, otherwise do that". If you have that at the bottom of your driver, where you create your devices, it just becomes hundreds of lines of #ifdef, and it doesn't scale. So I wanted to address that first, and then I think we can build on that.

But do you think, from an architectural point of view, does it make sense at all to put something like that into a clock layer, or is this too high level?
We have things like the counter, and I haven't looked at this; I don't know, I'm sure there are others. So that may be something: if it makes sense to do something between the two, I can see that. But I think this layer is obviously just about turning clocks on and off, getting a rate, setting a rate, that type of functionality. There may be some glue between that and the counter system for dealing with things like skew and so forth, like you're mentioning.

Yes?

Do you have any examples of devicetrees and device drivers that use this new API? Something that is less abstract.

So, it's possible. I threw a work-in-progress thing on my GitHub. I don't know if I can pull it up live. OK, no, this doesn't work. All right, sorry. Yeah, M Fischer. On my GitHub I have a branch where I tried to convert over the ESP32-C2. It breaks everything else, so it's not a good example, but it did let me test on hardware.

So one question I have with the clock API and how we're looking at doing it right now: if we do, say, clock set rate in this API, a lot of clock producers have multiple sources they could clock off, in a more complicated case. If at early init or compile time you set up your clocks to use a specific mux, for example, and then you've got a divider, then in the initial implementation are we going to say that if you ask for a rate that's not possible with the mux currently selected for this specific producer, you just get an error saying this is not possible?

The way you do that is that everything ultimately starts off with a fixed clock somewhere, and you have a clock gate driver, a clock mux driver, and all of those basically call the parent provider until you get to the info you're looking for. Your clock mux would have to implement
logic to do the selection. I don't think we have an API for that right now, to select A or B.

Yeah, sure. I guess the point I'm trying to make is, as we rework this, one of my principal concerns is that dependencies are core to a lot of clock systems. If I go and ask the clock source for my UART for a rate that I know it can do, but the mux isn't set right, I don't want it to come back and say "no, sorry, too bad, your system wasn't set up right at early init". The other thing I want to highlight is that if we're going to follow the pin control case, one thing that's different there is that pin control has generally been designed, at least in my experience, around the idea that there's one pin controller on an SoC, whereas there are multiple clock sources. I mean, there's not always one pin controller on an SoC, and obviously you can play with the macro that you provide at compile time so that if you set this bit you're now dealing with pin controller A versus B. But I think it's something we want to be aware of, because pin control is a uint32, and I think there are probably some clocks that are going to outgrow a uint32 because they can get more complicated.

Yeah, I think the correct way to do that would be to reflect the clock muxing as a proper device, where the clock mux shows up as a separate device. That obviously makes reworking things difficult, because you have those SoCs where it's built into the SoC layer rather than the clock driver.

Yeah, it's true. And besides, even with a simple clock tree you would end up with ten more devices just for the clocks. Did you consider an alternative to having devices for clock muxes each time you have some muxing to do? Because each time we add a new device driver, at some point you add overhead that would be much larger than the overhead we would have with just additional cells.

I mean, I think one way you
could work around that is by having a bigger driver that encapsulates your clock functionality, a bigger producer driver, and then you could have device-specific overrides, like how you can do API overrides, where you say set mux channel zero or set mux channel one. Until we have a mux API, if we go there, you could do that as an intermediate stepping stone and still use the clock API for all the downstream pieces.

I just want to ask if you envision some possibility to set the accuracy the consumer requires of the clock. Sometimes it's not possible to generate exactly the requested clock, so you need to tolerate some amount of frequency error, and if there are multiple consumers it may be impossible to give all of them what they need.

So was the question about getting almost the rate you want, but not quite?

Yes, because otherwise, if it's just, say, one hertz of difference at megahertz scale, it will end in an error if it doesn't match the exact number. We often end up needing to tolerate some error in the case of a PLL and a complex set of dividers, for example.

Well, again, the current proposal doesn't look at this, but I guess you could encode that in your driver if the driver is tolerant. I think even with the old API, if you set the rate, the caller assumes it was set. I don't know if drivers usually go back and check whether they actually got the rate they asked for, but I suppose most don't. So at that point, stop? OK, stop.
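One way a tolerant consumer driver could do that check itself, sketched under assumptions: the driver asks for a rate, reads back what it actually got, and accepts the result if the error is within a parts-per-million budget. Both helpers below are hypothetical and not part of any existing Zephyr API, and the integer-divider model is only a stand-in for whatever a real producer can generate.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers; not part of any existing clock API. */

/* Closest rate an integer divider can derive from parent_hz (0 on bad input). */
uint32_t divider_closest_rate(uint32_t parent_hz, uint32_t requested_hz)
{
	uint64_t div;

	if (requested_hz == 0U) {
		return 0U;
	}
	/* Round to the nearest divider, but never go below divide-by-one. */
	div = ((uint64_t)parent_hz + requested_hz / 2U) / requested_hz;
	if (div == 0U) {
		div = 1U;
	}
	return (uint32_t)(parent_hz / div);
}

/* True if actual_hz is within tol_ppm parts per million of requested_hz. */
bool rate_within_ppm(uint32_t requested_hz, uint32_t actual_hz, uint32_t tol_ppm)
{
	uint64_t diff = (requested_hz > actual_hz)
				? (uint64_t)requested_hz - actual_hz
				: (uint64_t)actual_hz - requested_hz;

	/* Compare diff / requested <= tol / 1e6 without dividing. */
	return diff * 1000000ULL <= (uint64_t)requested_hz * tol_ppm;
}
```

For example, a 48 MHz parent asked for 1,843,200 Hz (16 x 115200 baud) lands on divide-by-26, about 1,846,153 Hz, which is roughly 1600 ppm off: acceptable for a driver that tolerates 2000 ppm, an error for one that insists on 1000 ppm.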