Good evening and welcome to our talk about ideas for fine-grained control over your heat budget. My name is Amit Kucheria and I'm presenting this talk with my friend and colleague, Daniel Lezcano. Say hi to the camera, Daniel. In this talk, we are going to cover some of the common use cases in thermal management in the mobile industry, take you through some of the caveats and limitations we have in the current frameworks, and propose solutions. So let's dive in. Here's the outline of the talk. We are going to go through some terminology and then discuss the goals for thermal management. We'll talk a little bit about the Linux thermal framework, how it is used in the real world and what its current limitations are. And then finally, we're going to get to our proposal to improve things going forward. All right. So here is a picture of what happens currently in the Linux thermal framework. The graph has two colors. The purple line is the HiKey board, an older board which is an octa-core Cortex-A53, and the green line is a newer board called the HiKey 960, which has four Cortex-A73s and four Cortex-A53s. Both of them run a workload that starts at roughly the five-second mark, and what you see immediately is how quickly the HiKey 960 jumps up to around the 80-degree mark. This workload is basically trying to generate as much work as possible and get the platform to heat up. What you see here is that within about 100 milliseconds the HiKey 960 reaches a temperature of 80 degrees, while the older platform takes roughly 30 seconds to reach that same temperature. And then what you see after that is the thermal framework throttling the CPU frequencies to keep the temperature in check. So this sets the stage for the talk: how do we prevent these kinds of situations and hotspots?
So let's get on with the terminology that we're going to need for this talk. The first term is the junction temperature limit, also known as Tj. Tj is essentially a threshold on the surface of your chip, the silicon, beyond which your silicon can get damaged. It depends on the manufacturing process of the silicon. We've seen boards or SoCs where this temperature is around 90 degrees, and I've seen Qualcomm silicon, for example, that I'm familiar with, where this number is about 120 degrees. Now, the platform in question may or may not have firmware that triggers an emergency shutdown if the temperature reaches this point. But the junction temperature is a threshold that you cannot allow your platform to cross; otherwise, the silicon will get damaged. The next term is the skin temperature, also known as T-skin. This is the temperature at which the device is perceived by the user to feel too hot. When you hold a mobile device or a tablet or even a netbook or a laptop, this is the temperature at which you start to feel that the thing is too hot in your hands. You've probably experienced this when trying to watch a video or play a game for any length of time: you see the back cover of your phone or tablet heating up. So this is a temperature you cannot cross, beyond which not only will the user feel that the device is too hot, but if you allow it to go too high, there is a risk of burns to the user. And of course, that will lead to liability issues. What you see in the picture here is a thermal photo of two phones. The one on the left shows no hotspots, so everything is yellow, green, blue. The one on the right has a big red hotspot where the SoC, the memory and the other parts of the chip are.
So the one on the left is a better-designed phone in that it's able to dissipate all the heat that's generated nicely and uniformly, so that nothing feels too hot to touch. Whereas on the one on the right, there's going to be this nice hotspot around the top of the phone. There are various standards which define how long and on what materials you can hold a phone without burning your skin. Obviously, the larger the form factor, the easier it is to dissipate all the heat, so the form factor plays a vital role in controlling skin temperature. The final term we need for this presentation is thermal design power, also called TDP. This is the power budget, in watts, to sustain performance on a chip. Typically, in a high-performance use case, you want to sustain performance at the TDP limit for as long as possible without overheating. And that is the core of what this talk is about: how to control and balance the TDP. The higher the ambient temperature, the lower your dissipation budget. So if you are using the same phone at an ambient temperature of 25 degrees versus 40 degrees, you've lost about 15 degrees of headroom, and your SoC will reach the T-skin and T-junction limits that much more quickly. Everything in the mobile world, and in electronics in general, anything a user will touch, is designed around TDP. Things are a little more flexible in the server and PC world, and I will come to that in a second. So what are the goals for TDP management? I'm going to divide devices into two types: ones which I call fixed-TDP devices and others which are flexible-TDP devices. Fixed-TDP devices are typically passively cooled devices. They have a small form factor, there are no additional fans or anything, and all they have to do is operate under the TDP budget. These are typically, but not necessarily, battery-powered devices.
And so you want to operate under the TDP budget without hitting the T-skin temperature and going over and above that; you want to sustain performance at the TDP budget. And depending on what use cases you're running, whether GPU-intensive or CPU-intensive, you want to dynamically balance your dissipation budget among the different components on the SoC. Then there are the flexible-TDP devices. These are actively cooled devices. They share all the goals of the fixed-TDP devices above, but they have some flexibility in that they can have internal fans or liquid cooling, or even external help: in the case of a server, they can sit in an air-conditioned room, which helps with the TDP budget because it cools the overall environment down. So the goal for TDP management is to operate as long as possible at a sustained performance while not regressing your use case. The use case in question can be anything, whether it's playing a video or whatnot. Most people are familiar with the major sources of heat: typically the CPU, the GPU, any connectivity IP like a modem or Wi-Fi, DSPs and accelerators, and lately neural processing units, plus the camera and memory. These are all sources of heat. So with these sources of heat in your typical device, what are the challenging use cases? I've listed a few here, mostly around virtual and augmented reality, some gaming, and the typical phone use cases where I'm taking a phone call, streaming a movie or simply web browsing. Each of these has multiple IP blocks on the SoC that generate heat while they are running. And the trick is to balance all of them so you can keep within the TDP limits and control the junction temperature as well as the skin temperature of the device. So let's take the virtual reality case in some detail. If you've ever used a headset, the headsets of the current generation are typically 1080p resolution at maybe 60 frames a second.
Since these are battery-powered, they have a limited dissipation budget, because the battery can only supply a limited amount of power and the processors can only run at a certain frequency, not more than that. The interesting thing with virtual reality is that as soon as you start dropping frames, people start complaining about dizziness. This entire use case is built around having extremely high resolutions and extremely high frame rates to mimic the eye in the real world. So under that limited thermal budget, if you start throttling your CPU or your GPU or your video decoder, you will start dropping frames, and that may cause dizziness. You basically regress your use case. The trick here is: how do you allow the GPU and the memory, which are most often the big consumers in this particular use case, a bigger share of the thermal dissipation budget compared to, for example, the CPU? There are other use cases here, like streaming movies. In this case, your connectivity IP, which could be your 4G connection, 5G connection or Wi-Fi, is constantly streaming in network data. You have your video decoder taking this data and putting it on the display through the GPU, and the CPU might be doing background tasks or participating in frame rendering in some form. All of these are competing to run, and all of them are generating heat. So you have to balance the thermal budgets of all of these for a nice experience where you don't exceed T-skin or T-junction. Having discussed the challenging use cases, let's take a look at what the current Linux thermal framework looks like. Here's a picture of the thermal framework; unfortunately, I've lost a couple of text blocks in slide conversion, my apologies. To the bottom right, the red box is essentially your sensor. The one above that is the thermal zone, and the light yellow box above that is essentially your governors.
So what we have here is a bunch of sensors on your SoC. These are physical sensors that send back temperature data from across the SoC. They are connected to the Linux thermal framework through thermal zones; that's the bigger red box. And then there are governors, which are basically the algorithms or heuristics used to try to cool down the device when you detect a hotspot at a given sensor. There are various governors you can choose from for different use cases, like step_wise, IPA (the power allocator), user_space, fair_share and bang_bang. The interesting bit is the cooling devices. What the thermal framework allows you to do is configure, for each thermal zone, some cooling devices which effectively throttle the performance of certain IP blocks, or activate external fans and whatnot. Again you have two types, as I said earlier: active cooling and passive cooling. In the active case, you might have a GPIO or PWM, or you just turn on a fan directly to cool down your device. In the passive case, let's take the example of a CPU and a GPU: you might want to slow them down. For the CPU you have two options in the kernel today, using the cpufreq framework and the cpuidle framework. For the GPU, there is the devfreq framework that you can use to do something similar. The kernel also has the concept of the energy model, which is basically a table mapping each performance state of a device or IP block on the SoC to its power cost. The cpufreq and cpuidle subsystems can look up the energy model table to make the right decision about what the right performance state for a CPU would be. And in the GPU case, there are now patches floating around where the devfreq subsystem also starts to use the energy model. So with this look at the thermal framework, let's look at something in practice, some measurements that we've done. So you have two diagrams here.
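Before we get to those measurements, the energy model mentioned above can be made concrete with a small sketch. It is just a table of (performance state, power cost) pairs; the frequencies and power numbers below are made-up illustrative values, not figures from any real SoC, and the lookup helpers are our own, not the kernel API.

```python
# A toy energy model: a table of (performance state, power cost) pairs,
# mirroring the kernel's idea of an energy model for one CPU cluster.
# Frequencies in kHz, power costs in mW (illustrative numbers only).
ENERGY_MODEL = [
    (500_000, 100),
    (700_000, 180),
    (1_000_000, 350),
    (1_200_000, 550),
]

def power_cost(freq_khz):
    """Power cost of the lowest performance state that satisfies the
    requested frequency."""
    for freq, power in ENERGY_MODEL:
        if freq >= freq_khz:
            return power
    # Request above the highest OPP: clamp to the top state.
    return ENERGY_MODEL[-1][1]

def max_performance_under(budget_mw):
    """Highest OPP whose power cost fits within a power budget, or None
    if even the lowest OPP is too expensive."""
    best = None
    for freq, power in ENERGY_MODEL:
        if power <= budget_mw:
            best = freq
    return best
```

With this table, a governor asking "what can I afford under 400 mW?" gets the 1 GHz OPP (350 mW fits, 550 mW does not), which is exactly the kind of decision IPA and EAS make from the real energy model.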
The one on the top shows a use case where we were setting the operating point from user space on a given platform. Let's assume the platform has eight operating points. We started with the bottom-most operating point, let's say 500 MHz, and then kept increasing the operating point, the frequency of the chip. You can follow that with the red dotted lines you see above. So you had an operating point, say 500 MHz, then you bumped it up to, say, 700 MHz, and you see a slight jump in temperature from roughly 30 degrees to slightly above 30 degrees. And so on and so forth; around the 600,000 mark on the time scale you see that the temperature has gone up to roughly 65 degrees, and going further, the temperature reaches nearly 80 degrees. Another thing to notice here is that the width of the red lines decreases as you go forward. That's because the workload being run was Dhrystone, which was basically told to do a given amount of work as quickly as possible. Obviously, as you increase the frequency of the processor, it takes less and less time to do the same amount of work, except when you reach around the 600,000 mark on the time scale. At that point, if you look at the last big peak, you see that the red line is actually quite wide. The reason is that we had reached the 80-degree limit, which was the junction temperature limit for the platform in question, and we started cooling down by scaling down the frequency. So what is happening here is that the frequency is constantly being throttled, then going back up and back down, over and over. And if you look at the diagram below, you see a corresponding oscillation in cooling state between 2 and 3. So you're basically switching between cooling state 2 and cooling state 3, and that can mean anything on a platform; in this case, I think it was three different operating points apart.
So you might be going from 1.2 GHz to 700 MHz and back and forth, over and over. That's the correlation between cooling states and temperature. Now, the real-world usage of all this, how the thermal framework is actually used. Currently it is used to throttle any available IP blocks, for both junction temperature and skin temperature mitigation. Then you have user-space thermal daemons that rely on these thermal framework interfaces. They read the temperature of each sensor from sysfs, and they have a way to set the cooling device state. So they can say: I want to go to cooling device state 1, 2, 3, and so on. These are opaque numbers; they don't mean anything other than that they map to a certain performance point. More sophisticated daemons might use knowledge of the use cases running on the system, whether it's a gaming use case or a web-browsing use case; they have ways of detecting that, and they might use it to throttle specific devices. So you might say things like: I'm running a game, it's mostly GPU-bound, I want to throttle the CPU. The thermal daemon can do that from user space. It might also send a hint to the GPU to drop resolution if we are heating up too much or draining the battery. Then there are non-Linux devices, devices that are not known to Linux. They don't run Linux, but they are present as part of the SoC, something like a modem. The daemon might send hints to the modem through a mailbox API to tell it to reduce performance if possible. So these are some of the ways in which the thermal framework is currently used. Now let's go to the limitations of what we have here. The number one limitation today is that when you throttle a CPU, if that CPU shares a clock line with other CPUs in a cluster, you might end up throttling the entire cluster and unnecessarily losing performance as a result.
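A user-space daemon of the kind just described boils down to a read-temperature/write-state loop over sysfs. Here is a minimal sketch: the `/sys/class/thermal` file names are the standard kernel interface, but the zone and device numbers, the 80-degree trip point, and the step-wise decision heuristic are all our own illustrative choices, not any shipping daemon's logic.

```python
from pathlib import Path

TEMP = Path("/sys/class/thermal/thermal_zone0/temp")          # millidegrees C
CUR_STATE = Path("/sys/class/thermal/cooling_device0/cur_state")
MAX_STATE = Path("/sys/class/thermal/cooling_device0/max_state")

def next_cooling_state(temp_mc, trip_mc, cur, max_state, hyst_mc=5000):
    """Step-wise heuristic: throttle one step while above the trip point,
    unthrottle one step once we are a hysteresis below it."""
    if temp_mc > trip_mc:
        return min(cur + 1, max_state)
    if temp_mc < trip_mc - hyst_mc:
        return max(cur - 1, 0)
    return cur

def poll_once(trip_mc=80_000):
    """One iteration of the daemon loop (only does anything on real hardware)."""
    temp = int(TEMP.read_text())
    cur = int(CUR_STATE.read_text())
    new = next_cooling_state(temp, trip_mc, cur, int(MAX_STATE.read_text()))
    if new != cur:
        CUR_STATE.write_text(str(new))  # an opaque state number, as discussed
```

Note that the number written to `cur_state` carries no meaning beyond "some performance point on this device", which is precisely the opacity problem discussed below.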
So one recent addition to the thermal framework is the concept of idle injection, whereby you can inject idle cycles on just that CPU without throttling the entire cluster. That might be a good way to deal with CPU-specific hotspots. The other limitation of the framework today is that the user-space daemons set opaque cooling device states. They just say 1, 2, 3. What this 1, 2, 3 corresponds to, nobody knows except the person who actually characterized that particular device. There's an engineer who has gone and tested the device under various use cases and decided that in the gaming use case, say, the CPU should be throttled to cooling device state 2. That's it. On a new device, it could be a completely different number. So it's really opaque, and it doesn't help in configuring the system for different types of devices running the same SoC. The same SoC could be running in a phone versus a tablet, and it will have completely different configurations. The other limitation currently is that the user-space daemons and the in-kernel governors compete for decisions. In a lot of cases, user space might simply override the kernel governors and confuse them. If you have something like a PID controller in the kernel, it might get completely confused because user space just arbitrarily programmed the cooling device state for a certain device, and now the in-kernel governor needs a while to catch up. Another limitation is that the in-kernel governors don't actually talk to each other or share information; they're using completely different interfaces. The key observation after all this is that skin temperature management is essentially a balancing of the TDP. It's a balancing act wherein you have to share your TDP between different IP blocks. And this TDP is in watts, not in temperature units. It's in watts.
You want to be able to say something like: this GPU has a budget of two watts, not: go hit 70 degrees and then I'll ramp you back down. So this is the key observation that we are taking into our proposal. What we are proposing is very simple. We are saying that we should separate thermal management from power capping. What is power capping? Power capping is essentially use-case-based throttling to hit certain TDP numbers. Any kind of TDP balancing is basically capping the power of an IP block. So what we are proposing is that junction temperature management should happen in the kernel and be taken care of by the thermal framework, whereas T-skin management, the skin temperature management, needs hints from user space and so should happen in the power cap framework. And with this, I'm going to hand over to my colleague, Daniel, to take us through the proposal of how the power cap framework can be used. So, yeah, my name is Daniel Lezcano. I'm working at Linaro, and I'm co-maintaining the thermal framework in the Linux kernel with Rui Zhang. So we have a proposal, and this proposal is to use the power capping framework that exists today. Let me introduce power capping very quickly. The power capping framework is a sysfs interface. It's actually an empty shell giving a unified API to user space through sysfs control files. We have a directory, and this directory is a node for a device, and it contains a set of files. You decide which files you want to create, but they are basically divided into two categories: the first one is the power limit, and the second one is reading back the power consumption. We also have information about what the maximum power consumption for this device is. The power capping framework is an empty shell, so it's empty, and it's up to the different controllers to implement the backend logic behind the power capping for the different platforms we have.
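On an x86 box you can already poke at this interface today. A sketch, assuming the standard `/sys/class/powercap` layout of the in-tree RAPL controller (the zone name varies per machine); the helper that turns two energy-counter samples into an average power is our own, not part of the kernel API.

```python
from pathlib import Path

# Standard powercap sysfs layout, as exposed by the intel-rapl controller.
# Zone names are machine-specific; intel-rapl:0 is typical for the package.
ZONE = Path("/sys/class/powercap/intel-rapl:0")

def read_limit_uw():
    """Current power limit of constraint 0, in microwatts."""
    return int((ZONE / "constraint_0_power_limit_uw").read_text())

def set_limit_uw(limit):
    """Cap the zone (needs root). Writing this file is the whole API."""
    (ZONE / "constraint_0_power_limit_uw").write_text(str(limit))

def avg_power_uw(energy1_uj, energy2_uj, dt_s):
    """The interface reports an energy counter (energy_uj); power is
    derived from two samples. This helper is ours, not a kernel file."""
    return (energy2_uj - energy1_uj) / dt_s
```

For example, two `energy_uj` readings 2,000,000 µJ apart taken 2 s apart mean the zone averaged 1 W over that window.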
So today we have the Intel RAPL driver, which gives us the possibility to limit and read the energy consumption of the processor or the memory through the MSR registers. That's the only controller we have today. By virtue of this directory structure, we automatically get a hierarchy, and we can set a constraint on a node, and with this approach the children obviously inherit the constraint. On the other hand, we also have user-space tools: a command line, which gives us an abstraction instead of having to deal with the sysfs internals directly, and a library, which simplifies the connection between the power capping framework and applications. So here we have a figure that shows a typical SoC with the different sources of heat on the system: the Package, DSP, GPS, modem and GPU. The idea is that when we create this hierarchy, the sum of the numbers at one level is equal to the parent's number: the parent has the same number as the sum of its children. The wattage numbers here are just an example: when we sum the children of Package we get 100, so the number for Package is 100. By doing so, we can measure the power of each device, sum them, and say: OK, the power consumption of Package is the sum of my children, and read the consumption of the Package directly. So we have the power capping framework, and on the other side we have the energy model, which was introduced for EAS, the energy-aware scheduler. EAS has to deal with the big and little CPUs and uses power information, energy information, to take the best decision about where to place tasks. The big cores consume almost eight times more power than the little ones, so the littles are very energy-efficient and the bigs are very powerful, with a very high compute capacity, and we have to take the right decision, otherwise we cannot save any power.
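The hierarchy rule just described, that a parent's number is the sum over its children, can be sketched as a small tree. The per-device wattages below are invented so that they sum to the 100 W Package figure from the slide; the class is our own model, not the kernel's.

```python
class PowercapNode:
    """A node in the powercap hierarchy: leaves report their own power,
    inner nodes aggregate their children."""
    def __init__(self, name, power_w=0, children=()):
        self.name = name
        self._power_w = power_w
        self.children = list(children)

    def power_w(self):
        if self.children:
            # Parent's consumption is defined as the sum of its children.
            return sum(c.power_w() for c in self.children)
        return self._power_w

# Invented per-device numbers chosen to sum to the 100 W Package example.
package = PowercapNode("Package", children=[
    PowercapNode("GPU", 45),
    PowercapNode("DSP", 20),
    PowercapNode("Modem", 25),
    PowercapNode("GPS", 10),
])
```

Reading `package.power_w()` then gives 100 W without the Package needing a sensor of its own, which is exactly the point of the hierarchy.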
So we really need information about the power consumption here in order to take the right decision. That's what the energy model provides today: it is basically a table giving, on one side, the performance states, which we can call OPPs, and on the other side the power consumed at each performance state. With this information, we know enough about what a device is consuming: we know its performance state, and thanks to the performance state, we know the power consumed. Since we have the performance states and the power, we can build a model of constraints with the powercap hierarchy. We don't show the hardware connections between the devices, we don't want to represent the floor plan; what we want to show is the constraints between these devices, in order to balance the power budget across the different nodes. As we saw in the figure before, we sum the numbers of the children and get the parent's number. So the parent's max power is the sum of the max power of the children, and the power limit of the parent is the sum of the power limits of the children. In this hierarchy, the leaves of the tree are the devices themselves, able to give their power consumption through the energy model. Of course, the devices do not all consume the same amount of power, and we need a way to characterize that. For that, we use a weight on each node of this tree. For each node, which is a component of the powercap directory, we can set a weight, and that characterizes the share of power it consumes when running, based on its maximum power. It's just a percentage: at each level, the weights sum to 1024. The 1024 is a common practice in the kernel, for optimization purposes when dividing, and it's a way to express a percentage of usage.
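The weighted split of a parent's power limit, and the rebalancing of unused budget that Daniel walks through next, can be sketched in Python. The weights (little 128, big 640, memory 256, summing to 1024) and the 2800 mW budget match the worked example on the slides; the two helper functions are our own sketch, not the actual prototype code.

```python
def distribute(limit_mw, weights):
    """Split a parent's power limit among children in proportion to
    their weights; weights at one level sum to 1024 by convention."""
    total = sum(weights.values())          # normally 1024
    return {name: limit_mw * w // total for name, w in weights.items()}

def rebalance_free(limits, used, weights):
    """Hand the unused budget ('free power') of under-consuming nodes to
    the others, in proportion to the remaining weights. Integer division
    may lose a milliwatt or two to rounding."""
    free = {n: limits[n] - used[n] for n in limits if limits[n] > used[n]}
    new = dict(limits)
    for donor, spare in free.items():
        others = {n: w for n, w in weights.items() if n != donor}
        share_total = sum(others.values())
        new[donor] -= spare
        for n, w in others.items():
            new[n] += spare * w // share_total
    return new

weights = {"little": 128, "big": 640, "memory": 256}
static = distribute(2800, weights)   # the static limits: 350 / 1750 / 700
```

With the big cluster consuming only 300 mW of its 1750 mW static limit, its 1450 mW of free power is redistributed roughly 33%/66% (128 vs 256) to the little and the memory, raising their dynamic limits to 833 mW and 1666 mW while the total stays at the Package budget, which is the behaviour the following slides step through.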
So here, in the example of the Package, we can see that the big cluster consumes much more power than the little, and for this reason its weight is higher. That means that if we set a power budget at the top level, the big node will get the highest power limit from the parent. Now let's show an example of how all this can work together, with a redistribution of the power limit. We restrict the example to something simple and already existing today: we have the Package, with big, little and memory below it, and we decide to set a power limit of 2800 mW on the Package. That leads us to this table. In the left column we have little, big and memory; in the middle column we have the weights of the different nodes, the weights we just saw in the figure. We use these weights to distribute the power limit we put on the parent, and that gives us the numbers in the static limit column. The next piece of information we have here is the dynamic limit. At the beginning it sticks to the static limit, and the constraint we have is that the total of the dynamic limits is always equal to the total of the static limits, so here it's always 2800 mW. Then we have a column with the power each device is using, and finally the free power column. And we can see here that the big device, the big cluster, has a very high power limit but is actually consuming very little power, so 1450 mW remain unused for this device. That's a waste of power: we should be able to give this power to the other children, giving them the opportunity to run at a higher performance state. For that, we take the free power we have on the big and apply the remaining weights, 128 and 256, which is roughly 33% and 66%, and so we give 33% of the free power to the little and 66% of the free power to the memory. Here we
have the dynamic limits, which change, but the total is always the same as the static limit, which is what we want. The used power stays the same, it does not change, and the free power is of course the same, because nothing has changed its power usage yet. Now the little actually has some room for power and one more performance state, so it increases its power, and it happens to reach a point where it settles at an OPP consuming 766 mW. Some power remains unused, but that's fine: the little doesn't need more, the used power is under the limit, which is what we want, and the power is distributed among the different children. Now the big wants more power, and because it wants more power, it reaches the point where it is consuming 1450 mW, and here we break the power limit: we are above the limit, so we have to do something. The first step is to reset the dynamic limits. Now the big is OK, it's within its limit, but the little is above its limit, and we see that we still have 300 mW to be distributed. So we apply the same weights again: 33% of the 300, which is 100, goes to the little, and 66% of the 300, which is 200, goes to the memory. We are still violating the power limit, but we still have 200 mW which the memory doesn't need, so we can take those 200 and give them to the little. We are still above the limit, there is no more free power, and we have one device which is above its own limit, so we have no choice: we need to cap it. We cap it, which reduces the used power, and now the used power is equal to the dynamic limit, which is equal to the static limit of the entire Package. So we saw that by doing this, we are able to put a constraint on a node and let the children balance their power, giving power away and taking it back. But we might also want to guarantee some bandwidth to a child: by setting a constraint on the parent and also setting a constraint on one child, only that one child, then we
can enforce this power limit, so the distribution happens at the expense of the children which did not set their own power limit. In that way, we dedicate bandwidth to a child. We also saw that we can create different constraints for the same node, several constraints, and that is unlimited: we can put in any constraints we want, and for each constraint we can define a semantic. So we have constraint 0, which is what we just described, setting the power budget. But now we can say: I set this power budget, but I will let this power budget be broken for a certain amount of time. That allows, for example, the page rendering to go through without capping the power: we absorb the peak of load without capping everything. That gives us some room; it's like a turbo mode. We also have the power limit timeout: if we have a lot of constraints, we can say, OK, we set this constraint and it will be removed automatically after a while, and that might help user space manage the constraints. So what's the status today? We have the energy model. The energy model is relatively new; it was introduced with EAS, coming, I believe, around kernel 5.0, I'm not sure, but it's recent and it is still evolving. There is currently work from Lukasz Luba trying to generalize the energy model beyond CPUs to other devices, which is a very good thing, because that means we give every device on the system the opportunity to define its energy model if it wants to. What would also be needed is a power QoS, because we might have different devices belonging to the same voltage domain, and we might want to set the power for this domain without impacting the other devices. We want to do that like the frequency QoS we have today in the PM QoS framework. Obviously, if we want to have something unified between power cap and the energy model, something consistent, we should add get and set power callbacks inside the energy model structure in order to abstract this operation
and let the power cap plus energy model combination be as generic as possible. Also, the GPU support is relatively new, and we need to improve the metrics with the load and make sure they are consistent. Because let's imagine we are at the highest OPP on the GPU: it's consuming a lot of power, but actually the load is 80%, so the dynamic power is 80% of the OPP's cost. If we don't take that 80% into account and we consider that we are consuming 100% of this OPP, that's a 20% difference, and 20% on the GPU is a big amount of power, power we need in order to balance across the other children. Concerning the work we are doing with power capping using the energy model: we are working on a prototype, and right now we are at the level of automatic power rebalancing, which is a bit complex. It's working, but we still have issues with it. The next step, once that is fixed, is to do power versus performance measurements, and of course we have to compare that with the existing solutions we have with the thermal daemons. In the future, we of course need more devices supporting the energy model, and that is challenging because we have to include the non-Linux devices on the SoC: for the modem, for example, the discussion goes through mailboxes. We also need a way to describe these constraints in the DT, so that every SoC vendor is able to define the constraints via the DT and apply their own logic there. So, as a conclusion: the thermal framework is a bit abused by the different tools to manage the TDP of the entire SoC, and for this reason we think we should restrict the thermal framework to its primary goal, which is to protect the silicon and prevent hotspots. On the other hand, we saw that we have more and more complex SoCs with a lot of sources of heat, and managing that is a problem, because they are getting more and more complex and the thermal framework does not fit very well. And we want to preserve the skin temperature. That means there is a hole
here, there is a missing feature, and we do believe the power cap framework with the energy model is a good solution, because we can model the constraints while not pretending to know how the SoC works: we can delegate that to user space, where the SoC vendors are the ones who have the best knowledge of how their SoC works. Also, the system is watt-centric, and because we can add more constraints, we have an extensible solution. So power capping with the energy model: we believe it is a good solution, and one that can make life easier for the SoC vendors. Thank you. We welcome any questions if there are any at this point; we are slightly over time, but we still have four minutes for questions. Question: do you expect the user-space implementation to be open source? So, if you are talking about a thermal daemon that would use this, we don't see why not. You could have an open source implementation which would drive the power cap framework from user space. Having said that, there is nothing preventing somebody from building a closed-source solution as well. I think we don't have any more questions, so if you want to reach us offline to ask any question, we are available for that; you can also reach us by email. So thank you very much for attending this session, and I hope this proposal raised some interest. Thank you. Amit, do you want to say something? We are available on Slack, you also have our email addresses, and the PDFs are available on the sched.com website, so feel free to contact us. Thank you.