Hi, my name is Tara. My work primarily focuses on power management and thermal management frameworks in the Linux kernel. I will let my co-speaker Ram introduce himself.

Hey everyone, this is Ram. I am from Qualcomm, on the Power and Thermal Software team. I work on the thermal software solution for all Qualcomm Snapdragon chipsets, from the Linux kernel all the way to the proprietary Qualcomm thermal daemons in user space and the Android world.

I don't think we could hear Ram. Somehow I couldn't hear him. Anyway, this session is about why some modern SoCs cannot handle really low temperatures. By low temperatures I mean zero degrees Celsius or below. Ram will walk us through some of the hardware issues we face at these temperatures, and then I will take you through how to extend some of the Linux frameworks to handle these scenarios. So over to Ram for now.

Hey everyone, let's jump straight to the problem statement. To understand this problem, we need to understand a concept called timing closure. Any logic circuit is composed of gates, and each gate has its own delay. When you put those gates together, the total delay the circuit takes to propagate an input all the way to the output is the accumulation of the delays of the individual gates. So when you design a logic circuit, the important constraint is that this total delay must always be less than the period of the synchronizing clock.
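As an illustration of the timing-closure condition just stated, here is a minimal sketch in C. It is not from the talk: the function names, the picosecond and kHz units, and the single-path model are assumptions made for the example, and real timing analysis involves far more than summing gate delays.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Accumulated propagation delay along one path, in picoseconds.
 * Hypothetical helper: models the path as a simple chain of gates. */
unsigned long path_delay_ps(const unsigned int *gate_delay_ps, size_t n_gates)
{
    unsigned long total = 0;

    for (size_t i = 0; i < n_gates; i++)
        total += gate_delay_ps[i];
    return total;
}

/* Timing is met when the accumulated delay is strictly less than the
 * clock period. The clock is given in kHz, so period_ps = 1e9 / f_kHz. */
bool timing_met(const unsigned int *gate_delay_ps, size_t n_gates,
                unsigned long clk_khz)
{
    assert(clk_khz > 0);

    unsigned long period_ps = 1000000000UL / clk_khz;

    return path_delay_ps(gate_delay_ps, n_gates) < period_ps;
}
```

With three gates of 120, 150 and 130 ps, the path delay is 400 ps, which fits inside the 500 ps period of a 2 GHz clock but not inside the 400 ps period of a 2.5 GHz clock.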
If that is not met, the logic circuit starts to behave unexpectedly, and you will see unexplained behavior or unexplained logic results. So to ensure the correctness of any logic circuit, timing closure has to be met. This is a widespread concern across process nodes, from 28 nm down through all the smaller nodes, and pretty much all Qualcomm Snapdragon chipsets, from the premium tier all the way to the low tier, have to meet timing closure.

Let's see how temperature influences timing closure. In the normal operating temperature range, say 30 to 40 °C, we characterize combinations of voltage and frequency that meet timing. But the same voltage and frequency combinations may not meet timing at really low temperatures, and by really low I mean less than 0 °C. To solve this we could simply increase the voltage for a given frequency across the whole operating temperature range, but that is not a power-efficient solution. Instead, we split the operating temperature range into two. One is the normal range, above 0 °C, where we can still use the lower voltage for that frequency. The other is the range where the temperature goes below 0 °C.
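The idea of splitting the temperature range in two can be sketched as a per-frequency voltage lookup. This is only an illustration under stated assumptions: the struct, the field names, and the choice to treat exactly 0 °C as part of the normal range are hypothetical, and any voltage values used with it are made up rather than taken from a real chipset.

```c
#include <assert.h>
#include <stddef.h>

/* 0 degrees Celsius, in millidegrees (assumed boundary between ranges). */
#define COLD_THRESHOLD_MC 0

/* One operating level with a voltage for each temperature range.
 * Hypothetical layout, for illustration only. */
struct opp_voltage {
    unsigned long freq_khz;
    unsigned int normal_uv; /* microvolts at or above the threshold */
    unsigned int cold_uv;   /* raised microvolts below the threshold */
};

/* Pick the voltage for this operating level at the given temperature. */
unsigned int select_voltage_uv(const struct opp_voltage *opp, int temp_mc)
{
    assert(opp != NULL);

    return temp_mc < COLD_THRESHOLD_MC ? opp->cold_uv : opp->normal_uv;
}
```

The point of the split is that the higher voltage is paid for only in the cold range, while the normal range keeps the lower, more power-efficient voltage for the same frequency.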
Below 0 °C, we can either increase the voltage for that frequency or, if the IP has multiple operating levels and only certain operating levels have this issue, disable those operating levels in the sub-0 °C range. Those are the two solutions we can use to achieve timing closure at really low temperatures.

If you think about it, this is entirely opposite to what we normally use the thermal framework for. Normally you use the thermal framework to monitor a high temperature threshold, and when the temperature crosses it, we place a cap on a particular cooling device. Here, when the temperature drops to 0 °C or below, we want to put an operating-voltage floor or an operating-level floor on a particular cooling device. With this I will hand over to Tara.

Yeah, so I would now like to take you through the existing frameworks in the Linux kernel that can support these hardware requirements. Then I will walk you through our proposal for extending these frameworks to meet these requirements, and touch a bit on the status of this work.

The Linux kernel already has a thermal management framework that allows you to monitor temperature as well as initiate a mitigating action. The only caveat is that the entire framework is built around mitigating rising temperature: by default, it checks whether the temperature is rising above a stated threshold.
If the temperature rises above the threshold, a mitigating action is triggered. For the scenario Ram just explained, where we should not let SoCs drop below a certain temperature, what we need from the Linux thermal management framework is that it should also handle falling temperature: if the temperature falls below a stated threshold, the framework should trigger a mitigating action. And in this case the mitigating action should be a warming action, not a cooling action as it is for rising temperature. So the two points we will touch on in our proposal are for the framework to handle falling temperature, and for the framework to support warming devices just as it supports cooling devices.

Moving on to the next slide. Before we go into the proposal, I would like to walk you through some of the fundamental pieces of the thermal management framework. The framework is built on the concept of a thermal zone, which is a software concept: an area or device in the SoC that has some kind of thermal constraint. An SoC can have multiple thermal zones, and the thermal core can be considered the manager that handles the temperature requirements of these thermal zones. Then there are the sensor devices, which are devices with temperature-sensing capabilities, and their associated drivers, which provide the temperature data to the thermal management framework.

Moving on, there is the concept of trip points. A trip point is a point in the temperature domain, a threshold upon crossing which the software is supposed to initiate a mitigating action. The trip point itself does not specify any mitigating action.
It is just a threshold in the temperature domain. Associated with each trip point are cooling devices, which actually control the mitigating action. They are the devices that provide control over power dissipation. They can be actual hardware devices like a fan, or they can be software-initiated cooling actions. A cooling device typically also has a range of cooling states. In the kernel, the lower the cooling state of a device, the milder the mitigating action, and as the cooling state increases, the effect of the mitigating action also increases.

To tie all this together, there is the concept of a thermal governor, which is the algorithm that manages the temperature: it decides how to monitor the temperature, how frequently to monitor it, and finally what mitigating action to take at which point. So a thermal zone is a combination of all of this. It has a sensor that gives it the temperature data, a set of trip points, and for each trip point a set of cooling devices that the governor can trigger to initiate a mitigating action if the temperature goes above the trip point.

What we need to support this hardware requirement, that the SoC should not fall below a certain temperature, is that instead of monitoring for rising temperature, the framework should monitor for falling temperature.

Moving on to the next slide. We now go through the specific parts of the thermal management framework that need extension to support this. The first part is the trip point.
In the existing framework we support trip points on the hot side only. As I explained, a trip point is a point in the temperature domain upon crossing which the system takes a cooling action. There are various kinds of trip points in the kernel today. There are the active and passive trip points: an active trip point means the cooling device that is invoked is an active cooling device, and a passive trip point means the cooling device that initiates the mitigating action is a passive cooling device. Then there is the hot trip point, which means the system is in an emergency, and the critical trip point, which tells you the system is unreliable.

What we are proposing as an extension is adding a cold trip point: a point in the temperature domain upon crossing which the system should undertake a warming action, not a cooling action. There are multiple ways to handle this. One is to introduce a new trip point type called the cold trip point. The other is not to introduce a new type, but to add a new property to the existing trip point structure that says whether a trip point should be monitored up or down, that is, whether it triggers when the temperature rises above the trip point or when it falls below it. We think the best way is probably to introduce the new cold trip point type.

As far as the status of this work is concerned, not much has been done. There are a few prototypes, but we have not posted any code as such for now.

Moving on. The next part of the thermal framework that would need extension is cooling devices and drivers.
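The cold trip point proposal described above could be sketched roughly as follows. This is a self-contained illustration, not kernel code: the enum, struct, and function names are hypothetical, and treating a temperature exactly equal to the trip temperature as a crossing is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Trip point kinds named in the talk, plus the proposed cold type.
 * Hypothetical names; not the kernel's own definitions. */
enum trip_type {
    TRIP_ACTIVE,
    TRIP_PASSIVE,
    TRIP_HOT,
    TRIP_CRITICAL,
    TRIP_COLD, /* proposed: crossed when temperature falls */
};

struct trip_point {
    enum trip_type type;
    int temp_mc; /* threshold in millidegrees Celsius */
};

/* A cold trip is crossed when the temperature falls to or below the
 * threshold; every other type is crossed when it rises to or above it. */
bool trip_crossed(const struct trip_point *trip, int temp_mc)
{
    assert(trip != NULL);

    if (trip->type == TRIP_COLD)
        return temp_mc <= trip->temp_mc;
    return temp_mc >= trip->temp_mc;
}
```

The alternative mentioned in the talk, a direction property on the existing trip point structure, would replace the type check with a per-trip "monitor up or down" field; the comparison logic would otherwise look the same.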
In the existing framework we have two kinds of cooling devices and drivers. One is active cooling devices, which are actual physical devices like a fan, some sort of GPIO line, and so on. Then there are software-based cooling mechanisms like cpufreq cooling, where you say that if the temperature is rising, the CPU cannot operate at certain high frequencies or high operating points, and also idle injection.

For the Qualcomm Snapdragon SoCs that we have been working on, we need warming mitigation: if the temperature is going really low, we want to stop that from happening. Of the warming actions we have seen so far, one kind is software-based warming mechanisms. One example is generic power domain based warming, where you say that certain power domains have to be kept on, or kept at a particular state, to maintain the system at a particular temperature or to stop the temperature from dipping. Another feature we require is to prevent certain devices from going into lower voltage levels. For example, for CPUs and GPUs we would restrict the minimum operating point and say that the device cannot operate below a certain voltage level. Other than that, there are also resource-specific warming mechanisms, which just means turning a resource on or off. Sometimes that resource doesn't even reside inside the SoC; it may be an outside resource, but you turn it on to maintain the SoC at a particular level.

As far as the status of this work is concerned, on the Qualcomm SoCs we have already upstreamed a portion of the Qualcomm AOSS (always-on subsystem) based warming devices. They are just resources that need to be turned on, so those have been upstreamed. Other pieces are in progress.
There are patches that have been posted for a generic power domain based warming device driver, which means that if a power domain can be used as a warming device, it just needs to register with this driver and it will fit into the thermal framework. Those patches are out there and being reviewed. Then there is the big piece that would disallow devices from operating at lower voltages, on which work has not yet started.

Having talked about all this, I think we should also talk about the term "cooling" in the thermal framework. As I said before, the entire framework is based on the premise of monitoring rising temperature, and hence the mitigating action is cooling. Now that we also need to monitor falling temperature, the mitigating action is no longer only cooling; it can also be warming. So terms like cooling devices and cooling maps probably need to be revisited to see whether they still make sense and whether we can reword them into something better.

The other piece of the framework that would need extension, and it is a big piece, is the thermal governor itself. The existing framework supports various governors: step-wise, IPA, bang-bang. They all differ in how they monitor and how they initiate the mitigating action, but the theme across all the existing governors, as I said, is monitoring rising temperature: if the temperature goes above a threshold, trigger a mitigating action. Monitoring can be interrupt-based or polling-based, and the frequency of polling and monitoring differs across these governors.
The way the mitigating actions are triggered also differs, but each governor just monitors for temperature going above a certain threshold. The extension we would like to see in the thermal governor is support for monitoring and mitigating falling temperature: the governor should monitor whether the temperature is falling below the stated cold trip point, and if it is, it should trigger the warming action as specified. This could be a separate governor rather than any of the governors I mentioned that already exist in the kernel. But our initial proposal is to extend step-wise to do this, because we think the step-wise governor is fairly straightforward and can handle it.

As far as the status of this work goes, and I don't know how we missed the link here, an RFC proposal with some patches was posted a few months ago, but not much has been done since. So it is still in a not-yet-started state.

I think we missed the conclusion slide. In a nutshell, what we are trying to say is that there is this hardware requirement: the circuits in the SoC can face issues with timing closure at extremely cold temperatures. We think the Linux kernel already has frameworks that can support this and mitigate this constraint; the thermal management framework just needs some extension to handle it. This was a short presentation, so that is it from us. Thank you very much for tuning in. We will be happy to take questions.
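The governor extension described in the talk, monitoring a cold trip point and stepping a warming action, could look something like the sketch below. It is loosely modelled on the idea of stepping a device state up and down one level per evaluation; it is not the kernel's step-wise governor, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* A warming device with a range of states; 0 means no warming action.
 * Hypothetical type, for illustration only. */
struct warming_device {
    unsigned int state;
    unsigned int max_state;
};

/* One governor evaluation: step the warming state up while the
 * temperature is at or below the cold trip, step it back down once the
 * temperature is above it. Returns the new state. */
unsigned int cold_governor_step(struct warming_device *dev, int temp_mc,
                                int cold_trip_mc)
{
    assert(dev != NULL);

    if (temp_mc <= cold_trip_mc) {
        if (dev->state < dev->max_state)
            dev->state++;
    } else if (dev->state > 0) {
        dev->state--;
    }
    return dev->state;
}
```

Called once per polling interval or temperature interrupt, this raises the warming state one step at a time to `max_state` while the zone stays cold, and relaxes it one step at a time after the zone warms up. A real governor would also take hysteresis and the temperature trend into account.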
We are on the Slack channel and can chat there. We would also be interested to know whether anyone else has such requirements, and whether there is anything we did not cover that you would like to see extended in the thermal management framework for this requirement.

I see a question from Anup: have you tested the performance drop when the system is in warming mode, because the performance goes down on the warm spectrum? I am not sure what you mean by performance going down on the warm spectrum. But no, we have not tested the performance drop. We have not done any performance testing with this per se.

Another question asks whether the slides will be available online. I want to say yes to that, but you would want to check with the ELC team.

Okay, what else? There is a question that asks how we plan to handle power consumption when we are trying to increase heat. The situation we are talking about is one where we want leakage. We want static leakage. We want the system to heat up. We are not really concerned about power savings here. We are concerned about spending some power to keep the system running and alive. Unless, Ram, you want to add something more here.

Yeah, we just wanted to distinguish what we are trying to do here. This is not something to enhance performance or anything of that sort. This is about the reliability of the operating device. Restricting those lower operating points is all about making sure the hardware is reliable, and at that point we are not concerned about the power usage.
Power savings can be pursued in the normal operating range, but when you go to the lower temperature range, it is all about reliability. Reliability takes precedence over power.

There is this question: is there a risk of overcompensation, where you keep oscillating between cold and hot thermal zones? We have a hysteresis around the cold temperature monitoring so that we don't see the ping-pong effect that I think you are referring to.

And just to add to that: if by cold and hot you mean the cold trip points versus the hot trip points, on an actual system the gap between them is really large. The hot trip points might have thresholds somewhere around 90 °C, whereas the cold trip point would have a temperature of less than 5 or 10 °C. So you won't really see a ping-pong between cold and hot trip points. As for moving in and out of the cold trip point, as Tara mentioned, we will have a hysteresis so that we don't relax the mitigation until we actually clear the hysteresis temperature. So there won't be a ping-pong effect in general.

There is this question from Kyle: for most systems I have worked on, the minimum operating temperature has been determined by non-processor components. Is this work intended for use on all systems, or on specialty systems with dedicated warming mechanisms? There is no dedicated warming mechanism as such; there is no external warming mechanism that we are talking about.
All we are talking about is the operating voltage of the digital rails that a particular SoC is using. When the temperature reaches the cold threshold, the proposal is just to increase the operating voltage of the digital rails. That's all we are trying to do. There is no external warming mechanism involved here.

There is this other question: if the power consumption increases due to warming, how long does it take before battery-level estimation adjusts? I'm not sure about this, actually. I don't know how soon the algorithm can react. I can probably get back to you on this one; I don't know the answer right away.

If there are no other questions, we can end the presentation at this point. Thanks, everyone. Thanks for joining us.
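The hysteresis around the cold trip point mentioned in the Q&A can be illustrated with a small state machine. This is a sketch under stated assumptions, not the speakers' implementation: the struct and function names are hypothetical, and releasing mitigation exactly at trip temperature plus hysteresis is a choice made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Cold trip with hysteresis, temperatures in millidegrees Celsius.
 * Hypothetical type, for illustration only. */
struct cold_trip {
    int temp_mc;     /* start mitigating at or below this temperature */
    int hyst_mc;     /* keep mitigating until temp_mc + hyst_mc is reached */
    bool mitigating; /* current mitigation state */
};

/* Update the mitigation state for a new temperature reading and return
 * it. Mitigation starts at the trip temperature and is relaxed only
 * after the temperature clears the hysteresis band above it. */
bool cold_trip_update(struct cold_trip *trip, int temp_mc)
{
    assert(trip != NULL && trip->hyst_mc >= 0);

    if (!trip->mitigating && temp_mc <= trip->temp_mc)
        trip->mitigating = true;
    else if (trip->mitigating && temp_mc >= trip->temp_mc + trip->hyst_mc)
        trip->mitigating = false;
    return trip->mitigating;
}
```

With a trip at 0 °C and 5 °C of hysteresis, a reading of 3 °C leaves the state unchanged in either direction: mitigation does not start there on the way down and does not stop there on the way back up, which is what prevents the ping-pong effect around the threshold.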