My name is Srinivas Pandruvada. I am a software engineer at the Intel Open Source Technology Center. I work with Jacob Pan and Len Brown, who helped me prepare some of this presentation. Today I'm going to talk about power capping in Linux. I'll present some approaches and their relative performance, and also some of my recommendations at the end. The agenda: we'll start with the context, then look at system power management issues in general, then a power capping overview, then the power capping participants and my recommendations, and finally a brief overview of the Linux power capping framework RFC.
Let's start by looking at a very typical problem faced by a power planner. The major issue is that the load is unpredictable, and once the load is unpredictable, the power demand becomes unpredictable too. That leads to a tendency to plan for the worst case. But the worst case is not the true figure, because it is not the load you actually consume for computation, so the result is over-budgeting. Let's look at how this over-budgeting affects power planning. This news clip is from the New York Times (nytimes.com). It says that 30 billion watts of electricity is used to power digital warehouses, which is equal to the output of 30 nuclear power plants. Out of that, only 6 to 12 percent is actually used for performing computation. The rest is just a safety margin to take care of the worst case, so that if something happens in the load the systems will not crash. These numbers may look pessimistic, but when I googled around I saw numbers very close to these: the utilization rate is very low all the time, and a lot of power is over-budgeted. The same thing is true in the mobile arena. One cause of poor user experience is battery life.
Suppose a device says it can last six hours. You cannot predict it: depending on the load, it may last anywhere from one hour to six hours. The window is very large. It can also cause safety issues. I think this report is from the Daily Mail. It is about the iPhone 4S, but I am sure this can happen in other devices too. When there are unexpected spikes in the load, batteries can overheat and cause safety issues, in addition to the poor user experience.
So what are the system power management issues in general? We have limited power capacity and limited cooling capacity. Both are limited either by finances or by physics. We know that systems can have a lot of unexpected peaks. Those peaks cause power swings, and the power swings can trip your circuit or bring down your systems. As I said, in mobile devices there is also the unpredictability of battery life. And now there are too many consumers of power. If you take any system and come up with a power budget by adding up the rated power of each component, you will over-budget by a lot. But if all those consumers turn on at the same time, the system cannot handle it. So that is one of the issues: all the components are rated high, but you really don't need that much power.
So what is power capping? Power capping is just a way to limit the power consumed by devices. It can be servers, mobile devices, anything. It is not only for total system power; it can also limit the power of a subsystem inside. Initially you may come up with some static limits, but once the system is operational you may have to readjust those limits, because thermal issues or other things may force you to.
Another important aspect is that you should be able to redistribute power among server systems. Suppose one server is running a very time-critical load and some other servers are relatively idle. You can move additional power budget from the idle servers to the busy one, so that the server doing the time-critical computation can run at full throttle and give better performance. Redistribution like this helps maximize performance. This is true not only across servers; redistribution is possible inside a single system too. You have DRAM, a CPU, and a GPU. If your load is CPU intensive, you can redistribute some of the power from DRAM to the CPU so that the CPU can run at a higher speed. Similarly, if you have a gaming engine, you can take some of the power from the others and give it to the GPU.
This diagram represents the typical usage model. At the lower level there is the Linux kernel and sysfs. There is the power capping sysfs, which I am going to present at the end. There may be other sysfs interfaces present: thermal zones or thermal sensors, and a power supply interface to read the current power or charge levels. All these readings are sent to, or polled by, some user space agent, something like a power agent or a thermal management agent. That agent may in turn get power readings from instruments, or temperature readings, and the user space thermal agent may communicate with thermal agents on other servers and provide you with management interfaces. A lot of solutions already exist at the user space level to redistribute power, set limits, and provide a management console. We will limit our discussion to the kernel interfaces.
So what are the power capping participants? Participants are the devices we can power cap.
The devices listed here are just examples, not a complete list. We can power cap CPUs, we can power cap GPUs, we can power cap DRAM, and there are lots of others: the multimedia subsystem can be one, and the wireless subsystem can be one of the biggest power consumers. But we will limit our discussion to CPUs, GPUs, and DRAM, because together they can consume more than 55 to 60% of the power. So we will concentrate on those three. When I googled around, I saw that lots of technologies already exist to power cap wireless, so this can go beyond CPU, GPU, and DRAM and cover lots of other components.
So how do you power cap CPUs? We can adjust the P-states; we will see what those are. We can adjust the T-states. We can offline processors. We can inject idle time. There is the ACPI power meter, which is already used in many places, and the ACPI processor aggregator. For CPUs, DRAM, and GPUs there is one method introduced by Intel called RAPL; we will see what that is. Also, if you use the Intel i915 driver, you can do P-states on the GPU too.
This is our test setup. We used an Intel Ivy Bridge system with Linux 3.11-rc3 and an external power meter. Our test load for all the slides is the OpenSSL speed test with SHA-256. For one of the slides we also used a server system, because we needed lots of CPUs to offline. For the majority of the slides this is the test system.
Let's look at what a P-state is. A P-state should not be confused with a frequency. In Intel processors a P-state is a voltage-frequency pair. If you look at the diagram on the right, the operational states from Pn to P1 are under OS control. The OS can request any of those states and will probably get it. But between P1 and P0 are the turbo, or hardware-controlled, states. It is totally at the hardware's discretion whether to give you those operating points or not.
The turbo range is very important, because as the number of cores increases, the number of sub-steps between P1 and P0 keeps increasing too. Let's look at how this P1 to P0 range plays a role. This slide gives you an example of how there may be many sub-states between P1 and P0. The dotted line is the guaranteed performance, which the OS can request and get. But between P1 and P0, all those points like F1, F2, F3, and F4 depend on how many cores are active and on your thermal headroom. If you have more thermal headroom, you can probably get more. What performance to give is decided dynamically by the processor. The reason I discuss this slide is that between P1 and P0 the performance is not under OS control. So if you need a short-term way to reduce power, you can disable that range. It will impact performance, but it is an option.
Then how do we control P-states? Intel introduced the intel_pstate driver. It provides a sysfs interface for limiting performance by percentage. You have a maximum performance percentage and a minimum performance percentage; you set values, and the performance stays within that window. You can also disable turbo. The advantage of the P-state driver is that it applies to all the cores in your system; you don't need to set each CPU independently. Then there is the cpufreq subsystem, which has been in Linux for a long time. It lets you set the maximum and minimum scaling frequencies, and you can also set the current frequency. cpufreq lets you set these per thread, per CPU. But in Intel processors the frequency is shared, so it's not realistic to control the frequency of one particular CPU.
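As a rough illustration of the intel_pstate percentage window and turbo switch described above, here is a minimal Python sketch. It assumes the driver exposes the files max_perf_pct, min_perf_pct, and no_turbo under /sys/devices/system/cpu/intel_pstate, as mainline kernels do; the talk itself names none of these files, and the function names are hypothetical. Writing to the real directory needs root.

```python
from pathlib import Path

# Assumed location of the intel_pstate sysfs directory (not named in the talk).
PSTATE_DIR = Path("/sys/devices/system/cpu/intel_pstate")


def clamp_pct(value):
    """Clamp a percentage to the 0..100 range."""
    return max(0, min(100, int(value)))


def set_pstate_window(min_pct, max_pct, no_turbo, base=PSTATE_DIR):
    """Write a min/max performance window and the turbo switch.

    Returns the strings that were written, keyed by file name.
    """
    min_pct, max_pct = clamp_pct(min_pct), clamp_pct(max_pct)
    if min_pct > max_pct:
        raise ValueError("min_pct must not exceed max_pct")
    settings = {
        "max_perf_pct": str(max_pct),
        "min_perf_pct": str(min_pct),
        "no_turbo": "1" if no_turbo else "0",
    }
    for name, value in settings.items():
        (Path(base) / name).write_text(value)
    return settings
```

For example, `set_pstate_window(20, 80, no_turbo=True)` would keep performance between 20% and 80% on all cores and give up the P1 to P0 turbo range. The `base` parameter exists only so the function can be tried against a scratch directory.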
What a given CPU actually gets depends on what the others requested. You can also control P-states using the thermal cooling device sysfs. There is a sysfs cooling device called Processor, and it has states: 0 to 3 for the P-states and the rest for the T-states. So you can use the thermal cooling device to control P-states. Finally, you can use RAPL, which I will present. It stands for Running Average Power Limit, and the way to control RAPL is the power capping framework, which I will introduce soon.
Let's look at P-state performance. On your left, the graph shows power versus the maximum P-state limit. As you limit the P-states, the power drops. There is a big drop at the start, because that's where turbo is disengaged. Below that the OS is directly controlling the P-state, and the drop is quite linear. On the right, the graph shows the OpenSSL speed test performance with respect to the P-state limit, using the P-state driver to limit power. As you limit power, of course, the performance drops, and the drop is pretty linear. After that it varies, depending on the actual P-state at that time. The limit there is a percentage; the performance is the number of SHA operations. Yes, in the intel_pstate driver you can specify 0 to 100 percent.
Intel introduced a feature called RAPL, first in Sandy Bridge. RAPL provides a power meter capability. Internally the processors already had power metering; RAPL exports that capability. This power metering is not an analog power meter; it is a digital, model-based power meter. It also provides a power limit, which is used by the processor's internal algorithm to limit your P-states.
The power meter and the power limit are both exported outside the processor. The accuracy of this power meter is really comparable to the power measured by an actual power meter, so it gives you a very good ability to limit power. If you can measure power well, you should be able to limit it and change the limit dynamically. This graph shows how the RAPL power meter compares to a physical measurement. What is predicted is from RAPL, and what is actually measured is from the power meter. You can see... yes, that is for each component. They match very closely. So it has a very good power meter, which can be used to limit power too. At the same time it gives you performance feedback: when you limit power it will affect performance, and RAPL has the ability to report that. And since it's implemented so close to the processor, it's very reactive; it takes very little time to do any operation. The interface to RAPL is described in the Intel 64 and IA-32 Architectures Software Developer's Manual, so you can refer to that. The interface is exported via MSRs (model-specific registers) and PCIe config space.
RAPL exports its capabilities as a set of domains. When I say domain, I mean a unit that is capable of RAPL operation. Package, DRAM, CPU, and GPU are the different domains it offers. For each of these domains you can set a power limit or monitor power. There can be a physical relationship between them, because a package can have the CPU and GPU built in, so RAPL respects those relationships when it enforces power. Each domain gives you an energy status, which you can convert into joules using a value from another register. For limiting power it has two ranges, called short term and long term, and these are continually being enhanced. Each power limit is defined by a power value and a time window.
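The conversion from the raw energy status counter to joules mentioned above can be sketched as follows. This is a minimal Python illustration, not the kernel's code. It assumes the unit register layout described in the Intel Software Developer's Manual for MSR_RAPL_POWER_UNIT (power unit in bits 3:0, energy unit in bits 12:8, time unit in bits 19:16, each a negative power of two) and a 32-bit energy counter; check the manual for your processor. The sample raw value 0x000A1003 and the function names are illustrative assumptions.

```python
def decode_rapl_units(raw):
    """Split a raw unit-register value into (watts, joules, seconds) per count.

    Each field n encodes a unit of 1 / 2**n (assumed layout, see lead-in).
    """
    power_unit = 1.0 / (1 << (raw & 0xF))
    energy_unit = 1.0 / (1 << ((raw >> 8) & 0x1F))
    time_unit = 1.0 / (1 << ((raw >> 16) & 0xF))
    return power_unit, energy_unit, time_unit


def average_power(count_start, count_end, seconds, energy_unit, counter_bits=32):
    """Average watts between two energy-status readings.

    The modulo handles a single wraparound of the energy counter.
    """
    delta = (count_end - count_start) % (1 << counter_bits)
    return delta * energy_unit / seconds


if __name__ == "__main__":
    _, joules_per_count, _ = decode_rapl_units(0x000A1003)  # sample value only
    # Two hypothetical counter readings taken 2 seconds apart.
    print(average_power(1_000_000, 1_655_360, 2.0, joules_per_count), "W")
```

Sampling the counter at a fixed interval and differencing it this way is what turns an energy counter into a power meter; the sampling interval has to be short enough that the counter wraps at most once between readings.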
Currently there are two time windows, one short term and one long term. The idea is that for a short time you may want to run your system with a high power limit, but over a longer term you probably want the average to be a little lower, because thermals will take over and you may need to reduce performance. As I said, RAPL gives you very good feedback, so that a user space power agent can decide how effective its power limit was. For example, it can give you an interrupt if your power limits are violated, or you can read counters that tell you how long your system was running at reduced performance because of your constraint. You can take that feedback and dynamically change your limits.
This slide shows two graphs, with a setup very similar to the previous P-state driver slide. On the left is the power limit set through RAPL against the actual power of the system. You can see that the power limit and the actual power vary by a similar amount: as you lower the limit, the power drawn from the socket is reduced by a similar amount. Yes, there is a saturation range. There is an effective range: RAPL cannot go from top to bottom, only within a range. There are registers that tell you the minimum and the maximum, and it cannot go beyond them. Below the minimum it is not effective anymore. On the right you see the performance. Here we compare the performance with the P-state driver against the performance with RAPL at a particular power limit. The red line is RAPL. You can see its performance is much higher than the P-state driver's, because it is not implemented in the OS. It is done directly in the processor, so once you set limits it can adjust dynamically, in a very reactive fashion.
RAPL also performs better because it gives you finer granularity. With P-states, even if you specify 0 to 100%, there are not 100 performance points in a processor. RAPL has a lot more range. One good feature of RAPL is that you can also power limit DRAM, because DRAM can consume 20 to 25% of your power. The way it works is that it throttles the DRAM bandwidth. Based on our experiments, throttling DRAM can be power inefficient, because your CPU may burn extra wattage while memory is throttled. But it is effective. First exhaust the other methods; then you can throttle DRAM as well.
Now let's look at T-states. I am discussing P-states and T-states because they are already used by some of the current higher-level software to control performance. A T-state is a throttle state. It gives you access to write a duty cycle, and the duty cycle lets you control clock modulation. The duty cycle here does not change the duty cycle of the processor clock itself, but of something called the stop clock. For example, at the bottom we have a 25% duty cycle: the clock is supplied to the processor only for that fraction of the time, and for the rest of the time the processor gets no clock. To control T-states you can use the same Processor cooling device. Do you have a question? Yes, TM1. Yes, yes. Sorry? Suspend, no. It is the T-states, basically. This graph shows the performance. As you go into T-states, the performance will be limited; we will see the performance of T-states. You cannot power cap using T-states when you are at very high power, because first of all there is no interface to do it.
The current interface makes you exhaust all the P-states before you can even reach the T-states, because performance is badly affected by T-states. On the right you see the blue line. Its performance is much worse than the other two. But there is an advantage: it can limit power in a range where the others cannot. For example, as was pointed out before, there is a limit to where RAPL can operate; it cannot operate in the low-power area. But T-states can still operate in that area.
Another way to reduce performance and cap power is idle injection. Idle injection is implemented by the powerclamp driver, as a cooling device driver. It was introduced in Linux 3.9. The idea of this driver is that you specify an idle ratio between 0 and 50%, and it makes sure that your CPU is in a deep C-state for at least that much of the time. It starts a very high priority thread on each online CPU and monitors the activity. Based on the requested idle ratio it makes sure the system is idle, and it has a feedback loop, so it continuously monitors and enforces that idle ratio. This slide shows the performance. If you inject more idle, of course, the power drops. Performance-wise, idle injection is not as good as RAPL or the P-state driver. But in the lower power zones it has much better performance than controlling via T-states. In those ranges you might want to use idle injection rather than T-states; T-states tend to be inefficient there.
CPU offline is another method, which is probably used in some places. It migrates the activity on the current CPU to another CPU: all the processes, interrupts, and timers move to the new CPU. In sysfs, under each CPU's directory, there is a file called online.
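Returning to the idle injection driver described above: because it is a cooling device, the idle ratio can be requested through the generic thermal sysfs files. The Python sketch below is a minimal illustration under assumptions the talk does not spell out: that cooling devices live under /sys/class/thermal/cooling_device*, that the powerclamp device reports the type string intel_powerclamp, and that writing a number to cur_state requests that idle percentage, bounded by max_state. The function names are hypothetical, and writing needs root.

```python
from pathlib import Path

# Assumed root of the thermal sysfs class (not named in the talk).
THERMAL_DIR = Path("/sys/class/thermal")


def find_cooling_device(type_name, root=THERMAL_DIR):
    """Return the first cooling device directory whose 'type' matches, or None."""
    for dev in sorted(Path(root).glob("cooling_device*")):
        type_file = dev / "type"
        if type_file.exists() and type_file.read_text().strip() == type_name:
            return dev
    return None


def inject_idle(percent, root=THERMAL_DIR, type_name="intel_powerclamp"):
    """Request an idle ratio, clamped to the device's max_state.

    Returns the state that was actually written.
    """
    dev = find_cooling_device(type_name, root)
    if dev is None:
        raise FileNotFoundError(f"no cooling device of type {type_name!r}")
    max_state = int((dev / "max_state").read_text())
    state = max(0, min(int(percent), max_state))
    (dev / "cur_state").write_text(str(state))
    return state
```

Calling `inject_idle(0)` would stop the injection again. Looking the device up by its type string rather than by number matters because cooling device numbering depends on driver load order.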
If a CPU has an online file in its sysfs directory, that CPU can be offlined. But there are some limitations. You cannot offline all CPUs, CPU 0 in particular. That is implemented now, but you need to turn on a special kernel config option for CPU 0 offline. And when we say CPU offline, it may not mean physically offlining, because physical offline also needs support from the BIOS and some special kernel build flags. This graph shows the performance. For the first two points there is no drop in power. That's because on a hyper-threaded system you have to offline both sibling threads of a core; otherwise you won't see any difference. And, as you said, the CPU numbering may not be linear, so you have to find out which CPUs correspond to the same core in order to offline it. Once you offline a whole core, you see the drop in power. The purple line here is the performance using offlining. Even in the high power range, its performance is worse than any of the others: RAPL, P-state, or idle injection. So wherever you're using offline, you can probably just use idle injection and get better performance. Even in the low power area, 15 watts and below, its performance is very similar to T-states. So I think idle injection may be better than processor offline.
Now let's discuss the ACPI power meter. The ACPI power meter was introduced in ACPI 4.0. It depends on the BIOS, so you may not have that capability. It is used by some of the node managers at this time. It lets you measure power over an interval, and it has trip points: whenever those trip points are crossed, you get a notification. There are also optional power capping parameters you can specify: a power cap min and max.
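The CPU offlining procedure described above (find every hyper-thread sibling of a core, then offline each of them) can be sketched in Python as follows. This is a minimal illustration, assuming the usual sysfs layout of /sys/devices/system/cpu/cpuN/online and cpuN/topology/thread_siblings_list, neither of which the talk names explicitly; the function names are hypothetical. Writing to the real files needs root.

```python
from pathlib import Path

# Assumed root of the per-CPU sysfs directories.
CPU_DIR = Path("/sys/devices/system/cpu")


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0,4' or '2-3' into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def offline_core(cpu, root=CPU_DIR):
    """Offline every hardware thread that shares a core with `cpu`.

    The sibling list is read before anything is offlined. CPUs without an
    'online' file (often CPU 0) cannot be offlined and are skipped.
    Returns the CPU numbers that were offlined.
    """
    siblings_file = Path(root) / f"cpu{cpu}" / "topology" / "thread_siblings_list"
    offlined = []
    for sibling in parse_cpu_list(siblings_file.read_text()):
        online = Path(root) / f"cpu{sibling}" / "online"
        if online.exists():
            online.write_text("0")
            offlined.append(sibling)
    return offlined
```

Reading the sibling list instead of assuming that CPU N and CPU N+1 share a core is the point here, since sibling numbering differs between systems.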
Once you specify a power cap between that min and max, the ACPI power meter will try to power cap to it. But that is also optional. You may have an ACPI power meter just for monitoring, without any power capping capability. It all depends on whether your BIOS has that ability or not.
Then there is the ACPI processor aggregator driver. It's not under user control; your BIOS can do a kind of processor offlining by sending ACPI notifications. It's not really meant for power capping; it is there to solve short-term thermal emergencies. It's called CPU offline and online, but it's not actually doing any offline or online. It does not affect any cpuset; you will still see the CPU. It just puts that CPU in a very deep C-state. Cpuset. Oh, sorry. Yes, that's correct. As you said, either CPU offline or ACPI PAD can be used; they are very similar in performance. I think there was a lot of contention when this driver was being submitted. This graph shows that they are very similar in performance.
Based on all these measurements, this is my recommendation. It is not independently verified by Intel QA or anyone, so this is not a recommendation from Intel. It is mine, based on our results. In the applicable range, you may want to consider the methods in this order: RAPL, P-state, idle injection, CPU offline, and then T-state. In the low-power areas, you may want to use idle injection instead of offlining or T-states. If you have RAPL, P-state, and idle injection, you can pretty much cover the entire range.
We have seen many different ways to do power capping, but currently they are all scattered around sysfs. ACPI PAD is in one place, the ACPI power meter is somewhere else, and there are different ways to do idle injection.
We want a consolidated way, so that all power capping devices look similar. In Linux you have power supplies, regulators, and many other classes. All regulators, for example, behave the same way no matter how they are implemented, and look very similar from the user space perspective. We want to introduce such a power capping framework to Linux. It lets you set power limits, and it can also read the power of a system. We define a client driver API, and as a class driver it avoids a lot of code duplication. This link points to the RFC; if you have comments, you can look at it for more information.
It's very simple. There is a class driver, and it provides registration and callbacks. The majority of the interface work is done in the class driver, while the actual hardware interface is implemented in the client drivers. So from the user space perspective, no matter what method you use, it looks similar. The way we export this interface is that, at the top level, we have something called control types. A control type is a method used to do power capping. You can have a control type for Intel RAPL, a control type for powerclamp, or many others. For example, tomorrow you may have a wireless device that has its own control type. The control type acts as a container for all the power zones in it. Each power zone is an independent unit that can measure and enforce power, and zones can have a parent-child relationship. Let's take Intel RAPL as an example. You can have intel-rapl:0 and intel-rapl:1. In this example we used a two-socket server system.
So there are two CPU packages here, intel-rapl:0 and intel-rapl:1. Each package can contain multiple cores and a GPU, and each of these can have its own power zone, because you can monitor and enforce power there too. So intel-rapl:0:1 is a child of intel-rapl:0. With a hierarchy, at a given level you can see how the power propagates. The hierarchy is based not on a logical relationship but on a physical one. For example, if you power cap a package, it will affect all the cores in it. The relationship should be based on that; it's one of the conditions for being parent and child. Each power zone can optionally have energy measurement and a range. Either you measure energy, or, if you have a built-in power measurement interface, you can report power directly and have a sysfs file giving you power. The limits are expressed through something called constraints. You can have up to 10 constraints, so you can define different constraints with different time windows. For each constraint you define a time window and a power limit, and you can give the constraint a name. The system will act on those constraints based on the time window. You can enable or disable each power zone independently, or the whole control type at once.
So what are the basic takeaways? We prefer the RAPL driver for power capping in the P0 to Pn range. Whenever you're using CPU offline or T-states, you probably want to consider idle injection instead; idle injection can give you better performance with the same results. And as I said, we have published an RFC.
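To make the control type, power zone, and constraint hierarchy described above concrete, here is a minimal Python sketch that walks such a tree and reads or sets constraints. The file names (name, constraint_N_name, constraint_N_power_limit_uw, constraint_N_time_window_us) and the /sys/class/powercap location follow the framework as it appears in mainline kernels; the RFC discussed in the talk may have differed, so treat them as assumptions. The function names are hypothetical.

```python
from pathlib import Path


def _read_int(path):
    """Read an integer sysfs attribute, or None if the file is absent."""
    return int(path.read_text()) if path.exists() else None


def read_constraints(zone):
    """Return the named constraints (power limit, time window) of one zone."""
    zone = Path(zone)
    constraints = []
    for name_file in sorted(zone.glob("constraint_*_name")):
        prefix = name_file.name[: -len("_name")]
        constraints.append({
            "name": name_file.read_text().strip(),
            "power_limit_uw": _read_int(zone / f"{prefix}_power_limit_uw"),
            "time_window_us": _read_int(zone / f"{prefix}_time_window_us"),
        })
    return constraints


def walk_zones(parent, depth=0):
    """Yield (depth, path) for each power zone under `parent`, parents first.

    Symlinks are skipped so that back-references inside sysfs cannot loop.
    """
    for child in sorted(Path(parent).iterdir()):
        if child.is_symlink() or not child.is_dir():
            continue
        if (child / "name").exists():
            yield depth, child
            yield from walk_zones(child, depth + 1)


def set_power_limit(zone, index, watts):
    """Write a power limit in watts to constraint `index` (stored in microwatts)."""
    limit_file = Path(zone) / f"constraint_{index}_power_limit_uw"
    limit_file.write_text(str(round(watts * 1_000_000)))


if __name__ == "__main__":
    root = Path("/sys/class/powercap/intel-rapl")  # assumed control type path
    if root.is_dir():
        for depth, zone in walk_zones(root):
            label = (zone / "name").read_text().strip()
            print("  " * depth + f"{zone.name} ({label}): {read_constraints(zone)}")
```

Because every control type exposes the same files, the same walk would work unchanged for a control type other than RAPL, which is the uniformity the framework is aiming for.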
Our aim is to improve the framework so that it can be used across all processors and all kinds of devices, like wireless or multimedia. So if you have suggestions, please comment on the RFC. I think that concludes my slides, so this is the time for questions and answers. If I cannot answer, you can send me an email and I'll try to get an answer from other experts.
Yes, PowerTOP is in user space. PowerTOP can use this interface. Once this interface is published in the upstream kernel, we'll modify PowerTOP so that it can use it. But this is not upstream yet, so we are not changing PowerTOP yet. Kernel code. The CPU just gives you some ranges, and you should be able to control only within those ranges. Like there's a max... Sorry? Yes, yes. Yeah, CPU, not kernel. Yeah, I think it should be doable, but there has to be some event, and on that event some user space software should change the limits based on the state of your... But it's doable. It cannot be done by the kernel; it's somewhere in user space, on lid close or something. In this interface? For now, we are first trying to do the others. Yeah, but we can do that. But it will be step by step. Initially it will start with RAPL, then we'll probably migrate the others. Yeah, it will, but... No questions? Then there are no more questions.