Hello everyone, my name is Vitaly Wool, and this talk is about the energy-aware scheduler, the things related to its development, and its current status. I'll try to provide an unbiased look at that very scheduler. We'll start with the introduction, then pass over to EAS as such, as it began. Then we'll talk about the main rival of EAS, the Qualcomm HMP scheduler. Then we'll do some comparisons, figure out what the way forward was, and wrap up. So nothing out of the ordinary, but still, I thought it would be nice to have some kind of summary slide at the beginning.

I'm representing Konsulko Group, a services company specializing in embedded Linux and open source software, doing hardware and software build, design and development, and also training services. It works out of San Jose with an engineering presence worldwide. I think... Matt, do we have a presence in Antarctica? Not yet. So "worldwide" is a little bit of an exaggeration; we're not covering Antarctica yet. I also happen to mentor a group of postgraduate students under the name of Interstate Labs, and that happens mostly in St. Petersburg, Russia. Just for the record, these guys are not the guys who are tampering with the U.S. elections. Okay.

So we're passing over to the more important part of the introduction, and that is where the need for EAS, the energy-aware scheduler, came from. To be specific about that, we need to consider what's there in the kernel right now. What's prevailing, what's used by pretty much any appliance that has the Linux kernel inside, is called the Completely Fair Scheduler, whose main idea is to maintain balance, or in CFS terms fairness, in providing processor time to tasks.
It maintains the amount of time provided to a given task to determine if balancing is needed, and the main structure used in CFS is a time-ordered red-black tree, which guarantees high responsiveness and performance. This is basically why, when the decision was about to be made whether to implement a scheduler from scratch for the needs of big.LITTLE systems or to do something on top of CFS, the decision was made in favor of CFS. So this is an important detail: CFS is a good scheduler. CFS sorts tasks in ascending order in that very red-black tree, and the left-most task of the red-black tree is picked when the decision for the next task to put on the CPU is made. That's because the left-most task has the least spent execution time, and that is the very task that needs fairness the most.

What we also need to mention about CFS operation principles is that CFS applies fairness not only to tasks but also to CPU cores. It considers all CPUs to be the same, which works very well on SMP systems, but in some more complicated cases, which we will be covering, it doesn't.

Okay, what are those cases? The main case we'll be concentrating on is the big.LITTLE architecture, a heterogeneous processor architecture which uses two types of cores combined into clusters. The little cores of the so-called little cluster, or silver cluster, are designed for maximum power efficiency, and the big cores of the big cluster (golden cluster in Qualcomm terminology) should provide maximum computing power. Each task in the big.LITTLE architecture may be scheduled for execution either on a big or a little core; there's no limitation to that, and the aim is high peak performance with low mean power. So the big.LITTLE architecture targets high peak performance with low mean power, which is specific to battery-operated mobile devices, well, primarily Android devices.
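As a quick aside, the CFS pick-next rule described a moment ago can be sketched in a few lines. This is a toy model with names of my own choosing; the kernel keeps a red-black tree keyed on virtual runtime and takes its left-most node, while here a plain `min()` over a list gives the same answer.

```python
# Toy model of the CFS pick-next rule: tasks are ordered by accumulated
# virtual runtime, and the one that has run the least is picked next.
# The kernel takes the left-most node of a time-ordered red-black tree;
# a plain min() over a list gives the same answer here.

class Task:
    def __init__(self, name, vruntime=0.0):
        self.name = name
        self.vruntime = vruntime  # execution time spent so far

def pick_next(runqueue):
    # equivalent of taking the left-most node of the red-black tree
    return min(runqueue, key=lambda t: t.vruntime)

rq = [Task("A", 30.0), Task("B", 10.0), Task("C", 20.0)]
nxt = pick_next(rq)    # task B has the least spent time, so it runs next
nxt.vruntime += 15.0   # after running a slice, B's vruntime grows past C's
```

After B runs and its vruntime grows past C's, the next pick would be C, so fairness rotates naturally.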
So, big.LITTLE in a nutshell: the key, once again, is task placement. This is the key to high peak performance with low mean power, because a wrong distribution of tasks between the cores will most likely kill the big.LITTLE advantages. With that said, big.LITTLE puts high requirements on the scheduler, because the scheduler should be aware that there are two types of cores, and the cores are different: the power consumption of the cores is different, the performance of the cores is different. So the scheduler should be energy-aware, and it should communicate with the dynamic voltage and frequency scaling subsystem. Also, scheduling is in fact always a bit of a crystal-ball type of operation, because you need to make a decision based on the task's future activity, right? But in this particular case it's even more so, because you need to account both for the task's demands and for the possible power impact. Any questions so far? Okay.

So, a scheduler for big.LITTLE: can we use CFS? Well, CFS is a good scheduler, but it's not really a good fit for the big.LITTLE architecture, because it's not energy-aware; it cannot distinguish between big cores and little cores. So once again, we want to use CFS because it's good, but we can't use it directly because it lacks some important knowledge to make the right decisions. The idea was to extend CFS to be applicable to non-SMP, well, first of all big.LITTLE, architectures. This work dates back to 2013, and there were two main competing implementations. The first came from Qualcomm's CodeAurora, and the other from ARM and Linaro; the latter one is called EAS, and we'll concentrate on that one. So let's pass over to EAS in detail.
The basic principles of EAS are that we need to schedule tasks considering energy implications, and the decision should be made based on both topology and power management features. That of course implies workload calculation, and workload calculation within EAS is implemented independently, or mostly independently; in fact it reuses the calculations that had already been there for CFS, called PELT, per-entity load tracking. That's a scheduling feature that is already in mainline; it's been there since 3.8. The main idea of PELT is that a process can actually contribute to load even if it's not running at the moment. The load is calculated based on a geometric series with a decay factor: the total load is composed of the load taken from the last sample, plus the decayed load from the previous sample, plus the doubly decayed load from the sample before that, and so on and so forth, as you can see in the formula: L = u0 + Q*u1 + Q^2*u2 + ..., where Q is the decay factor. PELT itself uses a very soft decay factor, so that a load contribution halves after 32 milliseconds. This choice of Q has never been obvious, but EAS just followed what was there in mainline, and as you will see later in the slides, that didn't work out all that well.

Then we have a very nice picture of how EAS together with PELT operates. If we are to schedule a task on a big.LITTLE system, we pick the core with sufficient spare capacity and the smallest energy impact. If we look into this particular picture, we can see that both little core number two and big core number two can handle the scheduled task without raising the operating frequency, which is represented by the dashed line.
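To make that concrete, here is a small sketch of PELT-style decay feeding an EAS-style pick. Everything here is illustrative: the capacity, utilization and energy-cost numbers are made up, the function names are my own, and the kernel's real implementation uses fixed-point arithmetic and energy-model tables rather than this toy float model.

```python
# Toy model: PELT-style geometric decay for load tracking, plus an
# EAS-style "sufficient spare capacity, smallest energy impact" pick.
# All numbers and names are illustrative, not the kernel's.

Q = 0.5 ** (1 / 32)   # decay factor chosen so a contribution halves after 32 periods

def track(prev_total, sample):
    # total = sample + Q*u1 + Q^2*u2 + ...  ==  sample + Q * prev_total
    return sample + Q * prev_total

util = 0.0
for _ in range(1000):
    util = track(util, 1.0)   # steady activity converges toward 1 / (1 - Q)

# Candidate CPUs: capacity at the current operating point, tracked
# utilization, and a relative energy cost (big cores are pricier).
cpus = [
    {"name": "little1", "cap": 430,  "util": 400, "cost": 1.0},
    {"name": "little2", "cap": 430,  "util": 150, "cost": 1.0},
    {"name": "big1",    "cap": 1024, "util": 900, "cost": 3.0},
    {"name": "big2",    "cap": 1024, "util": 200, "cost": 3.0},
]

def place(task_util):
    """Pick a CPU that can absorb the task without raising its
    frequency, preferring the smallest energy impact."""
    fits = [c for c in cpus if c["cap"] - c["util"] >= task_util]
    if not fits:
        return None   # every candidate would need a frequency raise
    return min(fits, key=lambda c: c["cost"])

choice = place(200)   # fits both little2 and big2; little2 costs less
```

With these numbers a task of size 200 fits both little2 and big2, and, unlike plain CFS, which would treat the two as equal, the energy-aware pick lands on little2.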
At the same time, we do know that big cores are more power hungry, so if the task actually fits into one of the smaller cores, it will go onto the smaller core, as opposed to CFS in this case, because CFS could equally well schedule this task on little core number two or big core number two; for CFS they would be equal.

Okay, now let's take a quick look at what was happening in San Diego with Qualcomm and their scheduler. The Qualcomm HMP scheduler, which is deciphered as Qualcomm Heterogeneous MultiProcessing scheduler (even though there are alternative readings, like Qualcomm High-Maintenance Parachute), operates on similar principles, but there are some significant differences from Linaro's EAS, and we're going to concentrate on these differences. In the Qualcomm HMP scheduler, tasks are divided into groups by importance, so there are less important and more important tasks, and by size. When we speak about the size of a task, we need to define that too, right? Size is a measure of the load this task produces: a task may be big when the load is big, little when the produced load is little, or "other" when it doesn't exactly fall into either of the two previous categories. The thresholds for defining which tasks are big, which are little, and which are neither are parameterized, so there is a huge possibility of customizing QHMP depending on the system you're running it on.

Scheduling a task should obviously depend on its properties, so again we need to define task size in a precise way, and that's done based on the task demand calculation. Task demand is calculated based on the formula that you can see: it's delta time, the time the task has been running on a core in a period of time, multiplied by the current frequency of that core, divided by the maximum frequency across all cores. So we do account for differences in maximum frequency between the golden cluster and the silver cluster, just in case. Then, to be more stable, we calculate task demand over several sliding windows; the number of windows is also a parameter, called n, and in most implementations we set n equal to five. Then we either calculate the average demand, or we take the maximum demand, or we do some kind of combination. Some testing results showed that the best option is to calculate demand as the maximum of the most recent demand in the series and the average calculated over all the samples.

As I said, we already account for the difference in maximum frequency between the big cluster and the little cluster, but we also need to account for the higher performance of big cores even when they are operating at the same frequency. So we add a coefficient called max possible efficiency and scale demand according to it. Once again, this is a parameter, and it's usually set to two: we usually consider a big core to be twice as efficient as a small core at the same frequency.

So now we have a pretty good understanding of what big and small tasks in QHMP actually are. A small task would be a periodic task with a short execution time, and a big task is a task producing high CPU load, "high CPU load" also being a parameter, usually meaning around 90 percent. In Android there are some background threads that could become big tasks if we only took CPU load into account, but we don't want that, because we don't want to schedule background tasks on big cores; so the importance of the task should also be accounted for when we consider it big. Once again, it is important to note that some tasks are neither big nor small, so we say they're "other", and tasks can change their size over time: small tasks may become other, big tasks may become other. Basically, small tasks shouldn't become big and big tasks shouldn't become small, and if that happens in your system, the thresholds were probably selected in the wrong way.

Also, when we're speaking about QHMP, well, the high-maintenance parachute, we need to understand that it's tightly coupled with the cpufreq governor called interactive, which has never made its way into the mainline Linux kernel, so it's basically an out-of-tree governor, and on top of that it was heavily patched by Qualcomm's CodeAurora to communicate with the scheduler in the best way, well, in the way they considered best. This created a pile of code that is not maintainable and not mainlineable, although all for good reasons. So QHMP is tightly coupled with a heavily patched governor; it says "performance" in the slides, but that's not correct, it should be "interactive", sorry about that.

Okay, now we're passing over to the fun part: comparisons. The first comparison was to measure power consumption for YouTube playback, and as you can see on the slide, there's a clear win for EAS: EAS is almost 20 percent better in power consumption than QHMP for YouTube playback. The other tests showed roughly the same power consumption. But then we did another test, for frame drops, and it turned out that when there's a bursty load, EAS doesn't really work that well. For instance, if we take Chrome scrolling, which is represented on the graph by the first two columns, we can see that EAS is more than twice as bad as QHMP in the number of frame drops; so the quality of service is actually degraded if we take EAS.

Okay, here comes the grumpy cat, and we're really upset together with the grumpy cat, because we don't know what to pick. On one hand, EAS works best with a steady load, showing excellent power consumption results and acceptable quality of service; but when it comes to bursty load, EAS doesn't cope that well, and QHMP seems to behave a lot better. So what to pick? Let's try to summarize. QHMP has a strong focus on performance, while EAS together with PELT is more focused on power conservation. QHMP is complex, out-of-tree, has obfuscated code, very flexible but not really maintainable
and it could never be mainlined. Also, it's worth noting that with the flexibility of having many parameters to tune comes the issue of combinations that have never been tested; I think we got some feedback that around 90 percent of the combinations have never been tested, so that doesn't look too good. And EAS is looking good, but it's not delivering the quality of service, right? So once again: what to pick?

Yes, that's a very good question. You mean the decay coefficient, the Q? Right. If you don't mind, I'll postpone this question, because I believe the answer is given later in the slides; if it isn't, you're very welcome to ask again. Okay, any other questions so far? Ben? Well, we measure power... well, it's not exactly power consumption, we measure average current, so the result shows the average current over the time of the YouTube playback. It probably should have been said explicitly, but we measured using an external battery which maintains a stable voltage of 4.0 volts, and with that said, you can use simple multiplication to produce the power result. So yes, this probably had to be said: the result is in milliamps because we measure average current, but it's easily convertible into real power consumption because the voltage is constant. Any other questions so far? Okay.

So the way forward for EAS was to move forward with EAS, because it's maintainable, the code was a lot easier to learn, and once again EAS stood a chance of mainlining while QHMP didn't. But what could we do to improve it, to show better results with regard to quality of service? The answer was: we can use the task demand calculation from QHMP; we can take it out of QHMP and use it as a separate module replacing PELT. So here comes WALT, the rival to PELT, which is Window-Assisted Load Tracking, and in fact it implements the windowed demand calculation which was previously implemented within QHMP.

On the next slide you can see the block scheme of updating the task's average demand, depicted in more detail, where delta is the time the task has been running on the core within a window. (I think the "yes" and "no" here are mixed up, sorry about that, but at least we can verify it now.) So, if this is a new window, then we update the history: we drop the oldest sample, we shift all the other samples by one, and then we add the new sample with the update-history function. If this is not a new window, that is, if we're executing within a window we've already taken some samples in, then we update the average. Thus we get the runnable-sum variable, and that's the result we use for the estimation of CPU utilization. It's important to mention that we use samples obtained from the last window, not the current one, because for the current window we don't have all the results yet.

What else? WALT is also tightly coupled with cpufreq: it provides data to cpufreq about CPU utilization, and it also notifies the cpufreq governor about inter-cluster migrations, because a cpufreq governor operates only on a single cluster. There are two instances of each governor, whatever governor it is, one operating on the big cluster and one on the little one, and they are basically agnostic of each other; so we need to notify the cpufreq governor if we're transferring a task from one cluster to the other.

Then we have a picture of CPU load tracking compared between WALT and PELT, and they really look quite similar, especially the upper ones; but if we magnify strongly, we can see that PELT is actually ramping up slower and decaying slower. There were also some initiatives to change the decay factor, which were not accepted all that well by the community, so PELT was considered to be obsolete, basically. And that's also the result interpretation: now we have a happy animal, because now we've gotten all the way to something that works equally well power-consumption-wise and performance-wise, quality-of-service-wise, because WALT ramps up and down faster. The fact that it ramps down faster is quite important for power consumption, and the fact that it ramps up faster is important for QoS. Still, given the possible spikes due to the less stable operation of WALT, we were concerned about a possible increase in power consumption, but there were tests done showing that there is no actual huge power consumption impact, because we don't have the need for the frequency boosting which we had with EAS.

Okay, the wrap-up. There is a nice summary of PELT and WALT put together in a table, but I don't really feel the need to read it; it's there for your convenience, and for mine. If you want to ask questions about it, you're very welcome; otherwise I'll just pass over to the current status of EAS. The current status is that WALT has become the first and main choice for it, due to the reasons we've just talked about, first of all the better QoS, but also that there is no power degradation. EAS with WALT is effectively EAS plus the task accounting from QHMP; so whatever bad words I said about QHMP earlier, actually a huge pile of code from QHMP turned out to be useful and important. And it's always a good thing when two competing implementations converge and the resulting implementation takes the best of both worlds; that's how open source should work, that's how collaboration should work, and that's just a good thing. There are still some small deficiencies that we believe are there in the current EAS/WALT implementation, and, as a funnier thing, those are deficiencies that were not there in QHMP. For instance, there is no notion of
big and small tasks in EAS, and sometimes that leads to suboptimal results, for instance with task packing. As you know, due to its algorithms EAS wouldn't pack a task if that would mean raising the CPU frequency; but on the other hand, as shown on the graph to the left, the power consumption may be bigger if we have two cores operating at a lower frequency than a single core at a slightly higher frequency. That changes as the frequencies rise, but at low frequencies it is usually the case. So sometimes it is actually better to pack tasks, even though packing means raising the frequency for the CPU that the task ends up on, as opposed to having two CPUs running at the same time. For instance, in the picture to the right we can see that the second small CPU is basically unused, so it's better to switch it off and put the task being scheduled on the first CPU, provided that the frequencies are low enough. But we cannot do that in EAS, because there is no such algorithmic code in EAS, while we could do it with QHMP: if the task is small, it would have been placed like that. Oops. Conclusions.

Well, we've been speaking about big.LITTLE and EAS. EAS is the primary choice for big.LITTLE architectures, and the big.LITTLE architecture puts high demands on the system software, especially on the scheduler and also on the dynamic voltage and frequency scaling implementation. Given those high demands, it's hard to be perfect, right? EAS is not perfect, but it is still the way to go, and this, I believe, is the unbiased view on EAS.

Okay, as the last thing I would like to thank the people who are not here but who were helping me make this presentation: Vlad Resky, with whom I'm working in Sweden on multiple projects related to Android power and performance optimization; Anton Ogarov, one of the students I was talking about at the beginning of my talk; Tanya Nikludova, who made the pictures of the happy and not-so-happy animals that I used in my slides; and my wife, who is also an inspiration for me and who showed a lot of patience while I was preparing the slides. And I would like to thank you all for your attention. This presentation is over, and you're very welcome to ask any questions you have.

Yes, please? Well, a preference to keep a task on a core to keep the cache hot is definitely a good thing, and it's implemented, I think, in the latest EAS anyway; I'm not entirely sure if it was there at the time we tested EAS/WALT versus QHMP. As far as I can tell, the main problem for the frame drops was different: the CPU frequency wasn't ramping up very quickly. I mean, we need to raise the frequency if the load goes beyond a certain threshold, and the data that PELT was supplying to the DVFS was lagging in time, so we had bursty loads but the CPU frequency wasn't raised in a timely manner. Does this answer the question?

That depends on the architecture, and it depends on the operating points we're switching from and to. Usually, if we need to raise the voltage, it takes a significant while, well, significant in terms of microseconds; if we don't have to raise the voltage to jump from one OPP to another, it's usually quite fast.

Yes, please? Well, the energy costs are part of EAS, and that was in one of the first slides, when we were talking about putting a task either on one of the available big cores or one of the available little cores. Speaking about shutting down or ramping up a certain CPU, that's a thing related to CPUidle, which is another mechanism in the kernel, and I believe this goes a little bit beyond the goal of this presentation. This is a complicated, well, not that complicated, but a separate thing, and we can discuss it later, because otherwise I believe it will take some minutes to cover.

That's a good question, but I believe it needs some kind of extra definition: are you talking about an HMP system that is running the basic CFS scheduler, or are you talking about EAS running on an SMP, not HMP, system? Well, it depends: it depends on the load, on the type of operation. If we take the scrolling-in-Chrome type of thing, where EAS was lagging quality-of-service-wise compared to QHMP, then there's going to be a huge difference in power consumption to go all the way up to the same quality of service if we just use CFS; for other loads there can be a smaller difference. So it's about energy awareness, right? Any other questions? Yes, please?

You mean it's possible to configure QHMP so that it's more power-conserving and less performance-oriented, right? Yes, that's true. Of course, EAS also has some parameters, even if not in such a large number as QHMP, but to go all the way up to the quality of service that QHMP was able to provide, we needed to turn on boosting for the bursty loads, which is also a parameter. But if we turn on boosting, then the whole advantage of EAS in terms of power consumption goes away, and in fact in most cases it becomes inferior to QHMP in terms of power consumption, because we basically just turn all the frequencies up, just because we do not rely on getting the information about CPU load in a timely manner. We mitigate that by turning the frequencies up if we see that Chrome is launching, right? So that is basically a method, but it's more of a workaround.

Yes, please? That's a very interesting thing to try out, but we were really concentrating on embedded stuff. Any other questions? Yes, please? I need to double-check; I don't think it made much difference. I don't think so, no. Any other questions? Well, I guess not. Well, thanks again for your attention.