My name is Pantelis Antoniou. We are going to look at what asymmetric processing is about, and in particular at the scheduling problem it creates with the big.LITTLE architecture. I don't know if you have looked at it, but the Linux scheduler is not in very good shape for this kind of system, so we will see how we might fix it, or at least how we can work with it, because it was never designed with this in mind and all the big.LITTLE systems that are coming now will have to deal with it.

Asymmetric processing is nothing new; it has been around for a long time. The important point is that you have several processors in the system, and with classic asymmetric processing the processors, the instruction sets and the microarchitectures differ a lot. Think of a CPU plus DSP system, the PS3 Cell, or the PRUs on the AM335x. What is new with big.LITTLE is that the cores share the same instruction set. What ARM did was pair the Cortex-A7, a small power-efficient core, with the Cortex-A15, a high-performance core. They can all run the same code: the same user space runs on either cluster, but the microarchitectures are different. There are a few trivial differences visible to software, like the L1 instruction cache line size, or the performance monitoring unit having different counters, but most people don't care about that; for practical purposes they look the same.

I copied this diagram from one of ARM's white papers, and what it says is: when you have low workloads, just run them on the A7; when you have high workloads, run them on the A15. But it's not as easy as it appears. As a very rough overview, a single A15 core has about two times the performance of the A7, and the A7 core is about three times more power efficient than the A15. That is a very rough approximation, because different workloads behave differently; NEON tasks, for example, run much faster on the A15 than on the A7, but the scheduler just thinks everything is the same, and that's not good for us.

This is a typical big.LITTLE system, the same one that was shown in the big.LITTLE switcher talk: you have clusters of A15 and A7 cores, you have the coherent interconnect, and the rest of the stuff like the interrupt controller, which we don't care about much. The thing to notice is that in most implementations the clusters are powered and clocked as a unit, so it's not like you can have each A15 or A7 core run at its own frequency; you have to clock them together and power them down together. So we have the problem that software should reflect the hardware architecture underneath it, and we need to find a way to measure the work done in a unit of time. Now, with symmetric processing, or a single-core processor, measuring the work done is okay.
It's easy, because you just measure the time between its scheduling periods, so you can easily figure out how many, let's say, MIPS the processor was executing. When you actually try to measure the work done on an asymmetric system it's much more difficult, because you might have cache size effects, I/O bandwidth differences, and the PMUs, the performance monitoring units, are not standard and have a high overhead, so normally you don't use them as an input to the scheduler. But when you don't care about power at all, the work done in a unit of time is directly proportional to the unit of time. That means you can measure the time a task takes to execute, so you know how many MIPS that took, and essentially you can figure out the power consumption of that task.

To make things simpler we can try a very rough power scheduling example, and it's a bogo example, because real systems are very complicated and you cannot figure things out easily on a real example. Let's say you have a very simple system with one Cortex-A15, which can execute two bogoMIPS per second and consumes three bogowatts, and one Cortex-A7, which executes one bogoMIPS per second and consumes one bogowatt. The workload we're going to base our example on is: run two tasks, in parallel if possible, where the first one takes 16 bogoMIPS of work and the second 20. Of course a real system is much more complicated, because many other factors affect you; you could have tasks that are memory bound, I/O bound, hit the cache more, whatever.

So let's try to come up with the most power-efficient policy. If we don't care about performance, we can just say: run everything on the A7. If we do that, we'll find that both tasks finish in 36 seconds and require just 36 bogowatts. Of course that's not something you would normally want, because it means the fast CPU is just standing there doing nothing; you would only use that policy if your battery is dying. Now let's see what happens when we try to run as fast as possible. That means we try to fill up all the capacity of the A15, and the A7 should only be used when the A15 is not available. If you pack them together like that, it takes only 13 seconds and the power is 49 bogowatts. This is the fastest policy; it's what you want when you're connected to mains power or running some kind of high-performance workload, but you wouldn't want to use it on your phone or your tablet. (A small sketch of this calculation follows below.)

If you just chart those two policies, you get one axis which is the power and one axis which is the time taken, and ideally you would like a system policy that places your workload somewhere on the line connecting those two points. You could say there is a point where your optimal system policy would be, and that point depends on things like the workload you're running, whether you're on Android sitting on the home screen or running an application, and so on, so you could vary the operating point along that curve to find your optimal balance of performance and power. The problem is that all this is ideal: there is no way to reliably know beforehand the amount of time a task will take until it gets to the point where it doesn't need the CPU anymore. So what we can do is use the history of the task's execution as an input to our scheduling policy.
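To make the numbers above concrete, here is a minimal sketch that just redoes the arithmetic of the two policies. The bogoMIPS and bogowatt figures are the hypothetical ones from the example; the structure, the names and the migration step are purely illustrative, not part of any real scheduler.

```c
/*
 * Rough sketch of the "bogo" power/performance example from the talk.
 * The capacities and power numbers are the hypothetical figures used
 * above (A15: 2 bogoMIPS/s at 3 bogowatts, A7: 1 bogoMIPS/s at
 * 1 bogowatt); everything else is made up for illustration.
 */
#include <stdio.h>

#define A15_MIPS	2.0	/* bogoMIPS per second */
#define A15_WATTS	3.0	/* bogowatts while busy */
#define A7_MIPS		1.0
#define A7_WATTS	1.0

int main(void)
{
	double task1 = 16.0, task2 = 20.0;	/* work in bogoMIPS */

	/* Policy 1: everything on the A7, tasks run back to back. */
	double t_slow = (task1 + task2) / A7_MIPS;
	double e_slow = t_slow * A7_WATTS;

	/*
	 * Policy 2: run as fast as possible.  Put the bigger task on the
	 * A15 and the smaller one on the A7; when the A15 frees up,
	 * migrate the remainder of the small task to it.
	 */
	double t_a15_busy = task2 / A15_MIPS;			/* 10 s */
	double remaining  = task1 - t_a15_busy * A7_MIPS;	/* 6 bogoMIPS */
	double t_fast     = t_a15_busy + remaining / A15_MIPS;	/* 13 s */
	double e_fast     = t_fast * A15_WATTS + t_a15_busy * A7_WATTS;

	printf("A7 only : %.0f s, %.0f bogowatts\n", t_slow, e_slow);
	printf("fastest : %.0f s, %.0f bogowatts\n", t_fast, e_fast);
	return 0;
}
```

Run as written, it prints 36 s and 36 bogowatts for the A7-only policy and 13 s and 49 bogowatts for the fastest one, matching the figures above.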
The easiest way to do that is to try to find the average load of the task; the average task load could be just the amount of time the task runs within a one-second period, or something like that. Now, we can measure the CPU load of a task, but we don't have any way to measure power; if we have an idea of the relative power efficiency of each core, though, we can assign a number according to the amount of time the task takes. The thing is, the scheduler does not track MIPS, power, or power efficiency, so we have to use a different method to get the scheduler to do our work.

So let's look at the Linux scheduler as it is now. I don't know how familiar you are with it, but it's much better than it used to be: it doesn't have any weird tunables, it seems to work, and it uses a comparatively simple algorithm. The idea behind it is that you have a virtual processor, and you assign a slice of the real processor to it. The conceptual model is simple: if you have three tasks to run, you should give each task one third of the CPU time. Of course you have priorities and all those things, but they are all calculated by the math; you don't have some magic value that says do this if the task is interactive, or do that if the task is I/O bound or CPU bound. The thing is, the scheduler has nothing to do with the power management facilities of the kernel; it assumes the power policy is something completely separate. You have cpufreq, you have thermal management, whatever you need, but there's no communication between those and the scheduler. When you start thinking about power, that's pretty bad, because it means your system is reactive, not proactive: it tries to do something after you've crossed some threshold and then tries to find balance again. But it is simple, and it works.

So we need a way to track the load history of a task, and that's what the per-entity load average tracking patches do. It's a pretty recent development by Paul Turner, and it tracks load per scheduling entity, which for our purposes is a task, instead of per runqueue. That means the load is calculated per task and then summed up to produce the load of the runqueue, instead of being calculated on the runqueue directly. It is pretty accurate, it is not prone to artifacts depending on where the sampling of the load calculation happens to land, and it is the basis for all the work on asymmetric processing, including the Linaro HMP patches. Since ARM uses big.LITTLE and they want to sell a lot of chips, they have to do something about this situation, so there is a lot of activity at Linaro to teach the Linux scheduler all the things that big.LITTLE needs. They use the per-entity load tracking patches to assign tasks, and they need more than that, so they have topology patches that assign the relative power of each CPU to a scheduler-internal structure, so that you can deduce how powerful each CPU is, comparatively. I don't know if you've looked in the scheduler, but there is a maximum weight of about 1024 assigned to each CPU, and what they do is scale that according to the relative performance of the processor. That means that when you use those values in your load average calculations, they are stable in time: when you calculate the load average of a task, you have to take into account where the task runs, because if it runs on the fast CPU its load is different than if it runs on the slow CPU, so when a migration happens you need to make sure the history of that task is invariant with respect to where it has been running, and this scaling helps with that.
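Here is a minimal sketch of that scaling idea, assuming a made-up relative capacity of 1024 for the A15 and 512 for the A7. It is not the actual kernel code (the real per-entity load tracking uses decaying fixed-point sums); it only shows why scaling a task's runnable time by the capacity of the CPU it ran on keeps its history comparable after a migration.

```c
/*
 * Minimal sketch of the scaling idea behind the topology/HMP patches,
 * not the actual kernel code: a task's tracked load contribution is
 * scaled by the relative capacity of the CPU it ran on (out of a
 * nominal 1024), so the history stays comparable across migrations.
 * Names and numbers are illustrative only.
 */
#include <stdio.h>

#define FULL_CAPACITY	1024	/* nominal "full speed" CPU */

struct cpu_info {
	const char   *name;
	unsigned int  capacity;		/* relative to FULL_CAPACITY */
};

/* Hypothetical relative capacities: A15 at full scale, A7 at half. */
static const struct cpu_info a15 = { "A15", 1024 };
static const struct cpu_info a7  = { "A7",   512 };

/* Scale the time a task was runnable by the capacity of its CPU. */
static unsigned long scaled_contrib(unsigned long runnable_us,
				    const struct cpu_info *cpu)
{
	return runnable_us * cpu->capacity / FULL_CAPACITY;
}

int main(void)
{
	/* The same 10 ms of runnable time counts for less on the A7. */
	printf("10ms on %s -> %lu\n", a15.name, scaled_contrib(10000, &a15));
	printf("10ms on %s -> %lu\n", a7.name,  scaled_contrib(10000, &a7));
	return 0;
}
```

With these numbers, ten milliseconds of runnable time on the A7 contributes half as much load as the same ten milliseconds on the A15, which is exactly what keeps the metric valid when the task moves between the clusters.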
The thing is, it's a little bit invasive: there are lots of ifdefs in the scheduler source code, and it's a little big.LITTLE specific, because they have the concept of a down domain and an up domain, where the down domain is the domain of the little CPUs and the up domain is the domain of the big CPUs. They track the load of each task; when it crosses a threshold the task is migrated to the up domain, and when it falls below that threshold it's migrated back to the down domain (a rough sketch of this threshold logic is shown a bit further below). There are a few problems with this approach. The first is that you are dependent on the latency of the load average tracking: if you calculate your load average over a very small period, a task that runs for a very short time gets characterized as wanting the big CPU and is migrated immediately to the up domain, which is not really what you want; you don't get a slower-moving tracking of that value. So one of the first changes they made was a way to stretch the load tracking window from the original 32 milliseconds to something like 250 milliseconds, so a task has to be running for considerably longer before it gets put on the big CPU. On the other hand, that means a task that you know is interactive has to wait for that amount of time before it gets migrated, so it's a little bit hard to find the right balance. Ideally you would need a way to predict the future behaviour of a task in a finer-grained manner; instead of an average, maybe something based on where the task is waiting, a per-wait-queue estimate of how much time the task will need after it wakes up. But that's a bit into the future.

There are of course other people working on this. There are Alex Shi's power-aware scheduler patches, which are more generic, nothing big.LITTLE specific, and should offer improvements on every power-aware system. They rely on the per-entity load average patch set and make the assumption that race to idle is beneficial for power management, which is indeed the case on symmetric multiprocessing systems but not quite right for big.LITTLE. They pack tasks into as few core clusters as possible, so that the clusters that aren't running anything can be powered down, and I think they have a better chance of getting into mainline than the HMP patches do. There are other things people are looking at, like Paul McKenney's hotplug cleanup: when you don't need a core you want to shut it down, but until now hotplug was something that was only done on big server systems and nobody really took into account how long the operation takes, so there were cases where it could take something like 500 milliseconds. People have been working on that and I think it's much better now, down to a few tens of milliseconds, so when you have a policy that lets you power CPUs off, you can actually use it, and it's much faster now.
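Going back to the up/down migration described above, this is a rough sketch of that kind of threshold decision. The thresholds, the 0 to 1024 load scale and the helper names are all made up for the example; the real HMP patches do this on the tracked per-entity load inside the scheduler itself, so treat it as an illustration of the policy, not of the implementation.

```c
/*
 * Illustration of up/down-domain migration on a tracked load average.
 * With a short tracking window (the original ~32 ms) even a brief burst
 * crosses UP_THRESHOLD and the task is bounced to the big cluster; a
 * longer window (~250 ms) delays that, but filters out short tasks.
 * All values and names are made up for the example.
 */
#include <stdio.h>

enum domain { DOMAIN_DOWN /* little cores */, DOMAIN_UP /* big cores */ };

#define UP_THRESHOLD	700	/* load above this: go to the big cluster */
#define DOWN_THRESHOLD	300	/* load below this: back to the little one */

/* Decide where a task should live, with hysteresis between thresholds. */
static enum domain pick_domain(unsigned int load_avg, enum domain cur)
{
	if (load_avg > UP_THRESHOLD)
		return DOMAIN_UP;
	if (load_avg < DOWN_THRESHOLD)
		return DOMAIN_DOWN;
	return cur;	/* in between: stay where we are */
}

int main(void)
{
	enum domain d = DOMAIN_DOWN;

	/* A burst of load crosses the up threshold, then decays again. */
	unsigned int trace[] = { 100, 400, 800, 900, 500, 200 };

	for (unsigned int i = 0; i < sizeof(trace) / sizeof(trace[0]); i++) {
		d = pick_domain(trace[i], d);
		printf("load %4u -> %s\n", trace[i],
		       d == DOMAIN_UP ? "big" : "little");
	}
	return 0;
}
```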
Then we have the cluster-aware idle patches: as I said earlier, these SoCs do not have each CPU on its own power domain, so you might need to coordinate between the CPUs in order to put a whole cluster into power-down. And you probably know the big.LITTLE switcher, which was covered in a previous presentation; it treats the whole big.LITTLE problem as a cpufreq problem. It's not as efficient as MP scheduling, but it does work, which is not something we can say about MP scheduling yet.

So what happens when you actually want to try all these things? big.LITTLE systems are pretty hard to come by. The only big.LITTLE system you can get right now is, I think, ARM's Versatile Express core tile, which has two Cortex-A15s and three Cortex-A7s; it is available, but I think out of reach of the community, since it costs a few thousand dollars. The next one is the Samsung Exynos 5, which should be available soon, and there was supposed to be something called OMAP 6 on the way, but that's not going to happen, and that's the part I was working on. So how do you simulate it? You could use ARM's Fast Models emulator, but the problem is that it's quite slow for doing anything with a scheduler; you can only try synthetic benchmarks or very small fragments. You can use it to verify the hardware design and the software design, but it's pretty slow for collecting data. Since I was doing this for TI, a very cheap way to simulate it is the Panda board, which is two Cortex-A9s. It's not a big.LITTLE system, but we'll see that there are ways to simulate that pretty closely. The kernel it uses is in mainline, which is pretty important when you're doing scheduling work, and you also have access to some Android kernels, which are not mainline, but you actually want to use them because all the relevant consumer stuff you want to do on Linux is Android. And it's very cheap.

So how can you check the performance of the scheduler? You can use perf, which is part of the Linux kernel. Perf is not just for profiling; you can also use it to monitor things in the kernel, like scheduling events, and there are pretty good facilities there. There's a problem with perf, though: it's not portable between architectures, and there are differences between kernel versions, so a perf trace from one kernel version might not work on a different kernel version. That's a problem when you try to use Android, because Android usually doesn't have the latest kernel, and when you capture something there and try to use it as input on a later kernel, it won't work. The best way to visualize what's running on the CPUs is just to use perf timechart. Perf timechart uses the perf trace and generates an SVG file where you can see what's running where in your system. I don't know if you've used it before, but that's all you have to do to get a feel for what's running; it's pretty simple. What's not simple is compiling perf for Android. The other problem is that the output looks something like this; I don't know if you can tell what's running or not, it's pretty complicated. Okay.

[Audience] Does it run on Android?

Yes.

[Audience] Really?
Well, if you get the data file, then you can run it on x86, but the data comes from Android. Or rather, it just needs ftrace.

Now, the SVG files generated are pretty big: for a run of about 10 seconds you can expect a file of a few megabytes. You can use Inkscape to open it and then export it. It's a mess when you first see it, but after a while you can see what's going on, and it's easy to figure out. You can also see power events, so on a system where cpufreq works you can see the power transitions, and you know what the scheduler and cpufreq are doing. It looks something like this, and what we really care about is the per-CPU utilization. You could write something that polls the CPU load periodically, but that's not accurate enough for what we want to measure. And we have the problem that all the stuff you want to do for a consumer comes from Android; like it or not, it's the de facto consumer-level Unix API. So what I did was write a very small extension to perf that takes the perf output and gives you a pretty accurate per-CPU utilization.

Now, as I mentioned, when you do scheduler hacking you want to do it on mainline; you don't want to work on older versions. You cannot do that when you have to run Android, and you want to run Android because all the hardware works on Android, your GPU works, and so on. It's really hard to do work on mainline and keep it in sync with your Android drop. So what I did was capture data on the old kernel running Android and then do my work on mainline. For that I needed something to capture workloads portably, so that I can replay them wherever I want, so I had to write another thing: SPR replay, a scheduler playback and replay tool. I know what you're going to say: there is already a facility for capturing workloads using perf and then replaying them, but it's not exactly what I needed. What you can do is take a perf capture and generate simple instructions that encapsulate what the scheduler saw. So instead of having a trace that says a schedule-out happened at that point in time, you have a program that describes what the system was doing at that time. It's just a simple text file with the name of the running process, the PID, and a number of instructions describing what the task was doing, so it's pretty portable, and when you look at a capture it's easy to understand what it is doing. So what are the instructions? There is an instruction that says burn CPU for a number of nanoseconds, which is what a CPU-bound program does; we don't care about anything else, because the scheduler only cares about the program running. There is a sleep, which means sleep for a number of nanoseconds. There is a spawn, which captures what happens when a task forks: a clone-parent, which is the parent side of the fork, and a clone-child, which is the child side. And there is a wait-id, which waits on an event that another task signals with a signal-id. And that's all. (A small sketch of what executing these instructions amounts to follows below.)
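As an illustration of what executing such instructions amounts to, here is a sketch of "burn" and "sleep" in plain C. This is not the actual SPR replay tool, only an assumption of how those two instructions could be implemented: from the scheduler's point of view, a burn is time spent runnable on a CPU and a sleep is time spent off the runqueue.

```c
/*
 * Sketch of two of the replay instructions: "burn CPU for N ns" as a
 * busy loop and "sleep for N ns" as a nanosleep.  Not the actual SPR
 * replay tool, just a minimal illustration of the idea.
 */
#define _POSIX_C_SOURCE 199309L
#include <time.h>
#include <stdint.h>

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* "burn <ns>": stay runnable and chew CPU for that long. */
static void burn_ns(uint64_t ns)
{
	uint64_t end = now_ns() + ns;

	while (now_ns() < end)
		;	/* busy loop: all the scheduler sees is CPU time */
}

/* "sleep <ns>": leave the runqueue for that long. */
static void sleep_ns(uint64_t ns)
{
	struct timespec ts = {
		.tv_sec  = ns / 1000000000ull,
		.tv_nsec = ns % 1000000000ull,
	};

	nanosleep(&ts, NULL);
}

int main(void)
{
	/* A toy "program": burn 2 s of CPU, then sleep for half a second. */
	burn_ns(2000000000ull);
	sleep_ns(500000000ull);
	return 0;
}
```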
So how do you do it? Just run a record, capture your workload, and run timechart so that you have a graphical way to see what's going on. Use that to find the task that was running; each line is the name of the task, the PID and its runtime. You can also capture the behaviour of one specific task; when you do that, all the other tasks just go away and you get a simple example: a CPU cycler that just runs for two seconds, and if you capture just that, you see that it's burning CPU for two seconds. Once you have that capture, you can execute it on a different system and there are no portability issues. Do you have any questions up to now?

So that handles capturing data for evaluating a big.LITTLE system, or a simulated big.LITTLE system. Now we have to see how we can simulate a big.LITTLE system on a platform that isn't big.LITTLE. The problem with capturing data from an interactive workload is that you need a pretty close approximation of the user experience; you cannot use a synthetic task, because you want to see exactly how a real user would use your system. If you're running on an x86 system you're pretty lucky, because the CPUs have individual cpufreq controls, but you're not as lucky on ARM systems, because each cluster has its own power domain and CPU frequency, so the CPUs are dependent on each other. So I had to find a way to simulate that, and what I came up with is a virtual cpufreq driver. I don't know if you've ever seen a system under an IRQ storm, where when you have too many interrupts your system just hangs, or, if the interrupt rate is not that high, it just slows down. The thing is, interrupts are invisible to the scheduler. So if you have a way to generate interrupts routed directly to a specific CPU, you can slow that CPU down accordingly, and with that you can simulate a big.LITTLE system. What you really need is just a way to route an interrupt to a specific CPU, and a way to calculate what the interrupt rate should be and how much time to spend in the interrupt handler. The virtual cpufreq driver has two back-ends: one uses the DM timers, which are OMAP-specific but pretty accurate, and there is also a generic back-end that uses high-resolution timers; it's not as accurate, but if you don't have a very accurate timer you can still make it work. There are a few config options, just how to enable it and things like that.

Here is what it looks like in practice. I don't know if you can see it, but there's no longer a line saying that CPU 0 is dependent on the CPU frequency of CPU 1; they're unbound now. And this is what you get on the Panda board: when you set the maximum frequency, a CPU cycler spins for one second and outputs a counter saying how many loops happened during that second, similar to the way the BogoMIPS calibration works in the kernel (a sketch of such a loop is shown below). When you set both CPUs to the maximum frequency, you get a number close to 8.7 million, and when I use the virtual driver to set just CPU 0 to a lower frequency, we get a lower value. If you do the math, you see that using the hardware way of setting the frequency you'd get a ratio of 0.76, while the ratio using virtual cpufreq is pretty close to that. And the thing is, it has absolutely no effect on the scheduler: as far as it's concerned, it's just like running on CPUs that happen to be different.
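A loop-counting test like the one just mentioned is easy to sketch. This is only an illustration of the idea, count how many iterations of a busy loop fit in one second, and not the actual program used for the measurements; the numbers it prints obviously depend on the machine.

```c
/*
 * Toy loop-counting calibration, in the spirit of the test described
 * above and of the kernel's BogoMIPS calibration: spin for one second
 * and report how many loop iterations fit.  Illustrative only.
 */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

static double seconds_now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	volatile unsigned long loops = 0;	/* volatile: keep the loop */
	double start = seconds_now();

	while (seconds_now() - start < 1.0)
		loops++;

	/* A slowed-down (or interrupted) CPU yields a lower count. */
	printf("loops per second: %lu\n", (unsigned long)loops);
	return 0;
}
```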
So, as I said about the Linaro MP patches: they partition the CPUs into two domains, fast and slow, they use the per-entity load average patches to track the load of each task, and they have a way to migrate processes to and from the domains. The estimates are not always correct, though, and interactive workloads are not easily characterized by the load average. If you try to figure out why, it's because interactive tasks have an event loop, and depending on the event that ends up in that loop, a different amount of CPU time is needed to execute its handler. A mouse event might use the CPU for 200 milliseconds, while a key press, for example in a web browser, would use much more CPU, so if you only look at the load average, it doesn't tell you much. Well, I managed to do some work using the Linaro MP patches and some changes of my own, and this is a run of the Facebook application on Android. As you can see it's pretty messy, but what you can see is that while the system was idle and lots of work was being done on the little cluster, CPU 0, a task that needed more time appeared, and after running for a short while on CPU 0 it ended up being migrated to CPU 1. So it works pretty well, but it's not ideal yet. That's the state of things right now.

OK, so how can we fix that? There has to be a way, and I think it is to figure out what we actually want a power scheduler to do. The scheduler right now doesn't care at all about power; it doesn't even track it. The first thing we need to do is track the amount of power a task consumed while executing, in a similar way to how the load average patches work; maybe we need power consumption average patches.
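There is no such facility in the kernel today, so purely as a sketch of what a per-task "power consumption average" could mean, here is a toy version that charges a task for the time it ran, weighted by a per-CPU power cost; all the names and numbers are hypothetical.

```c
/*
 * Toy sketch of per-task "power consumption" tracking: charge each task
 * for the time it ran, weighted by a per-CPU power cost.  This is not
 * an existing kernel facility; the power numbers and structure names
 * are made up purely to illustrate the idea discussed above.
 */
#include <stdio.h>

/* Hypothetical relative power cost while busy (bogowatts). */
static const unsigned int cpu_power_cost[] = {
	[0] = 1,	/* little core */
	[1] = 3,	/* big core    */
};

struct task_energy {
	unsigned long long consumed;	/* bogowatt-nanoseconds */
};

/* Called when a task is scheduled out after running for runtime_ns. */
static void account_energy(struct task_energy *te, int cpu,
			   unsigned long long runtime_ns)
{
	te->consumed += runtime_ns * cpu_power_cost[cpu];
}

int main(void)
{
	struct task_energy te = { 0 };

	account_energy(&te, 0, 5000000ull);	/* 5 ms on the little core */
	account_energy(&te, 1, 2000000ull);	/* 2 ms on the big core    */
	printf("consumed: %llu bogowatt-ns\n", te.consumed);
	return 0;
}
```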
Another thing that's very problematic is that the kernel is reactive: you do something, the load goes up, and only then does it react and do something about it. What we need is a way to predict the behaviour of a task and make the changes beforehand. Say the system is idle and the user presses a key: we should be able to detect that something is about to happen and increase the frequency, or switch to the big domain, before the user experiences any latency. That's a pretty big change, but it's not impossible; something similar happens all the time inside a recent CPU, and that's branch prediction. It is possible to predict branches pretty accurately now, so it should be possible to predict task wake-ups too. So how should we continue? Maybe keep an estimate of the predicted CPU load at each point in time, and apply the change to the power policy before the condition actually triggers. But for all this to work, instead of having all the power-related facilities pretty much isolated, they should work under the control of the scheduler; the scheduler should be able to say, I'm going to need more CPU capacity in a couple of seconds, so trigger that change before it happens. As you can see, this is very much work in progress; we're doing something, but we're not quite there yet. So if you have any ideas, just let us know. I was doing my best, but it wasn't enough; it's still an unsolved problem. So, any questions?

[Audience] Is there anything available from the scheduler itself that will allow you to track some of the scheduler events?

Yes, there is, but you will be trying to drink from a fire hose: the events are generated at a tremendous rate. That's why perf is pretty good at this, it can get the data out pretty fast. You could write your own tool that connects to the perf interface and pulls the events yourself, but it's not the simplest thing; it's not just a sysfs attribute file you can read. Yes, there are statistics, but that's different; they are useful, but they're not events, they're more of an average. If you need the events, you have to use something else.

[Audience] As far as the anticipatory business goes, there is a pretty large body of academic research, going back probably 20 or 30 years, in the real-time space, about exactly that: can I ramp up the voltage and frequency in anticipation of needing CPU power to execute my time-critical work? There is definitely some work that has already been done.

Yes, but the thing is we have a multidimensional problem: you have to balance power, performance, latency, thermal considerations; it's like you have multiple dimensions you have to optimize at once. And the scheduler hackers don't want any kind of matrix multiplications in the scheduler, and you have to do all your calculations in fixed point and with lookup tables, so that complicates things even more. Race to idle makes sense in an SMP environment, but in a big.LITTLE environment it might be better to run your task on the little CPU if you know it's not going to take that much time. It's a matter of finding the balance, because a task running on the little CPU consumes about one third of the power it would consume on the big CPU, so if you don't care about latency, run it there.

Yes, that's part of the Linaro patches; they do do that, and you have to, otherwise the metric breaks. That's the whole idea: you put in the weights in order to keep your load average metric valid, because if you don't, and you just keep track of time, then when you migrate all your history is bogus.

[Audience] Have you thought about whether modifications to a deadline scheduler might help with this? It could help as a component, as the latency dimension; if you take latency into consideration, then you have something like a deadline scheduler. You're talking about being proactive, and the concept of the deadline scheduler, at least, is that you can actually ensure that your task gets a certain utilization.

Yes, but the thing is you have to configure it: the task has to declare how much time it's going to take. We're talking about running Android applications, and Android applications are already out there; there's no way you can get an app developer to measure all the things needed to configure those parameters. Well, the good developers could do that, but the guys who just write fart apps, there's no way they will. No, it is very useful, but it's just one dimension of the problem; you have multiple dimensions now. The scheduler was written for a different time; things have changed, and we need to change with them. That's part of what I'm talking about, it's more than one dimension: you could have cache effects, cache affinity and task affinity, and you should take those into account when you migrate. It's not like there's a single thing you can do that will fix everything. It looks like we need to attack the problem from all sides.
Anyone else? Thank you very much for being here.