What is big.LITTLE? big.LITTLE is an architecture that combines different CPUs that are architecturally identical: the CPUs have different capabilities, but they implement the same architecture. Code that runs on one processor can run on the other, and not only run but migrate seamlessly. Because the architectures match, there is also a one-to-one mapping between the register sets of one processor and the other, and that is key to our design. The processors, as I said, are architecturally the same, but they differ in the performance they deliver. The chip we have been working with is the TC2 from ARM. It has a cluster of Cortex-A15 cores and a cluster of Cortex-A7 cores; there are three A7 cores and two A15 cores. Normally you want the same number of cores in both clusters. The odd number of cores in the A7 cluster, the little cluster, was on purpose: you deliberately design an awkward system to make sure you catch all the corner cases. But normally you want a system with the same number of cores in each cluster. As I said, the A15 and the A7 are architecturally identical; they both implement the ARMv7-A architecture. The caches are snooped through a cache coherent interconnect, the CCI-400, and there is also an interrupt controller that allows interrupts from any source to be routed to any destination, any processor to any processor. Here is a picture of the chip: our two clusters, the L2 caches, the interconnect at the bottom that is seamlessly snooping the caches, and the programmable interrupt routing at the top. There is also an I/O coherent master port in the design.
We didn't have to use it, simply because we did not have an external processor, such as a GPU, that needed its caches kept coherent with the rest of the system. The idea behind big.LITTLE is to provide maximum performance with maximum power saving: we want the performance that the A15 cores give you, with the power savings that the A7 cores provide. The original idea was to run CPU-intensive tasks on the big cores and tasks that don't require as much horsepower on the little cores. But that is very difficult, because you can't predict the future: you don't know which tasks will be CPU hungry and which will be light. Another way of approaching the problem is to go by system load: if your system is very busy, you use the big cores; if it is lightly loaded, you save power and move to the little cores. That is the mindset we decided to adopt in our design. We currently have two projects in parallel. One is called HMP, heterogeneous multiprocessing, and the other is IKS, the in-kernel switcher. The cool thing is that they are present in the same tree, so one can move from one scheme to the other on the fly. This is exposed in sysfs: you can turn the IKS solution on whenever you want, or switch it off in order to run different benchmarks. The in-kernel switcher can also be enabled at boot time from the kernel command line, or via the kernel config at compile time.
So let's talk about HMP a little. In HMP, all the cores can be powered on at the same time and can all be processing instructions at the same time. They don't have to be, but if your system is heavily loaded, every core can participate. For that to happen, the scheduler has to be aware of the different processing capabilities of the cores in your system. It provides higher peak performance in some of the tests we've run. What's interesting is that everybody can participate in this project: it is being done in the open, our tree is public, and it is currently running. The fact that it's being done in partnership with the community introduces delays that we are all familiar with, which is why we were also working on IKS, the in-kernel switcher. IKS is a stepping-stone solution. It's not perfect, but right now it boots, it provides very good performance, and most important, it is readily available. It gives people a solution they can start basing their products on in order to move quicker. In IKS, a big CPU and a little CPU are coupled together into a virtual entity, and that virtual entity is presented to the kernel as if it were one processor. That allows us to reuse everything the current SMP scheme gives us: the kernel is not aware of the big.LITTLE architecture it is running on. Almost all of the code we wrote pertains to the management of the clusters, not to scheduling or anything else. I'd say 99% of what's in the kernel right now has been reused; we didn't have to change it. It's important to understand that once a virtual core is presented to the kernel, only one of its two physical cores, either the A7 or the A15, can be processing instructions at any given time.
Okay, so if you started your system with six CPUs, three in your big cluster and three in your little cluster, then once the switcher logic is turned on you end up with three virtual cores. Within one virtual core, the decision to use either the big core or the little core is taken at the cpufreq driver level, and that is the only entity, aside from the in-kernel switcher itself, that is aware of this virtual coupling. Aside from that, it's seamless: everything that pertains to a normal core also applies in this case. This solution, as I said, is readily available. We released it to our members back in December, and we are still fixing bugs on it; there aren't many, and it's giving pretty good results, which we'll talk about later. So it's pretty much done and it's working very well. One possible way of coupling our CPUs was to go horizontally: as I said earlier, if your system is lightly loaded you use the little cluster, and as the demand on the CPUs increases you move on to the big cluster. The problem with this approach is that the granularity is too coarse; it's all or nothing. You're either on the little cluster or on the big cluster, and there are power savings to be had in between. It also introduces a synchronization period that we did not want to pay: if the first core, say A7 #0, is ready to move to the big cluster, some synchronization has to happen with the second core in the cluster for it to be ready to move as well, and while that happens you can't really process anything.
You can't continue the work those cores were doing, and that increases your blackout period. A better solution was to integrate vertically. By integrating vertically you create two virtual CPUs that are basically independent from one another, so at any given time processing can be happening on different physical cores independently. If virtual CPU 0 is running on its little core, nothing prevents the big core from being used on virtual CPU 1, and so on for however many virtual CPUs your clusters give you. The idea is that all of the virtual CPUs are independent of one another, and each can use whichever core within its virtual pairing it wants. There is no rocket science behind how the CPUs are grouped together: in the switcher logic, we simply take the sequential number of each CPU in a cluster and couple the matching pairs together. As I said earlier, in the TC2 implementation we have a stray A7 core; that A7 has no counterpart in the A15 cluster, so we simply turn it off. So in order to make maximum use of your hardware with the IKS solution, you had best have the same number of cores in both clusters.
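A minimal sketch of that pairing rule, using the TC2 core counts. The function name and shape are mine, not the actual switcher code; the rule itself (couple by sequential index, power off any stray core) is what was just described.

```c
#define NR_LITTLE 3   /* A7 cores on TC2  */
#define NR_BIG    2   /* A15 cores on TC2 */

/* Couple CPU i of one cluster with CPU i of the other, by sequential number.
 * Returns the partner's index, or -1 when there is no counterpart, in which
 * case the switcher simply powers the core off. */
int partner(int cpu, int own_size, int other_size)
{
    if (cpu < 0 || cpu >= own_size)
        return -1;            /* no such core in this cluster */
    if (cpu >= other_size)
        return -1;            /* the stray core: switched off */
    return cpu;               /* same index in the other cluster */
}
```

On TC2 this pairs A7 #0 and #1 with A15 #0 and #1, while `partner(2, NR_LITTLE, NR_BIG)` returns -1 for the stray A7 #2, which stays powered off.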
Otherwise, the cores that don't have a corresponding core in the other cluster are simply switched off. Once the grouping has been done, one of the two cores in each virtual CPU is switched off. The current algorithm is: if the core is part of the booting cluster, keep it on; otherwise, switch it off. It really doesn't matter, though, because as soon as user space is loaded the governor starts normalizing everything, and from then on either the big core or the little core is used depending on how loaded the system is. As I said, nobody aside from the switcher logic and the cpufreq driver needs to be aware of the coupling, and that allows us to hide the big.LITTLE implementation from the rest of the kernel.

As for when the switcher logic initializes: when the kernel boots, all the processors are discovered by the normal SMP boot code. Up until the point where the switcher logic is turned on, the system sees all of the CPUs. The switcher logic then comes on and synchronizes with the cpufreq core, telling it to unregister itself. Initialization of the switcher logic is done, the CPUs are coupled together and presented to the kernel as a set of virtual CPUs, and once that is complete another message is sent to the cpufreq driver to re-initialize itself, but this time against the virtual CPUs rather than the real cores that existed before. During that re-initialization, the set of operating points that the virtual CPUs can accommodate is the aggregation of those of your little core and your big core. So if you had, for instance, eight operating points on the A7 and eight operating points on the A15, then once the switcher logic has been turned on and the cpufreq driver has re-initialized, 16 frequencies are presented to the cpufreq core.

This is basically it. Before the switcher logic, you have the A15 that can go from 500 MHz to 1.2 GHz, and the A7's own range, and those frequencies are presented to the cpufreq core separately. Once the switcher logic has been turned on, we expose 16 frequencies to the core, which is the compounding of our A7 and A15 tables. You will also notice that the A7 frequencies appear at half their real value. These values can be whatever you want; they could be numbered 1 to 17, it would not matter. They are simply indexes that the cpufreq core uses when the governor requests an increase or decrease of CPU performance, so we can write whatever numbers we want there. What is important to understand is that if a frequency of 300 is requested, the cpufreq driver will turn around and say: 300 on the A7, that's really 600 MHz. So the real frequency that will be running is 600 MHz. When I've given this talk before, people have asked: why are you reducing the capabilities of the A7? They're not really reduced. This is simply a table sent to the cpufreq core to tell it about the operating points the architecture offers. This is very key to the solution. Any questions on this? (From the audience: does it matter if the numbers are sequential?) No, absolutely not. (Is the ratio specific to your implementation?) Yes, that's definitely specific to our implementation; in our case it held. For the A15 and A7 the ratio was about half: you can do roughly twice as much processing on the A15 as on the A7. So when we first did this, we simply doubled the frequencies of the A15 instead, but it's really the same thing; as I said, these are simply indexes. It could be 1 to 17, it doesn't matter.
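That table trick can be sketched as follows. The specific operating-point values here are invented for illustration; only the roughly 2:1 performance ratio and the 300-maps-to-600-MHz example come from the talk, and the function names are mine.

```c
#define NR_OPP 8

enum cluster { A7, A15 };

/* Invented real operating points, in kHz. The lowest A7 point is chosen so
 * the virtual range starts at 175 MHz, as in the talk. */
static const unsigned int a7_opp[NR_OPP] =
        {  350000, 400000, 500000, 600000, 700000, 800000, 900000, 1000000 };
static const unsigned int a15_opp[NR_OPP] =
        {  500000, 600000, 700000, 800000, 900000, 1000000, 1100000, 1200000 };

unsigned int virt_opp[2 * NR_OPP];       /* what the cpufreq core sees */

/* The A15 does roughly twice the work per MHz, so the A7 entries are divided
 * by two before being merged into one contiguous 16-entry table. */
void build_virtual_table(void)
{
    int i;
    for (i = 0; i < NR_OPP; i++) {
        virt_opp[i] = a7_opp[i] / 2;     /* virtualized little frequencies */
        virt_opp[NR_OPP + i] = a15_opp[i];
    }
}

/* The driver's job: translate a requested virtual frequency back into a
 * cluster and a real frequency. A request for 300000 kHz comes back as the
 * A7 running at a real 600000 kHz. */
unsigned int virt_to_real(unsigned int vfreq, enum cluster *c)
{
    if (vfreq <= a7_opp[NR_OPP - 1] / 2) {
        *c = A7;
        return vfreq * 2;
    }
    *c = A15;
    return vfreq;
}
```

The kernel only ever sees the monotonic 16-entry virtual table; the conversion back to a real cluster and frequency stays private to the driver.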
It's up to the cpufreq driver to do the proper conversion in the background. (Sorry about the technology hiccup.) Okay, so here is another way of looking at what I just presented. We have our real frequencies, our real operating points, with power on the other axis; the blue line is the A7 and the red line is the A15. When the switcher logic is enabled, this is what is presented to the kernel: one long contiguous range of operating points, and in the middle a chasm that we bridge as we go from the little core to the big core. We'll come back to that graph later. The cpufreq driver is doing a lot in this solution, but that "a lot" turns out to be very simple. Going from one core to the other inside a virtual CPU really comes down to about eight lines of code: if you're on the A7 cluster and a request comes in for a frequency that can only be accommodated by the A15, then your new cluster is the A15, and the converse takes you back to the A7. There is some housekeeping that happens between the two, but in the end, once the cpufreq driver has determined that we want to move to another cluster, it simply calls the switcher logic to move the current processing from one core to the other, and that is still within the virtual entity presented to the kernel. So far we have seen that the switcher logic and the cpufreq driver are aware of our little scheme, but nobody else, and that's exactly what we'll look at next: how do we go from one core to the other?
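Before the step-by-step answer, those "eight lines" in the cpufreq driver can be sketched roughly like this. All names and the threshold value are mine, not the actual driver; the logic is the frequency-to-cluster decision just described.

```c
#include <stdio.h>

enum cluster { A7, A15 };

/* Highest virtualized frequency (kHz) the little cluster can serve;
 * an assumed value for illustration. */
#define A7_VIRT_MAX 500000

/* The heart of the decision: a requested virtual frequency that only the
 * A15 can serve means the virtual CPU's new cluster is the A15, and the
 * converse takes it back to the A7. */
enum cluster pick_cluster(enum cluster cur, unsigned int vfreq_khz)
{
    enum cluster next = (vfreq_khz > A7_VIRT_MAX) ? A15 : A7;

    if (next != cur)
        printf("asking the switcher: migrate to %s\n",
               next == A15 ? "A15" : "A7");  /* where the real driver calls  */
                                             /* into the switcher logic      */
    return next;
}
```

Everything else, the actual migration, is handed off to the switcher; the driver only decides.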
Take an example: at a given moment we are on virtual CPU 0, running at 200 MHz, and a new request comes in to run at 1.2 GHz. The cpufreq core is allowed to ask for that, because we presented one big range: as far as it knows, the CPU can operate anywhere between 175 MHz and 1.2 GHz, so a request like that is definitely sound and should be accommodated. On the flip side, the cpufreq driver knows that the A7 cannot deal with a frequency of 1.2 GHz. So let's walk through what happens when the cpufreq driver decides to go from one core to the other. As I said, we're running on virtual CPU 0 at 200 MHz and we want to go to 1.2 GHz. In this scenario the A7 is labeled the outbound processor, and the inbound processor is the A15. When the request comes in, the outbound CPU switches on the inbound, that is, powers it up. Normally, when a CPU is powered up, the kernel simply has it enter secondary startup. In our case that's fine, but we also need to power on the cluster and switch on the CCI if those haven't been set up yet, so we created a new startup entry point for the CPUs that does just that. So the outbound powers up the inbound, the inbound comes to life, and we're executing. It looks at the state of the cluster.
It looks at the state of the CCI, initializing the cluster and turning on the snoop interface if it is the first CPU in the set. Once that is done, it simply waits in a tight loop for an instruction from the outbound processor before going any further. We do this because we want exact synchronization between the two cores in order to allow cache snooping between them: you don't want to have to go back to RAM after a switch-over, you want to keep your cache contents, and this is why we have the cache coherent interconnect. We snoop the caches in order to minimize the hit a switch-over costs you. The interesting thing is that the outbound is still fetching and processing while the inbound is coming up; again, that is to minimize the cost of moving from one core to the other. The outbound simply waits for an "inbound alive" signal. Once that has happened, we have an outbound that is ready to go out and an inbound that is ready to come in. Then, as a normal step of operation, interrupts are disabled, and interrupts that were routed to the outbound processor are rerouted to the inbound. This is when the blackout period starts, where nothing happens in the system. The current context of the CPU is saved, and once again we come back to the one-to-one mapping I talked about at the beginning: both CPUs implement the same architecture, which is what makes this possible. Once the context has been saved, we instruct the inbound to start at secondary startup, and secondary startup is fed the same CPU ID that the outbound processor had. Therefore, the inbound starts exactly at the point where the outbound left off.
Once the inbound is executing secondary startup, it simply enables interrupts, handshakes with the outbound, and continues normal execution. It is exactly as if you were reopening the lid of your laptop and the CPU were coming back to life: the very same sequence, except that instead of resuming on one processor, we are now on the other. This is possible because of two things: we fed the inbound the exact same CPU ID the outbound left us, and we have a one-to-one mapping between our CPUs in terms of register sets. (To a question from the audience: no, the context does not go through the CCI. The CCI is not involved in this at all; its only job is to snoop the caches. Earlier in the sequence we opened the gate on the CCI for the inbound processor, and now we are just about to close the gate on the outbound.) Once the inbound is alive, kicking, and executing, it is very important that its stack and its context are not disrupted by the cleaning up of the outbound; otherwise you would step over the context the inbound is actively using and cause a crash. For that purpose, a new stack is spun off as soon as secondary startup is called on the inbound CPU. With that new stack in place, you can do pretty much whatever you want, and so we go about shutting down the outbound. At that point the cache is flushed; if this is the last CPU in its cluster, then the CCI interface for that cluster is switched off and the cluster itself is disabled; and then we call wait-for-interrupt. That's it, the outbound is dead. (Another question from the audience:) yes, the "last man" accounting is done per cluster.
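Putting the walkthrough together, the sequence reads roughly like this in pseudo-C. Every name here is invented for illustration; none of this is the actual kernel code, only the ordering described above.

```c
/* Pseudocode sketch of one IKS switch; all names are invented. */
void iks_switch(int vcpu, int outbound, int inbound)
{
    /* Outbound side, still fetching and processing: */
    power_up(inbound);                /* inbound enters the custom startup  */
                                      /* point: power the cluster, enable   */
                                      /* CCI snooping if first in the set   */
    wait_for_inbound_alive();         /* inbound spins in its tight loop    */

    local_irq_disable();              /* --- blackout period begins ---     */
    migrate_irqs(outbound, inbound);  /* reroute interrupts to the inbound  */
    save_context(vcpu);               /* portable thanks to the one-to-one  */
                                      /* register mapping                   */
    release_inbound(inbound);         /* resumes at secondary startup with  */
                                      /* the same CPU ID the outbound had;  */
                                      /* enables interrupts, handshakes,    */
                                      /* runs on -- blackout over           */

    /* Outbound teardown, kept off the live context's stack: */
    flush_own_cache();
    if (last_man_in_cluster(outbound)) {
        cci_disable_snoops(outbound); /* close the gate on the CCI port     */
        cluster_power_down(outbound);
    }
    wfi();                            /* that's it: the outbound is dead    */
}
```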
Right, so basically: if you have a cluster of four cores, whether it's the big or the little one, and this is the last CPU in that cluster to be switched off, then what you want is to disable snooping for that interface and disable the entire cluster, because both take power. That's the "last man" case. (Q: do the big and the little clusters snoop at the same clock rate?) Absolutely. [Inaudible exchange; the question is rephrased: aren't these switches expensive?] Yes, each switch costs a blackout period for sure, but I doubt you would be able to see it from a user's point of view: the switch-over takes on the order of 20 microseconds. You might take that one hit, but after that it's gone; you're running on the other core. (Q: how does this interact with CPU idle?) It's important to understand that a cpuidle driver is still running in the system at the same time, but those are two different operations: cpuidle switches off an entire virtual CPU, rather than moving work between the two cores inside one. (How am I doing for time? Okay, not bad.) And yes, it is all seamless; those are two different things. (Q: could you try to predict switches?) Absolutely, and that's why at one point we had to stop. We are producing a model, a reference platform, but you can go absolutely math-crazy trying to infer that the next switch is going to cost too much and should therefore be avoided. There is no end to the number of algorithms you can invent for that, and at some point you have to ask yourself: do I have a product, or do I not have a product?
And one algorithm will probably work for eighty percent of your test cases, and then what do you do with the other twenty percent? You have to start over; it becomes extremely difficult. Okay, a few things we haven't talked about so far, but that you will stumble on pretty quickly if you start looking at our code. First, the mutual exclusion we have for configuring the clusters. This happens in real time: we cannot count on there being only one processor configuring the CCI at any given time, so we needed to introduce a mutual exclusion algorithm that works in assembler, in order to prevent multiple CPUs from trying to configure the CCI and the clusters at the same time. We talked about the "last man standing"; what we covered is just the tip of the iceberg. There is also the early-poke mechanism, the handshake mechanism that works with IPIs between the two processors; we can't go into the details here, but it is there and it is key to the solution. And we are also tracking the states of our CPUs and our clusters. That matters when, for instance, one CPU has been requested to shut down while another CPU requests to bring it up; that is genuinely difficult to get right. Okay, so now we know about the in-kernel switcher, the cpufreq driver, and the cpufreq core. What we haven't talked about is the governor. All our benchmarking has been done with the interactive governor, because the benchmark itself, bbench, was running on Android, and we were also targeting an Android audience.
So we used the interactive governor, though any governor can be used with IKS; we have used ondemand and it works pretty well too. Any type of governor will work, as long as you have a cpufreq driver, but you will very likely have to tune it, and that is what we did. In its original form, the interactive governor responds to how loaded the system is: once the system is loaded past a certain point, let's say 80 percent, it jumps the CPU to a certain frequency, and from there it slowly ramps up through the remaining operating points. That works very well, but in our case we don't have just one CPU, we have a pair of CPUs within the same virtual core. So, exactly like the cpufreq driver does, we needed to shield the highest operating points, the costliest in terms of power, the last ones on the big core. That is why we introduced a second hispeed_freq variable that does exactly what the first one does, and that prevents the governor from reaching the overdrive point on the A15. It is the exact same ramp algorithm the interactive governor already had, simply duplicated, and it allowed us to make significant power savings around the big chasm in the middle of our range.

Our first hispeed_freq, which is basically the frequency you jump to once the CPU load has passed a certain threshold, we set to about 500 MHz. That allowed us to run on the A7 85 percent of the time; or rather, as long as the CPU was loaded at less than 85 percent, we would run on the A7, and from there we would move up to the A15. Okay, so as I said, we were working with bbench as our performance metric; bbench gives you a score for how quickly web pages are rendered. We also measured the power consumed at each core. And we threw audio into the mix: browsing the web while playing audio in the background, which we thought would be a good use case. Our goal was to get to a 60/90 ratio, and this is very key, because this is how we knew we could stop. The reference is the most powerful system, one consisting of only A15 cores: that was our performance and power baseline. With our big.LITTLE solution we were able to achieve 60 percent of the power consumption of a system with two A15s, and yet reach 90 percent of the performance of such a system. That's the 60/90 ratio. Depending on the tuning you do, you can reach different points. Again, relative to our best, most performant system, one with two A15s: the blue diamond shows our 90/60 point, 90 percent of the performance at 60 percent of the power. If you tune for a very crisp system, you get the red box: a system running at 95 percent of the performance with about 65 percent of the power. This was very cool, and we wanted to continue optimizing, but we had a variance in the results of about five to ten percent.
So we had to stop. (To a question about energy:) this here is a system consisting of two A7 cores, our best system in terms of power saving. The green triangle shows that if you do not tune your solution properly, you can actually do worse than that best power-saving system. And this is something I really have to stress, and will keep stressing for the rest of this talk: if you take this solution and slap it on top of your own implementation, you will spend as much time tuning it as you spent bringing it up. If you do not tune your governor properly, all efforts are wasted, and that is exactly what this graph shows. The interactive governor, the one we took, has about 15 to 20 variables that can be tuned. We mainly dealt with hispeed_load and hispeed_freq, which together say: if the system is loaded to X, go to this frequency. As I said earlier, we made sure that below 85 percent CPU load we would be running on the A7 cluster; between 85 and 95 percent we would use the first six operating points on the A15; and above 95 percent you reach the overdrive point, maximum performance. We also found a few gotchas in the interactive governor. A perfect example is above_hispeed_delay versus timer_rate. above_hispeed_delay is how long the governor waits before moving from one operating point to the next; timer_rate is how often it samples the load to decide whether to readjust the operating point. If your timer_rate is twice your above_hispeed_delay, then you are missing half of your chances to adjust the system. As I said, you can really shoot yourself in the foot with big.LITTLE if you don't tune your system properly. Okay. Earlier I showed the blue diamond, the red box, and the green triangle.
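To make the tuning discussion concrete, here is a rough sketch of the two-stage ramp and the timer gotcha described above. The 85%/95% thresholds come from the talk; the frequency values, the step size, and all names are assumptions of mine, not the actual governor tunables or code.

```c
/* All frequencies are virtualized kHz. */
#define HISPEED_FREQ_1   500000  /* top of the A7 range: first jump target */
#define HISPEED_FREQ_2  1100000  /* shields the A15 overdrive point        */
#define OVERDRIVE_FREQ  1200000
#define HISPEED_LOAD_1       85  /* % load */
#define HISPEED_LOAD_2       95
#define STEP             100000

/* One evaluation of the duplicated ramp: given a load sample and the current
 * virtual frequency, return the frequency to request next. */
unsigned int target_freq(unsigned int load_pct, unsigned int cur_freq)
{
    if (load_pct < HISPEED_LOAD_1)
        return cur_freq;              /* light load: stay in the A7 range   */
    if (cur_freq < HISPEED_FREQ_1)
        return HISPEED_FREQ_1;        /* busy: jump to the A7/A15 boundary  */
    if (load_pct < HISPEED_LOAD_2)    /* 85-95%: ramp through the first A15 */
        return cur_freq + STEP < HISPEED_FREQ_2 ?  /* points, never past    */
               cur_freq + STEP : HISPEED_FREQ_2;   /* the second shield     */
    return OVERDRIVE_FREQ;            /* >95%: allow the overdrive point    */
}

/* The gotcha: if the load is only sampled every timer_rate microseconds,
 * an above_hispeed_delay shorter than that buys you nothing, and a
 * timer_rate twice the delay misses half the readjustment chances. */
int tuning_is_sane(unsigned int timer_rate_us,
                   unsigned int above_hispeed_delay_us)
{
    return timer_rate_us <= above_hispeed_delay_us;
}
```

The point of the second shield is visible in the middle branch: no single decision ever lands on the overdrive point unless the load has actually crossed the upper threshold.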
So these are the numbers for the configurations that yield those performance points. On upstreaming: we have started upstreaming this solution. All of the code that pertains to the power management of the clusters has been pushed by Nicolas, and so far I understand it has received favorable reviews on the mailing list. This is really the foundation of the solution, the power management, so as soon as it has been accepted, we will move on to the next stages. Our goal is to upstream it entirely. Regardless, the entire code base will become public, the entire solution will become public, as soon as one of our members releases a product with the solution in it. And these are all the people who have contributed to this project; there are a lot more people at ARM who don't appear here.