Thank you. For those who were not here an hour ago: I am a researcher at IBM Research, and I will now present another piece of the puzzle of this EPOCHS project that we are leading at IBM. I will very quickly skip the acknowledgements; they are the same as before, same people, thank you to all of them. In the interest of time I will move forward.

I have said this already, and at this time of the day I think I have already seen maybe four or five talks motivating this idea of hardware specialization, the heterogeneous chip, so I don't need to go into too much detail: the hardware-specialization era is already here. This is an interesting die photo of a cell-phone processor, where you can see, for example, that in addition to the traditional general-purpose CPUs you today also have GPUs, one or more, and plenty of acceleration engines for different functions of the phone. So this is a very nice example of a highly heterogeneous processor. But the question is: how do we schedule applications, processes, and threads on this kind of heterogeneous chip? It is not straightforward, because conventional schedulers like the Linux scheduler are not optimized to exploit the characteristics of a heterogeneous chip. So we think there is a need, or at least a call, to think harder about creating more intelligent and efficient schedulers for heterogeneous chips.

And how does that relate to GNU Radio? In my previous talk I presented the ERA application, where I showed a GNU Radio 802.11p transceiver; but the application also contained a whole processing chain with multiple blocks doing different things. So in general the message is: today's applications are very heterogeneous in their nature.
So it's not just about the hardware: the software is very heterogeneous too. In other words, the heterogeneity at the software level, in the application, is driving the need for specialized hardware. In this program we can see that there are many blocks, all of them doing different things. If we have to execute this program, we may already have an underlying chip with some accelerators: say an accelerator for the FFT, an accelerator for Viterbi decoding, accelerators for other things. But as far as we know, and this connects, I think, very well with this morning's talk by Marcus, the current version of the GNU Radio scheduler is not aware of this degree of heterogeneity in the chip, which, if exploited properly, could provide significant benefits in terms of throughput as well as power and performance efficiency. Regarding the GNU Radio scheduler, some prior work, this is a paper from Bastian, has shown that some simple tweaks to the scheduler can already provide great improvements, in that case for cache effectiveness. So we believe there is room for improvement, for example in scheduling on heterogeneous chips.

This is also the big picture I presented before: when I talked about the ERA, I was moving in the application layer; now I will be moving in the operating-system layer. With this project we are addressing the entire hardware-software stack in the DARPA DSSoC program. So let's go into this idea of task scheduling on heterogeneous platforms. I will present an open-source tool that we developed, called STOMP, which allows us to prototype new scheduling policies in a very easy and fast manner. STOMP stands for Scheduling Techniques Optimization in Multi-Processors; I didn't want to put the H for "heterogeneous" in there.
It's an open-source tool that allows us to prototype and evaluate new scheduling policies for heterogeneous platforms in a very agile manner. The tool is written in Python and is very easy to customize: a STOMP user can very easily create new scheduling policies, plug those policies into STOMP, and evaluate what happens if we want to schedule, say, a representative GNU Radio application on an illustrative heterogeneous processor.

In STOMP there are basically three main elements. We have tasks, which are, as you can imagine, the units of work; you can call them jobs, processes, threads, whatever. These are the things that are executed on our heterogeneous processor. In STOMP we actually talk about task types, and this is the most interesting part: we say this is an FFT task, this is a decoding task, this is a convolution task. Then we have servers, or processing elements, which are these blue boxes here; these are the processing units in your chip. You can have different types of processing elements: general-purpose cores, GPUs, different types of accelerators. This is all customizable. And then the most important part is the scheduler, this green box here, which basically takes tasks from this queue and tries to schedule them across the available processing elements, or servers. What we want is to allow the user to write, in a very easy manner, new scheduling policies that can be plugged into the scheduler for easy evaluation.

Tasks can arrive to this central queue either probabilistically, or we can generate traces of tasks from real executions. Tasks also have some attributes, for example the service time of a given task type on a given server type. So we say: this is an FFT task, and it has this service time on a general-purpose core, that one on a GPU, that one on an accelerator, etc.
We are also adding support for power consumption, so tasks will also have some power information associated with them. And the most important part is this: the user can extend the base scheduling-policy Python class and create his or her own scheduling policy in a very easy manner. This is the logic that instructs the scheduler what to do with the tasks at scheduling time. It's a very simple idea, but again, a very useful tool.

How does STOMP work internally? There are two components: one is the meta-scheduler, that red box there, called META, and the other one is the scheduler, this green box here. They communicate with each other through these two queues, a ready queue and a completed queue. The idea is very simple: META does some preprocessing on tasks. To connect with what you usually deal with, these can be, for example, GNU Radio blocks. META will preprocess the GNU Radio blocks to be executed in the system and will, for example, track dependencies; those blocks, those tasks that are ready are put in this queue for the scheduler to place each one on a processing element. When a task, a GNU Radio block, completes, the scheduler puts it back on this completed queue, so META is notified that this task is done and can keep tracking dependencies.

The input of STOMP comes, as you may imagine, in the form of directed acyclic graphs, DAGs, where a DAG may represent, let's say, an application, a GNU Radio flowgraph, and the nodes are the tasks, the blocks, that we have to execute. As I said, META does some preprocessing on these tasks. One interesting thing META does is to compute the rank associated with each task, each radio block, to execute. The rank is basically a metric that tells us how fast we should execute
that task. A high rank means we have to execute it very fast, maybe because there is some real-time constraint we have to meet; a lower rank means we have more slack, we can wait a little longer. So the tasks are ordered in this ready queue by rank. How do we compute the rank? That is implementation-specific: the user can define what rank means in his or her specific implementation. But usually rank is a function of the task priority, the slack, meaning the amount of time the task has left to complete, and, for example, the task's worst-case execution time, which is usually the time running the task on the worst processing element, let's say a CPU core. So this is one possible formula for rank; there could be many others. Also, in some cases we can enable a feature in META such that if a DAG didn't complete on time, if the DAG misses its deadline, META may decide to drop that DAG completely in order to reduce traffic and give other DAGs more chances to complete on time. So we have that option available too.

The scheduler, the other box in this diagram, is where, as I explained before, the user plugs in his or her scheduling policy. This is actually very easy: the only thing we have to do is extend the base scheduling-policy Python class. More specifically, what we have to provide is an implementation of this assign-task-to-server function, which basically tells the scheduler what to do with the tasks in the ready queue at scheduling time. In this very simple implementation, which is for illustration purposes only, we take the task at the head of the queue, this one here, and the scheduler tries to place that task on the fastest processing element.
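As a concrete illustration of a rank along these lines, here is a minimal sketch in Python. This is not the exact formula STOMP uses; it is one possible combination of the ingredients mentioned above (priority, slack, and worst-case execution time), with the convention that a higher rank means more urgent:

```python
def compute_rank(priority, deadline, arrival_time, now, worst_case_exec_time):
    """Illustrative rank metric: higher rank = more urgent.

    slack is the time left before the deadline, minus the worst-case
    execution time (typically the service time on the slowest server,
    e.g. a general-purpose CPU core).
    """
    slack = (arrival_time + deadline) - now - worst_case_exec_time
    if slack <= 0:
        # No slack left: the task is already critical, so give it
        # the highest possible rank.
        return float("inf")
    # Less slack and higher priority both push the rank up.
    return priority * worst_case_exec_time / slack
```

With this sketch, a task whose deadline is closer (less slack) gets a higher rank than an otherwise identical task with a looser deadline, and a task with no remaining slack jumps to the front of the ready queue.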
Let's say an accelerator. If that processing element is not available, then the task remains at the head of the queue; this is what is implemented here. Very easy stuff, just for illustration purposes; we want to make more intelligent scheduling policies, of course.

We can then easily configure STOMP with different parameters. For example, we can indicate which scheduling policy we want to use: in this case we are telling STOMP to go and look for a Python script called simple policy version 3 under the policies folder. Then we can configure the number of processing elements: we are saying that in this example we have 8 general-purpose cores, 2 GPUs, and 1 FFT accelerator. And then we define our tasks: for example, we say that we have tasks of type FFT, with different service times for the different processing-element types, and we may have other types of tasks, let's say decoders or convolutions or whatever. So it's very easy to configure. Let me show a very simple example of how this works.
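Before the worked example, the blocking, fastest-first behavior just described can be sketched roughly like this. The class layout and the method name loosely mirror what the talk describes (a base scheduling-policy class with an assign-task-to-server hook), but the signatures and the task/server dictionaries are illustrative assumptions, not STOMP's actual API:

```python
class BaseSchedulingPolicy:
    """Minimal stand-in for a base scheduling-policy class (illustrative)."""

    def assign_task_to_server(self, sim_time, ready_queue, servers):
        raise NotImplementedError


class FastestFirstPolicy(BaseSchedulingPolicy):
    """Take the task at the head of the ready queue and place it on the
    fastest *available* server for its type; if none is free, the task
    stays at the head of the queue (a blocking policy)."""

    def assign_task_to_server(self, sim_time, ready_queue, servers):
        if not ready_queue:
            return None
        task = ready_queue[0]
        # Candidate servers: idle ones whose type appears in the task's
        # service-time table.
        candidates = [s for s in servers
                      if not s["busy"] and s["type"] in task["service_time"]]
        if not candidates:
            return None                # head of the queue stays put
        best = min(candidates, key=lambda s: task["service_time"][s["type"]])
        best["busy"] = True
        ready_queue.pop(0)
        return (task, best)
```

For example, an FFT task with service times `{"cpu": 500, "fft_accel": 10}` would land on the FFT accelerator when one is free, and otherwise block at the head of the queue.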
Let's say we have an input DAG of 5 tasks, 5 GNU Radio blocks, the one shown there. What META, the meta-scheduler, does is try to determine the deadline for this DAG to complete, which in this case is defined as the execution time of the longest path, the critical path, here 0, 1, 3, and 4, in the worst case, meaning running, in this case, on the CPU. For that critical path running on the CPU, that is around 1,100 units of time. By the way, STOMP is unitless; the user defines the meaning of a unit of time. Then, at time zero, at the very beginning, META takes the first node, the root node there, and computes its rank using whatever rank formula has been defined; in this case we are using that formula, and for this specific initial case the rank is very high, infinite. I don't want to go into much detail here for lack of time, but for that first node META basically takes the node, computes its rank, and puts it in the ready queue, and that task executes after the scheduler places it on a processing element. Let's say that, since everything was available in the chip, that task was executed not on a CPU but on an accelerator: instead of taking 500 units of time, it took 10 units of time, which is good because that gives some time back to the rest of the tasks in the DAG. So at time 10, after that task completed, META updates the ranks of tasks 1 and 2, which are the ones that are now ready because their dependencies are satisfied.
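The deadline computation just described, the worst-case execution time along the DAG's critical path, can be sketched as follows. The five-node DAG and the per-node CPU service times below are made-up numbers chosen so that the 0-1-3-4 path sums to 1,100 units, echoing the example in the talk:

```python
from functools import lru_cache

def critical_path_time(dag, exec_time):
    """Worst-case time of the longest path through the DAG.

    dag maps each node to a tuple of its successors; exec_time gives
    each node's worst-case (e.g. CPU) service time."""
    @lru_cache(maxsize=None)
    def longest_from(node):
        # Longest completion time starting at `node`: its own cost plus
        # the costliest path among its successors.
        tail = max((longest_from(s) for s in dag[node]), default=0)
        return exec_time[node] + tail
    return max(longest_from(n) for n in dag)

# Hypothetical 5-task DAG: 0 -> {1, 2}, 1 -> 3, {2, 3} -> 4.
dag = {0: (1, 2), 1: (3,), 2: (4,), 3: (4,), 4: ()}
times = {0: 500, 1: 300, 2: 100, 3: 200, 4: 100}
```

Here the critical path is 0-1-3-4 with 500 + 300 + 200 + 100 = 1,100 units, which becomes the DAG's deadline.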
Using the same formula as before, we compute the ranks of task 1 and task 2, and in this simple example the rank of task 1 is larger than that of task 2, so the next one to go into the ready queue is task number 1, the decoder there. It will be scheduled by the scheduler based on the scheduling policy in use at that time. This process continues for all the tasks in the DAG until the full DAG completes. By the way, STOMP supports multi-DAG execution, meaning we can have multiple of these DAGs running at the same time. In that case the meta-scheduler keeps track of dependencies across all the DAGs and updates the ranks across all the DAGs at the same time; there is no sequentiality in that regard. That is another feature of STOMP.

Let me show you some preliminary numbers that we have generated with STOMP for an illustrative GNU Radio example. What we did is define a trace of 1,000 DAGs; let's think of these DAGs as 1,000 GNU Radio flowgraphs, in this case of two different types, with five and two tasks, with five and two blocks, so very simple ones. We assigned priorities 1 and 2 to these DAGs randomly, and we defined the deadline for a DAG to complete, as I explained before, as the length of the critical path running on the worst-case processing element, in this case the CPUs. We defined three simple task types: FFT, convolution, and decoder. The metric of interest is the number of DAGs, let's say flowgraphs, that met their real-time deadline during execution. So we created five simple scheduling policies.
They range from simple ones to more complex, more interesting ones. TS1 and TS2 are non-blocking scheduling policies, which basically means that the scheduler looks within a window of tasks in the ready queue, not just at the one at the head of the queue, and tries to schedule all of them: even if an earlier task couldn't be scheduled, because, say, no accelerator was available, it keeps looking within that window. This is why these two are called non-blocking. TS2 is a variation of TS1 where, at scheduling time, the scheduler not only traverses that window but also keeps in mind what was done with the previous tasks in it; knowing the scheduling decision for the previous task allows us to make slightly better decisions. In these two cases META is used for dependency tracking only; there is no rank computation. Then we have improved versions of TS2 that use META for both dependency tracking and rank computation: MS1, MS2, and MS3, where the difference is just the formula used to compute the rank. In MS1 we compute the rank from the task deadline, its average execution time across all available processing elements, and the priority. In MS2 we compute the rank as a function of the task deadline, the maximum execution time across the different processing elements, and the priority. In MS3 the rank is computed as a function of the available slack, the maximum execution time, and the priority. So, five different policies; let's see what we get. What we get is what we expected: MS3 is the most interesting of these five policies, because it is the one that allows DAGs to complete in time in the most cases.
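The non-blocking idea behind TS1 can be sketched in a few lines. This is an illustrative toy, not the real TS1 implementation; the task and server dictionaries are assumptions made for the example:

```python
def schedule_window(ready_queue, servers, window=3):
    """Non-blocking, windowed scheduling in the spirit of TS1: scan up to
    `window` tasks from the head of the ready queue; a task that cannot
    be placed (no free, capable server) is skipped rather than blocking
    the tasks behind it."""
    assignments = []
    placed = []
    for task in ready_queue[:window]:
        free = [s for s in servers
                if not s["busy"] and s["type"] in task["service_time"]]
        if not free:
            continue                  # skip: do not block the rest of the window
        best = min(free, key=lambda s: task["service_time"][s["type"]])
        best["busy"] = True
        assignments.append((task["id"], best["id"]))
        placed.append(task)
    for t in placed:
        ready_queue.remove(t)
    return assignments
```

So if the head task needs an accelerator that is busy while the second task only needs a free CPU core, the second task is dispatched immediately instead of waiting behind the first, which is exactly the difference from the blocking policy shown earlier.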
What this chart is showing is the percentage of DAGs, let's say radio flowgraphs, that complete within the deadline, as a function of the five different policies, for different arrival rates of the DAGs. As I said, we have two DAG priorities, priority 1 and the higher priority 2. For example, what we see here is that MS3, for this particular case, allows one hundred percent of the high-priority DAGs to complete within their deadline, compared to the other policies, which is good. And by the way, this is a very simple policy; we are not doing rocket science. Some simple tweaks can help us optimize the scheduling.

Since we have some time, let me show this video recording of how to run STOMP. Nothing that will impress you, it's just a two-minute video. As I said, we use these traces of DAGs: we have a script called trace generator that generates a synthetic trace of DAGs, in this case 1,000 DAGs that arrive at different arrival times. The second field is the DAG ID, and this one is the DAG type. So we have 1,000 DAGs that arrive to our simulated system: a synthetic, very simple trace; we can generate it from real executions too. Then we configure STOMP using this JSON file: we indicate, for example, the policy that we want to use; we indicate, as I said before, what processing elements we have in our system; and lower down we also indicate the different task types: FFTs, convolutions, and I think also decoders. One way to execute STOMP is with the STOMP main script, but there is a more convenient script called run all, which runs multiple configurations of STOMP that can be easily defined at the beginning of that script: we can say which policies we want to evaluate, what the arrival times are, etc.
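As an aside, the synthetic trace just described carries, per line, an arrival time, a DAG ID, and a DAG type. The comma-separated layout below is an assumption for illustration, not STOMP's exact trace format:

```python
def parse_trace_line(line):
    """Parse one line of a hypothetical synthetic DAG trace with the
    three fields mentioned in the talk: arrival time, DAG ID, DAG type."""
    arrival, dag_id, dag_type = line.strip().split(",")
    return {"arrival_time": int(arrival),
            "dag_id": int(dag_id),
            "dag_type": int(dag_type)}

# A toy three-DAG trace (made-up values).
trace = ["0,0,1", "137,1,2", "251,2,1"]
dags = [parse_trace_line(line) for line in trace]
```

Each parsed entry can then be injected into the simulated system at its arrival time.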
So this script executes multiple combinations at the same time, which is very convenient. What the video is showing is just how we call the script: we can ask for verbose output, dump everything to a CSV file, and this is the input DAG trace that you have to use. Right now, in this example, there is only one combination, so we are actually executing one instance of STOMP and passing that whole JSON string to configure that specific execution. It finishes and generates an output file that we can easily process with another script, which we call collect, that just parses the output; in this case we print out, for example, the average response time of all the DAGs for that specific simulation (we can print other things too). So it is very, very simple; you can learn how to run this in five minutes.

So let me wrap up here. STOMP is an effort that is in active development, and we are considering some new features to add: for example, support for a more complete input-trace format, and we want to generate more statistics. Some interesting things I want to mention: as part of these new features we will incorporate support for power modeling, so it's not just about performance and throughput but also power efficiency. And something we think is very interesting: we want to explore more machine-learning-like scheduling policies. The five policies that I showed in the example before are relatively simple heuristics; we think we can exploit some machine-learning techniques to do even better. We don't necessarily have to go to complex deep learning.
We can talk about, I don't know, simple decision trees, for example; we want to go into that area. I am emphasizing this because it is an interesting area where everybody can eventually collaborate or contribute, if interested. We believe that machine learning can provide some very good benefits when it comes to scheduling tasks on heterogeneous platforms. We also want to move from the abstract to the more concrete: we are adding support to characterize real applications, for example real GNU Radio workloads, and generate DAG traces from those real executions instead of just creating synthetic DAG traces. But STOMP is already a very nice tool that provides plenty of opportunities to explore the problem domain and draw conclusions, so please check it out, play with it, and provide your feedback. You can check out the dev branch if you want, which provides more leading-edge features. Please let us know if there is any feedback you can provide, or any way in which you think you can contribute or collaborate. So thank you very much. You won't see my face again today; this is my last talk.