I'm muted. Okay. So, hi, everyone. I'm Joshua Mack. This is Nirmal Kumbhari, and we're talking about automating programming and development of heterogeneous SoCs with LLVM tools. To give a quick background on who we are, we're a collaboration between Arizona State University, the University of Arizona, Michigan, Carnegie Mellon, and two industry partners, ARM and General Dynamics. The people specifically behind the work in this presentation, other than myself and Nirmal, are listed up here, because without the work of the team together, we wouldn't actually have results to show today.

Some background on what this collaboration is trying to do: historically, there's always been a performance gap between general-purpose CPUs and ASICs, and you can span that gap with different kinds of heterogeneous platforms. Ultimately, all our collaboration is trying to do is close that gap a bit. We want to build something that is still pretty general, still easily usable by programmers, but that can approach the ASIC performance trend line. And the way we want to do that is by building a heterogeneous chip. This chip will have a variety of different accelerators alongside standard traditional CPU cores, and what we want with this chip specifically is for it not to try to be "let's compute all the things." We want it to be focused and have a purpose. So we're not trying to build a general-purpose heterogeneous chip. We're trying to take a domain of applications, like signal processing, and see if, by restricting the set of applications you're supporting, you can come up with some new, clever optimizations in how you then build for that chip.

Where the real emphasis is in this project is that we want to focus on making the tooling as easy to use as possible, because from a custom hardware point of view, you can pretty easily build something that no one wants to use. So the real selling point is going to be not that you can build a chip, but that you can provide a workflow that makes developers want to use it as well. And ultimately, what we want out of this collaboration isn't just a single chip, either. We want some kind of repeatable methodology: given a new domain, how do you go about finding what accelerators to include and building a toolchain that lets you target them?

Traditionally, you can think of computing as a three-layer cake, where you have the hardware on the bottom, some resource management like an operating system in between that provides nice interfaces for developers' apps, and the applications sitting on top. When you try to apply this to a DSSoC (a domain-specific system on chip), some questions come up. On the hardware side: what accelerators do you want to include? It's a bigger question now than just how many cores you want or what frequency they're going to run at. What fundamental accelerators, what pieces of your domain, are worth accelerating in the first place? How do you launch work onto accelerators in some kind of standardized way, given that this chip is going to need to handle a variety of accelerators and we want the interfaces for that to be simple? What exactly are we scheduling?
With a lot of heterogeneous programming, you might say "we've optimized this for the GPU," and what that does is statically, at compile time, link in support for the GPU, and it always runs there every time you run the binary. But in a world where every application is heterogeneous and all of them share the resources of your chip, you might want a smarter approach, with more of the flexibility you see in CPU scheduling, so that you can fall back on different implementations on different devices depending on the system workload. Another question is how you integrate new applications into this framework, because, as we said, usability is incredibly important. And along the same lines of usability, it's worth mentioning: how do you debug accelerators? There's no stack, no instruction pointer, no registers necessarily in a generic accelerator, so how do you provide interfaces that make that easy? In this talk, we're really only going to focus on how you integrate new applications, but it's worth noting that we have people thinking about all of the questions on the previous slide.

For the flow that we're presenting today, we have a prototype toolchain that uses dynamic tracing to collect an application trace, then uses that trace to recognize relevant kernels in your code and perform additional analysis that allows us to re-target them for different accelerators in the system without the user needing to intervene. To step through the process: we start with an open-source toolchain called TraceAtlas, which has been worked on heavily by Richard Uhrie at ASU. TraceAtlas is a whole toolchain for collecting runtime-based dynamic application traces from LLVM code. The way it works is that there are libraries implementing the tracing methods, and you inject those tracing calls through an LLVM opt pass; you then compile the binary and run it, and that produces a complete application trace. And because of some dynamic trace compression built on zlib that we've implemented, along with some other clever techniques for choosing what and what not to trace, we're able to make this usable for a very wide range of applications, not just trivial small examples, without running out of disk space. Compared to the state of the art, it's between a 2x and 1000x reduction in time to collect the trace and a 2x reduction in on-disk trace size.

Once you have the trace collected, you can analyze not just what the program's behavior could statically be, but what the behavior actually was. To do that, we use this concept of kernels, which are basically groups of basic blocks that are highly correlated and recur very frequently throughout the program. And there's a notion of kernel affinity, which is basically the transition probability between any two kernels in your original source program. Because we have this application trace, we can determine empirically what all of the affinities were in a given application execution (a minimal sketch of that kind of computation appears below). Then we can cluster all of the related basic blocks together. This gives a series of collections of basic blocks, where you end up with, say, these blocks here that recur very frequently in the source program.
So those are related to each other, and then these blocks recur very frequently, and the other blocks are not considered kernels. What we can do from there is say, if the original source program had looked something like this, where these were the two kernels that were labeled before, we can think of it as a directed acyclic graph representation, where the transitions at each kernel or non-kernel boundary get grouped together into a node of LLVM basic blocks. And on top of that, because we know the kernels are important sections of the code, we can run additional analysis on them. If we can detect that, say, this kernel is a fast Fourier transform, then based on our knowledge of the DSSoC we're targeting, we can add in support for other platform invocations automatically, without the user having to define what those are. And similarly, if this is, say, forward error correction, we can do a similar process. The result is an acyclic graph where the actual platform you dispatch on isn't determined at compile time. You build support for every possible platform for each node into the binary, and that gives you a fat binary structure. Together with the DAG metadata, that is what we output to be considered an application.

Then, to run this application, we could try to target Linux directly, but it was more advantageous for us, getting started, to use a user-space application runtime. The way this works is that we essentially allocate pthreads to act as schedulable resources in our DSSoC system. We manage a ready queue of tasks: you inject some applications, and because of the graph structure you can tell all the dependencies. We're then able to implement custom resource management techniques involving different heterogeneous scheduling heuristics that wouldn't be present in a standard Linux kernel scheduler. So the question is, why did we do this in user space? Because it allows us to iterate a lot faster while we're still in this early pre-silicon development phase. It lowers turnaround time, and it makes adding accelerators easy, because you can directly use memory-mapped registers or whatever interface you need to access whatever is in your design. Really, it just lets us co-evolve all the software the developers are writing along with the hardware designs, all in one easy environment. With that together, we end up with this overall flow, where the left-hand side is the dynamic tracing analysis and the right-hand side is the runtime.

To give an idea: before we had the compilation process going, we had some handwritten applications, asking, what do we want applications to look like? This is one of the test apps we had written, where the source code on the left doesn't actually contain any of the original knowledge about what the sequence of calls needed to be to recreate the original application. Each section of the code has been segmented out into basically a stateless function, with all of the arguments and state passed in as memory references. What we can do with this is have the different nodes in the DAG, and then, if a particular node supports multiple platforms, we can have multiple different functions that it can dispatch to (sketched below, after the affinity example).
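(To make the kernel-affinity idea above concrete, here is a minimal sketch of the kind of computation involved: counting transitions between basic-block IDs in a trace to get empirical transition probabilities. This is purely illustrative and not TraceAtlas's actual algorithm; for that, see the TraceAtlas paper mentioned during the Q&A.)

```cpp
// Illustrative only: estimates empirical transition probabilities between
// basic blocks from a trace of block IDs.
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using BlockId = uint64_t;

std::map<std::pair<BlockId, BlockId>, double>
estimateAffinity(const std::vector<BlockId> &trace) {
  std::map<BlockId, uint64_t> outCount;                  // departures from each block
  std::map<std::pair<BlockId, BlockId>, uint64_t> edges; // observed transitions
  for (size_t i = 0; i + 1 < trace.size(); ++i) {
    ++outCount[trace[i]];
    ++edges[{trace[i], trace[i + 1]}];
  }
  // "Affinity" here: empirical probability of going from block a to block b.
  std::map<std::pair<BlockId, BlockId>, double> affinity;
  for (const auto &[pair, n] : edges)
    affinity[pair] = double(n) / double(outCount[pair.first]);
  return affinity;
}
```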
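(And here is a minimal sketch of the stateless-function shape just described: all state passed in as memory references, one entry point per supported platform, and a per-node dispatch table of the sort a fat binary could carry. All names are hypothetical stand-ins, not the project's actual API.)

```cpp
// Hypothetical sketch of a fat-binary DAG node with multiple platform targets.
#include <cstddef>

struct NodeArgs {     // everything a node touches comes in through here
  const double *in;
  double *out;
  size_t n;
};

void dft_cpu(NodeArgs *a);       // plain CPU loop implementation
void dft_fft_accel(NodeArgs *a); // drives the FFT accelerator instead

// One table like this per node; the runtime picks an entry at dispatch
// time based on which resource is free, not at compile time.
using NodeFn = void (*)(NodeArgs *);
struct NodeImpl { const char *platform; NodeFn fn; };
constexpr NodeImpl dft_node_impls[] = {
    {"cpu", &dft_cpu},
    {"fft_accel", &dft_fft_accel},
};
```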
And then this couples with a JSON-based DAG representation that includes information about essentially all the memory requirements of the application: what do variables need to be allocated as? What values do they need to be initialized with? Along with the dependency structure and which arguments each kernel requires. With that together, we're able to run it through our system. But ultimately, we don't want to rewrite everything to look like that. We want to take a simple C code, used as the example here only because it's small enough to fit on a slide, and turn it into something that can run in the system. So we're going to work through an example with this, and for reference, this is the output you get if you just compile and run it.

Stepping through the high-level process from earlier, we start by compiling to intermediate LLVM IR. Once we do that, we renumber the basic blocks with a very simple opt pass that basically just allows us to coordinate which basic blocks belong to which kernels after we've done our analysis phase. So we see here that these basic blocks are labeled. We then instrument with the dynamic tracing calls: we do things like dump IDs of basic blocks on basic block entry and dump loads and stores, and that gets linked in in the back end (a sketch of such an instrumentation pass appears below). We compile the instrumented binary and run it, and running that binary produces the output trace of the application. We can then use TraceAtlas's kernel detection mechanisms to detect which basic blocks in the original code are considered kernels by its definition, as well as what the producer-consumer relationships were between those kernels. So we can see here that we have two kernels, basic blocks one, two, three and five, six, seven, and that kernel one consumes from kernel zero.

Together with all of this information, we're able to take the original LLVM IR, unchanged from the user, and use it to refactor the application: essentially, outline each of the sections that need to be outlined and create a DAG-based application. In this case, the kernels are one, two, three and five, six, seven, and the non-kernel blocks in between corresponded to these sections of the original code, where these two nodes were clustered as kernels because they were considered hot enough, but, say, this for loop here wasn't, so it was grouped with the rest of this here. At the same time, we analyze the memory requirements of the application. We essentially build a simple table where we determine: how big does each variable need to be? If it's initialized with anything within a reasonable search distance of where the allocation happened, can we resolve that to a constant? If it's a pointer, can we resolve any malloc calls to constants? From there, we outline each of those sections of the code into different nodes. The changed LLVM IR essentially looks like this, where nothing happens in between each of the node calls. Then we generate the JSON-based DAG that calls those same nodes in the same sequence, along with allocating all of the variables that are necessary (an illustrative example follows below). With these two together, we compile this new LLVM IR into a shared object, and with that shared object coupled with the JSON, we can hand it off to our runtime and run it through the flow.
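(For the instrumentation step, here is a minimal sketch of what a basic-block-entry tracing pass can look like against LLVM's new pass manager. The hook name TraceBB and the pass itself are illustrative stand-ins, not TraceAtlas's actual pass or symbols.)

```cpp
// Sketch: insert a call to an external tracing hook at each basic block entry.
#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"

using namespace llvm;

struct TraceBBPass : PassInfoMixin<TraceBBPass> {
  PreservedAnalyses run(Function &F, FunctionAnalysisManager &) {
    Module *M = F.getParent();
    LLVMContext &Ctx = M->getContext();
    // void TraceBB(i64 id) -- assumed to be provided by a tracing runtime
    // library that gets linked in at the back end.
    FunctionCallee Hook = M->getOrInsertFunction(
        "TraceBB", Type::getVoidTy(Ctx), Type::getInt64Ty(Ctx));
    uint64_t Id = 0;
    for (BasicBlock &BB : F) {
      IRBuilder<> B(&*BB.getFirstInsertionPt());
      B.CreateCall(Hook, {B.getInt64(Id++)}); // dump block ID on entry
    }
    return PreservedAnalyses::none();
  }
};
```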
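(And to give a feel for the JSON side, here is an illustrative sketch of the kind of information the DAG file carries: the node sequence, per-node platform support, variable sizes and initial values, and the producer-consumer edges. All field names are made up for illustration; the actual schema is whatever the toolchain emits.)

```json
{
  "nodes": [
    {"name": "node0", "platforms": ["cpu"]},
    {"name": "node1_dft", "platforms": ["cpu", "fft_accel"]},
    {"name": "node2", "platforms": ["cpu"]}
  ],
  "edges": [
    {"producer": "node1_dft", "consumer": "node2"}
  ],
  "variables": [
    {"name": "samples", "bytes": 4096, "init": 0.0}
  ],
  "arguments": {"node1_dft": ["samples"]}
}
```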
Just to validate that all of this has actually preserved the functionality, we'll run five instances of this app here, and we'll note that the output matches the output from before. With that, I'll hand it over to Nirmal to explain how we then use this for more advanced applications.

Thank you, Josh. So as part of this project, we have created a user-space scheduling framework to rapidly evaluate the different solutions we're coming up with for the target DSSoC. This framework is designed to run in user space, which makes it portable across different virtual and hardware platforms. For today's discussion, I'm going to use this particular DSSoC design as our target. It's composed of a quad-core ARM processor and two FFT accelerators. The FFT accelerators communicate with main memory using DMA IPs, which are mainly used for bulk data transfer, and the ARM processor uses a memory-mapped interface to configure our accelerators and IPs (a sketch of what that can look like from user space appears below).

We use our framework to emulate this target DSSoC on top of a real hardware platform, the ZCU102, and we also demonstrate its portability by running the same platform on top of Xilinx QEMU, which is a virtual platform. The ZCU102 board consists of a Zynq MPSoC, which has an on-chip quad-core ARM processor and on-chip programmable logic. We use the programmable fabric of the Zynq SoC to implement our accelerators and IPs, and we use the ARM processor as-is to emulate this particular target DSSoC. The benefit of the ZCU102 is that it gives us more realistic performance estimates than a virtual platform, and it also lets us perform full-system functional validation, especially validating the implementations of our accelerators and IPs. As our virtual platform, we use Xilinx QEMU, which is provided by Xilinx; it emulates an ARMv8 ISA on top of the x86 ISA and runs as an independent process in the host operating system. We implement our accelerators and IPs in SystemC, also running as an independent process in the host operating system, and these two processes communicate with each other using inter-process communication. The benefit of Xilinx QEMU over real hardware is that you don't need the hardware at all, and it can be used by multiple application developers in parallel; basically, it removes the need for real hardware.

Next, I'll talk about how we use our framework to validate the toolchain that Josh introduced. We take the sample radar correlator code, and this is the traditional flow of compilation and execution of standard C code: we compile it and generate its output, which is the lag value. Then we send this same C code through the toolchain: TraceAtlas, which is used for kernel detection, and the fat binary and code refactoring stage, which takes the kernels identified by TraceAtlas and creates the shared object for the binary and the DAG representation of the application. We then send these files through our user-space framework and compare the output of the monolithic C code with the output generated by our framework. If they're equal, we conclude that the application has been integrated successfully with our user-space framework.
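(As a side note on what "configure the accelerators over a memory-mapped interface" can look like from a user-space runtime on Linux, here's a minimal sketch using /dev/mem. The base address and register layout are invented for illustration; real offsets would come from the hardware design.)

```cpp
// Sketch: map an accelerator's register block into user space and kick it off.
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

constexpr off_t  FFT_BASE = 0xA0000000; // hypothetical PL base address
constexpr size_t MAP_LEN  = 0x1000;

volatile uint32_t *map_fft_regs() {
  int fd = open("/dev/mem", O_RDWR | O_SYNC);
  if (fd < 0) return nullptr;
  void *p = mmap(nullptr, MAP_LEN, PROT_READ | PROT_WRITE,
                 MAP_SHARED, fd, FFT_BASE);
  close(fd); // mapping remains valid after the fd is closed
  return p == MAP_FAILED ? nullptr
                         : static_cast<volatile uint32_t *>(p);
}

void start_fft(volatile uint32_t *regs, uint32_t n_points) {
  regs[1] = n_points;            // hypothetical size register
  regs[0] = 1;                   // hypothetical control register: go bit
  while ((regs[2] & 1) == 0) {}  // poll hypothetical status register
}
```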
Once we complete the integration of an application with the framework, we want to see how we can use this framework and the applications to do design space exploration of the target DSSoC. To do that, we use the real hardware platform and create a realistic workload using benchmarks from the Wi-Fi and pulse-Doppler radar domains. Our emulation framework supports two operation modes: validation and performance. In validation mode, we can inject multiple instances of applications simultaneously, whereas in performance mode, the applications are injected in a time-separated manner, where the time separation can be periodic or random (both modes are sketched below). We also have a feature where a user can provide an input trace file to create the workload. For our target DSSoC, we assume our resource pool is composed of three CPU cores and two accelerators.

On this slide, we use our platform and our toolchain to do design space exploration across different configurations and heuristics. On the left-hand side, we run our emulation framework in validation mode. On the x-axis, we iterate over different configurations by changing the core count and accelerator count; the workload is created by injecting one instance of each application that I showed earlier; and on the y-axis, we have the execution time of the given workload. So basically, we explore which configuration suits our target applications best, and then, depending on the performance and energy requirements, we can select any of these configurations for further analysis. For this particular plot, we selected a configuration composed of two CPU cores and two FFT accelerators. This analysis was done in performance mode, where applications are injected in a periodic manner for 100 milliseconds, and on the y-axis, we have the execution time of the generated workload trace. In this plot, we're evaluating different scheduling heuristics.

What I want to convey with this slide is that we have a complete software stack and scheduler framework that can be used for DSSoC design space exploration during the initial phase of DSSoC development. Beyond that, in this project we've been trying to develop an ecosystem of tools for early DSSoC development. One of the tools we've developed is DS3, a domain-specific, discrete-event simulator. It can be used for evaluating different scheduling algorithms and power management policies, and for design space exploration of energy, performance, and area trade-offs. The benefit of this tool is that during the early phase, when we don't have applications ready for the target DSSoC, or the accelerators haven't even been implemented yet, it can be used to make all the system-level design decisions as early as possible, and depending on the outcome, we can target the required accelerators for the emulation platform. Now I'll hand it back to Josh to go over the demo.

So I'm not going to trust the inline video to even play there.
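(Before the demo, a rough sketch of the two injection modes just described, assuming a hypothetical submit_app_instance hook into the runtime: validation mode launches everything at once, while performance mode spaces arrivals out periodically or randomly.)

```cpp
// Illustrative sketch of the two workload-injection modes; names invented.
#include <chrono>
#include <random>
#include <thread>

enum class Mode { Validation, Performance };

void submit_app_instance(int id); // hypothetical: enqueue one DAG application

void inject_workload(Mode mode, int instances, int period_ms, bool randomize) {
  std::mt19937 rng{std::random_device{}()};
  std::uniform_int_distribution<int> jitter(0, 2 * period_ms);
  for (int i = 0; i < instances; ++i) {
    submit_app_instance(i);
    if (mode == Mode::Performance) { // time-separated arrivals
      int wait_ms = randomize ? jitter(rng) : period_ms;
      std::this_thread::sleep_for(std::chrono::milliseconds(wait_ms));
    }
    // Validation mode: no delay, all instances are injected simultaneously.
  }
}
```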
What we're going to see in this demo is the full compilation flow with the radar correlator example mentioned earlier. First, we go through the application and verify that it's a very basic, standard C application: there's some file I/O up at the top, some DFT calculations, some pairwise multiplication, an IDFT, a for loop that finds the maximum, and then you print out the value at the end. No pragmas, no anything. Then we pass it through the compilation flow. What it's doing here is collecting the execution trace, and that also happens to give us some reference points for the DFT1 and DFT2 execution, as well as the output value of 0.2516. Now it's going through the kernel detection phase, analyzing the trace, and extracting the producer-consumer relationships. There's a little bit of time dilation here to save time in the presentation. After that, it refactored the application into blocks of non-kernel and kernel code, and as we can see, it just alternates kernel, non-kernel. Then it analyzed each node in the graph to see if it could swap in an optimized implementation. In this case, nodes seven and nine were detected as DFT kernels, so we were able to swap in an FFT accelerator invocation that we can use instead.

Here we're copying the output shared object and JSON to our hardware platform. Then we run it and show that the modified application is now able to dispatch onto our hardware accelerator without the user intervening, and we can see that those two DFT kernels are much faster. And just to show that this scales, we do 10 back to back; similarly, they all dispatch successfully, they don't step on each other's toes with multi-threading, and all of the outputs individually remain correct. To illustrate this with a diagram, we generate a Gantt chart that shows the activity on each core as well as the accelerators, and we can see that the FFT accelerator sees some activity alongside the existing standard CPU code. And so, yeah, that's the end of the demo. Back to it, yeah. I think the takeaway here is that this was no-human-in-the-loop, standalone integration from C code to running on an accelerator, and while it's definitely at an early phase, we're excited to see where it can go.

In conclusion, we're pretty happy with what we've accomplished so far. Having any kind of vertically integrated software and hardware stack is a bit of a challenging task. For the upcoming releases, we hope to have more mature SystemC and/or RTL accelerators available in the GitHub repository for everyone to mess with; the current version only has CPU support. We also want to improve the integration of our compiler toolchain with richer and richer applications. And with that, we'll take any questions.

[Audience] ...of the kernels, is that to remove cycles from the graph? — So the question is: for determining kernels, is that to remove cycles from the graph? Are you asking whether determining the boundaries of the start and end of some recursive process keeps the graph from cycling back around? [Audience] Yes, and what happens if your kernel...?
So the goal is actually to have parallelization of kernels. To the second part there: with this idea of producer-consumer relationships, the DAG might be a lot more complicated than a simple linear chain, and the hope is that, with knowledge of producer-consumer relationships, you can say, oh, this kernel never consumes or produces anything that this other kernel needs; they both consume from some common ancestor, and so we can schedule them simultaneously. But to your first question: the clustering of the kernels is essentially more so that we can identify which areas of the code are important. We're not necessarily trying to eliminate cycles entirely; we just know that this area of the code is important and warrants further analysis. Yeah, I'm sure there's someone working on that.

[Audience, partly inaudible] Oh, how do we calculate the affinity values? Okay, I'd actually defer you to the TraceAtlas paper, up on arXiv. It goes into excruciating detail about all of the calculations behind the process.

So the question is, how do we actually identify the kernels? At this stage, the answer is, yeah, we're manually identifying them. The hope is that there are other people working on better, more generalizable ways to do kernel recognition, and part of the hypothesis is that maybe, by restricting to a domain of applications rather than all applications, there are recurring ways that people code things within a particular area. [Audience] They're using the dynamic traces, but then doing pattern matching on the traces to figure out what kind of code it is. It's a probabilistic match, so if you can get a high-probability match with some existing accelerator, then you know what kind of accelerator it is. — Okay, and the comment here was that David Brooks's group at Harvard is taking a similar approach to kernel detection, with probabilistic pattern matching of kernels.

[Audience] So assume you don't have a library of hand-tuned, optimized kernels. There also exist some solutions where you start from some code and it lowers to a programmable accelerator; maybe you can look at an FPGA as a programmable accelerator. Would that still be a useful option, or would you just lose too much efficiency because the tools aren't as good as the hand-tuned kernels? — So the question is: assuming you don't have a library of existing optimized kernel implementations and you still want to target an accelerator automatically, is there any kind of process where you could essentially generate an accelerator that's applicable for a given kernel? I don't know that any work has been done on that, but I would assume that some of the HLS toolchains would have a role to fill there, and I think it would be pretty interesting to see what could come out of that.