Okay, if everyone wants to get their seats, I'm going to get started. So I'm here to talk about a project in our group at Illinois called HPVM, Heterogeneous Parallel Virtual Machine. This is essentially meant to be a parallel compiler infrastructure that's built on top of LLVM. The goal here is to try and get both performance and portability for applications running on heterogeneous parallel systems. First, I should say, by the way, that I have written almost none of the code in this project. This was all done by my PhD students. Unfortunately, the lead students could not be here. One of them has just applied for a green card, so he can't leave the country, and the other one has some health issues, so she couldn't travel. I am holding the fort on their behalf. Maria Kotsifakou and Hashim Sharif are the two main leads on this, along with Prakalp Srivastava, Adel Ejjeh, and the others listed here, and we also have two other faculty members, Sarita Adve and Sasa Misailovic, involved in this work.

So I don't think I have to tell you that heterogeneous SoCs are pretty ubiquitous, and moreover, it's not just that they're ubiquitous right now, but very important and increasingly important applications are going to be possible only because of these kinds of SoCs, because you really need dramatic improvements in performance and energy efficiency to make these kinds of applications possible. In order to make those possible, you need SoCs that can be programmed as easily as possible. In fact, the previous talk gave some good motivation for why you need both the performance and energy efficiency and also the programmability. These are some example domains of these kinds of problems. We are actually working with an autonomous car application that I'll say a couple of words about, and we're starting to work with a mobile robot for agriculture that's being developed at Illinois, also as example edge applications. In a project led by IBM, we are using a model application of an autonomous vehicle that has multiple different kinds of application components: a neural network for image processing, an FFT, a Viterbi decoder, and some control logic. The goal of the project is to be able to do full-stack development of both the hardware design, starting from the application domain code, and a programming stack for that. There are actually several talks happening today and tomorrow at FOSDEM about different parts of this, in the RISC-V room and in the software defined radio room, and my talk is in this room here. Our role in this project is to use HPVM for programmability. We're basically, in some sense, looking at the compiler infrastructure on the left, and the development environment and programming languages, and how to implement them easily on a fairly custom SoC design with a wide range of different accelerators and a host CPU.

So just to motivate that, I'll use a slightly different SoC as an example, but the underlying problem that we're trying to tackle is that on a single SoC, first you have a whole number of different hardware instruction sets, which you want to be able to target. But that's not the only problem. You also have a number of different parallelism models; there are different kinds of parallelism happening in the different components of the SoC. Even worse, you have incompatible memory systems.
You may have a cache-coherent memory hierarchy in one, only local scratchpad memory in another, and a DMA engine or some other data-movement mechanism between the different accelerators, and you want to be able to take an application and run it on an SoC like this. What can make it even worse is that different SoCs have different combinations of this hardware, and that makes the portability problem far worse. Now, some domains have the portability problem and some don't. Mobile phones have it to a very large extent, because you want to run an app on many different kinds of mobile phones. Think about Android, for example: how many different SoCs power different Android phones? Other SoCs don't necessarily have this problem if you're really custom compiling the stack for a particular SoC design. But you do always have this problem of heterogeneity. So we believe that the key to achieving performance and portability at the same time on this kind of system is to have well-designed abstractions for the underlying heterogeneous system, for the underlying parallel hardware, and to be able to use those abstractions to develop the compiler infrastructure, the tooling, and the whole software stack around them.

Before I go into the abstractions that we use, I just want to say a couple of words about the current state of what you have with LLVM. This is not a comprehensive slide, but it's I think fairly representative of what choices are available today if you want to build parallel compilers with LLVM-based systems. So LLVM itself, of course, as you all know, has primarily targeted vector parallelism, so short SIMD kinds of things: SSE, AVX, things like that. Polly does polyhedral transforms and some scheduling. Tapir is a project from MIT that targets shared-memory systems. There are compilers for languages like OpenMP and OpenCL and CUDA; each of those is pretty specific to the particular language, and they don't really try to generalize in terms of the languages that they support. There are other projects like LegUp for FPGAs and TensorFlow for TPUs and GPUs. MLIR is certainly the most recent and perhaps most well-known addition to this list. It supports tensors especially well and has strong support for polyhedral transforms for high-dimensional, or actually any-dimensional, tensors. I would argue that none of these are really attempting to capture a diverse range of heterogeneous parallelism, which is what we think you need to build a flexible compiler infrastructure.

That's what the goal of our work has been in HPVM: to develop a common parallel abstraction for the diverse range of parallelism that's available, and then use that to develop the programming environment. So that includes, in our case, a compiler IR, which is an extension of LLVM. Second, a virtual instruction set: just like LLVM itself is both a virtual instruction set and a compiler IR, and you can actually ship code as LLVM, as Apple does for the Watch and Apple TV and iPhone and so on, you can ship code in this virtual instruction set form as HPVM code as a way to achieve portability. Third, we can also use this for runtime scheduling in the system. So this is a high-level view of what an HPVM infrastructure looks like. The idea is that you have front ends for potentially a variety of languages.
Right now, we have front ends for essentially an extension of C with HPVM intrinsics, and then a front end for Keras for neural networks. These translate into the HPVM virtual instruction set, or IR, and then you have a variety of back-end translators that translate the HPVM IR to different hardware targets. This is schematic; we don't support all of these yet, but we do support a variety, and I'll come to those in a minute. The point now is that we can use this HPVM representation for achieving portable object code by using it as a virtual instruction set. We can use it for building retargetable compiler infrastructure, so that you can do both optimization and code generation in a common compiler infrastructure for a variety of heterogeneous hardware, and we can use it for runtime scheduling.

So that's the high-level picture of what we're trying to do, and the abstraction that we've developed in the HPVM project looks like this. It's essentially a data flow graph, but with side effects, so it's not pure data flow, and that's important because many accelerators today actually support some form of shared memory. Even GPUs are starting to do that, other accelerators are likely to do that in the future, and shared memory is very important for achieving good performance on these systems. A single node in this data flow graph is essentially LLVM code, so it can be a mixture of scalar and vector code. Each node is represented as an LLVM function, which can call other LLVM functions if you need to. Otherwise, it's a standard data flow graph except for one additional wrinkle, which is that we make this graph hierarchical. A single node can itself be an entire data flow graph of its own, so that you get a form of parallelism hierarchy. The reason for this is that it's very common to have multiple levels of parallelism in heterogeneous systems. You might have coarse-grained parallelism across multiple different processing elements, different GPUs or different accelerators and the host processor, but you also have extensive parallelism within a single processing element like a GPU or an FFT accelerator or something else, and that gets captured nicely with this hierarchical graph.

So the idea now is that the nodes here essentially represent either coarse-grained or fine-grained computational tasks. Graph edges represent logical data transfer, data movement from a source node to a sink node. It's logical in the sense that if two nodes get mapped to the same device, you don't actually have to do the physical data movement, but logically there's a copy happening in that case. Then you also have loads and stores, which do implicit communication, or implicit data movement, between the different nodes, and it's hierarchical, as I just said. One more important aspect of HPVM is that a single node is like a GPU kernel. Just like a GPU kernel gets instantiated into a grid of threads, which might be one-dimensional, two-dimensional and so on, an HPVM data flow graph node can be given an index, or a vector of indices, and it is instantiated into a set of parallel instances that execute in parallel at runtime. For example, if you give it a single index with range n, you get n parallel instances, or threads, executing the same code at runtime, and we require that these instances be independent of each other. We rely on the programmer or the front end to ensure that's true.
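To make the node and instance model a little more concrete, here is a minimal sketch in the style of HPVM-C, the C extension mentioned above. The header name and the intrinsic-like calls (__hpvm__createNodeND, __hpvm__bindIn, __hpvm__getNodeInstanceID_x, and so on) are meant to be illustrative of the interface, not a verbatim reproduction of the released API.

```c
/* Sketch of a data-parallel HPVM-C style node, in the spirit of the
 * abstraction just described. Names approximate the HPVM-C interface;
 * exact signatures may differ in the actual release. */
#include <stddef.h>
#include "hpvm.h"   /* hypothetical HPVM-C header providing the __hpvm__* calls */

/* Leaf node: one dynamic instance per element, like a GPU kernel thread. */
void vecadd_leaf(float *A, size_t bytesA, float *B, size_t bytesB,
                 float *C, size_t bytesC) {
  /* Each dynamic instance asks for its own index in the 1D instance grid. */
  long i = __hpvm__getNodeInstanceID_x(__hpvm__getNode());
  C[i] = A[i] + B[i];
}

/* Internal (parent) node: creates the child graph with N parallel instances
 * and binds its own inputs to the child's inputs (bindings, not edges). */
void vecadd_root(float *A, size_t bytesA, float *B, size_t bytesB,
                 float *C, size_t bytesC, long N) {
  void *node = __hpvm__createNodeND(1, vecadd_leaf, N);
  __hpvm__bindIn(node, 0, 0, 0);   /* A      */
  __hpvm__bindIn(node, 1, 1, 0);   /* bytesA */
  __hpvm__bindIn(node, 2, 2, 0);   /* B      */
  __hpvm__bindIn(node, 3, 3, 0);   /* bytesB */
  __hpvm__bindIn(node, 4, 4, 0);   /* C      */
  __hpvm__bindIn(node, 5, 5, 0);   /* bytesC */
}
```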
We support one-, two-, and three-dimensional index grids for these, and conceptually there's no limitation, but that's what the current system supports. So this is an example of a grayscale edge detection pipeline that we've developed in HPVM, and it shows the hierarchical graph structure. You have pipelined task parallelism, where you're doing different computations in each of the nodes of the graph. You have medium-grain data parallelism across the pipeline stages, and fine-grain data parallelism within each individual pipeline stage. You have a graph hierarchy, so that, for example, inside a node that does zero crossings, you potentially have multiple different HPVM kernels, which the compiler can choose to fuse into a single kernel if it wants to. There are details in this figure which I won't go through, but I'm more than happy to talk about them offline. The takeaway point is that we can represent multiple different kinds of parallelism in a single parallel representation, and that's important because you have a variety of different kinds of parallelism in these heterogeneous systems.

The way this is actually implemented in practice is by using LLVM intrinsic functions. I'm sure all of you are pretty familiar at this point with LLVM intrinsics, but it's an easy way to add new operations to LLVM without having to change a large number of passes in the infrastructure. So we have intrinsics for declaring the graph structure: createNode1D, 2D, and 3D, and createEdge. bindInput is sort of a special case: when you have a parent node with a graph inside it, you need to connect the inputs of the parent node to the inputs of the child graph, and those are just bindings as opposed to data flow edges, so we have a different intrinsic to declare those bindings. Then there are intrinsics to query the current graph, so you can ask for your current node ID, the number of instances of your current node, or the ID of your parent node, in order to, for example, partition a computation or index into a parallel array in the threads that are executing in a node. And then there are intrinsics for doing memory allocation and synchronization, like HPVM malloc, which is an abstraction of where memory will be allocated; the memory will be copied to wherever it is needed. We also have intrinsics for atomic exchange, atomic add, and barrier. This is not necessarily a complete set, but these are enough for a number of different applications that we have implemented.

We also have some additional intrinsics to interact with a host processor that can launch an execution on a particular set of targets in a particular heterogeneous system. HPVM launch launches the execution of a single graph, and that's an asynchronous operation, so the graph will start executing while the host processor continues concurrently with that graph. A program is made up of multiple graphs, so in fact, in theory, the host could go on and launch additional graphs at the same time. And then HPVM wait blocks for completion of a particular graph. For streaming applications, we have two additional intrinsics, push and pop, which essentially push data elements into a graph for processing, for example a stream of images or something like that, and retrieve results back from it.
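Continuing the earlier sketch, here is roughly what the host-side launch, wait, push, and pop intrinsics just described could look like in use. Again, the header, names, and argument layout are approximations of the HPVM-C interface, shown only to illustrate the non-streaming and streaming launch patterns.

```c
/* Illustrative host-side sketch of launching and interacting with an HPVM
 * graph, in HPVM-C style. Names and argument layout are approximate. */
#include <stddef.h>
#include "hpvm.h"   /* hypothetical HPVM-C header providing the __hpvm__* calls */

/* Root node function from the earlier sketch. */
void vecadd_root(float *A, size_t bytesA, float *B, size_t bytesB,
                 float *C, size_t bytesC, long N);

/* Argument struct passed to the launched graph (layout is illustrative). */
typedef struct {
  float *A; size_t bytesA;
  float *B; size_t bytesB;
  float *C; size_t bytesC;
  long   N;
} RootArgs;

void run_once(RootArgs *args) {
  __hpvm__init();
  /* Asynchronous: the graph starts executing while the host continues. */
  void *graph = __hpvm__launch(0 /* non-streaming */, vecadd_root, (void *)args);
  /* ... the host could do other work, or launch additional graphs, here ... */
  __hpvm__wait(graph);            /* block until this graph completes */
  __hpvm__cleanup();
}

void run_streaming(RootArgs *args, int num_frames) {
  __hpvm__init();
  void *graph = __hpvm__launch(1 /* streaming */, vecadd_root, (void *)args);
  for (int f = 0; f < num_frames; ++f) {
    __hpvm__push(graph, (void *)args);   /* feed one item, e.g. one image     */
    void *result = __hpvm__pop(graph);   /* retrieve the corresponding result */
    (void)result;
  }
  __hpvm__wait(graph);
  __hpvm__cleanup();
}
```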
So I'll just quickly show you a high-level view of the optimization and code generation infrastructure that we support. We assume that there's some front end generating the HPVM code representation, which is LLVM extended with the intrinsics, and I'll come back to that issue in a moment. Conceptually, that can be done and then shipped to a server, or somewhere else that's aware of where the code is going to run, so that either at the user site or at a server that is aware of where the code is running, you could then do the final optimization and code generation in a target-dependent manner. You don't have to do that, but that's one way to get object code portability. Then we have a BuildDFG pass, which is a module pass in LLVM that takes the LLVM intrinsics and generates an explicit graph data structure representation of the HPVM IR: the nodes and edges representing the hierarchical data flow graph. That data flow graph is given to a graph optimizer, which does transformations like node fusion and tiling and other things you can do as essentially high-level graph transforms on the data flow graph itself. Then we do code generation, and code generation works by going bottom-up on the graph hierarchy. What I mean by that is we start at the leaf graphs, the lowest graphs in the hierarchy, where the computations are, and move up the hierarchy through the parents, and every node can be translated for one or more target processing elements.

There are two key features of the code generation here. One is that any node in an HPVM program can be translated for any target processing element in the heterogeneous system. In practice, you may get very bad performance if you translate a node for a target where the application code or the algorithm is not well suited, but in principle, that translation is always possible, and in practice you can get multiple different targets that do well with the same node. The second feature is that we do our best to reuse vendor-developed back ends, because vendors have often put a tremendous amount of effort into optimizing code generation for their target hardware, and being able to reuse that both saves a lot of investment and very likely gets much better performance than what either we or a new project team could do. So in particular, we support code generation for a host processor, and to date we've only supported x86-64, but we could easily do other hosts like RISC-V, ARM, and AMD that are already supported in LLVM. We use Intel's SPIR back end for AVX to do vector code generation, and they've really tuned that for very good vector code generation. We use the NVPTX back end to generate PTX for NVIDIA GPUs; that is actually an older version of the infrastructure, and I'll come back to that in a moment. This is what we used for our experiments and the numbers I'm going to show you; the open source release uses a newer version that uses OpenCL instead. And then we also have a back end for Altera FPGAs, which uses Altera's tool chain called AOC, an HLS tool for programming FPGAs using OpenCL, so we translate HPVM back to OpenCL in order to get to FPGAs.
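To picture the bottom-up code generation order just described, here is a small conceptual sketch in plain C: leaf graphs are handed to a stand-in per-target back end first, and parents are translated only after all of their children. This is an illustration written for this writeup, not actual HPVM source code.

```c
/* Conceptual sketch (plain C, not HPVM source) of the bottom-up code
 * generation order over the hierarchical dataflow graph: leaf graphs are
 * translated first, then their parents. */
#include <stdio.h>
#include <stddef.h>

enum Target { TARGET_CPU, TARGET_AVX, TARGET_GPU, TARGET_FPGA };
static const char *target_name[] = { "cpu", "avx", "gpu", "fpga" };

struct DFNode {
  const char     *name;
  enum Target     target;      /* chosen by the mapper/graph optimizer  */
  struct DFNode **children;    /* NULL-terminated list of child nodes,  */
                               /* or NULL for a leaf node               */
};

/* Stand-in for the real per-target back ends (SPIR for AVX, PTX for GPU,
 * AOC/OpenCL for FPGA, ...). */
static void emit_for_target(struct DFNode *n) {
  printf("emit %s for %s\n", n->name, target_name[n->target]);
}

static void codegen(struct DFNode *n) {
  if (n->children)                         /* translate child graphs first */
    for (struct DFNode **c = n->children; *c; ++c)
      codegen(*c);
  emit_for_target(n);                      /* then the node itself         */
}

int main(void) {
  struct DFNode leaf1 = { "laplacian", TARGET_GPU, NULL };
  struct DFNode leaf2 = { "zero_crossings", TARGET_AVX, NULL };
  struct DFNode *kids[] = { &leaf1, &leaf2, NULL };
  struct DFNode root   = { "pipeline_root", TARGET_CPU, kids };
  codegen(&root);
  return 0;
}
```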
There's one more feature that this enables. As I said, every node can be mapped to any target processing element in the underlying heterogeneous system. So if you think about this conceptually, if you have N graph nodes and K different processing elements in the target system, that gives you K to the power of N possible static mappings, possible different combinations of code that you could generate from a single HPVM program. I'll show you an example of the different performance impacts this can have for the edge detection pipeline I was talking about earlier. This also enables dynamic scheduling that can be much more flexible than if you didn't have this kind of flexible mapping. In particular, we can modify the mappings of graph nodes to processing elements at runtime. There are some restrictions on what we can map and when we can map it, but essentially when you start executing a node, you can choose where to execute it, and the next time you start it, you can execute it somewhere else, which is the dynamic feature that this supports.

So I'm going to present some performance results. I'm not going to go through everything in detail, just for lack of time. The target system we used has a multi-core host, AVX vector instructions on the host, and an NVIDIA GTX 680 GPU with 1536 cores and 2 gigabytes of RAM. The first experiment was to see what kind of performance impact you get if you compare hand-tuned OpenCL code, tuned separately for GPU and for vector hardware, against taking the same HPVM code and running it on both. The benchmarks we used are a set of benchmarks from the Parboil suite, which have multiple versions; we used the OpenCL versions, which have been hand tuned for both GPU and AVX, and this just lists which version we used for each one. For that experiment, this graph shows the normalized execution time comparing HPVM to the hand-coded baseline, which is the right-hand bar for each benchmark: 1.0 is the performance of the OpenCL baseline code, and the left-hand bar is the HPVM bar. Because it's normalized execution time, lower is better. The hope is that HPVM doesn't introduce a penalty for being hardware agnostic; the best we should be able to do, in theory, is match the hand-coded version. In practice, we come very close in most cases. I think the biggest discrepancies are in SGEMM and BFS, where there's an extra, longer copy time that was not being optimized by the compiler, and that led to about a 22% slowdown in BFS and similarly in SGEMM. So the bottom line is that on GPUs, we are reasonably competitive with hand-coded OpenCL code for the GPU.

On AVX, the story is almost as good, although there's a performance penalty in LBM that's somewhat worse. Again, the point here is that we're taking the same code that I showed you in the previous graph compiled for GPUs; here, we're compiling it for AVX but comparing it against the hand-tuned code for AVX. In most cases, HPVM is competitive with the hand-tuned OpenCL except in the case of LBM. For LBM, we checked the instructions being generated by our back end and they match very closely with the instructions generated for the hand-tuned OpenCL, but there was something with the driver causing the performance penalty which we have not been able to track down yet. So I'm not making excuses, but I don't think this is a fundamental issue in the approach or the abstraction itself.

The second experiment looks at the benefit you can potentially get with static scheduling. If you remember, we said that if you have K hardware targets and N nodes, you can get K to the power of N possible mappings of the code. In this case, we have three targets, CPU, GPU, and vector, and six pipeline stages in that edge detection pipeline; we ignored the hierarchy and basically collapsed each top-level node into a single node. So we have six pipeline stages, and that gives us 729 possible mappings from one piece of pipeline code. We picked seven arbitrary combinations of these mappings, where S means a mapping to just the CPU, G means a mapping to the GPU, and V means a mapping to the CPU with vector instructions, with one letter for each node in the pipeline. So a sequence like GGSGGS gives the mappings for the six stages in order for that particular case.
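Just to give a feel for the size of that static mapping space, here is a small sketch that enumerates all the device-assignment strings an autotuner could, in principle, evaluate; with K = 3 device classes and N = 6 stages, that is 3^6 = 729 strings. This is plain C written for this writeup, not part of HPVM.

```c
/* Enumerate all K^N static mappings of N pipeline stages onto K device
 * classes (S = CPU, V = CPU+vector, G = GPU), the search space an
 * autotuner would explore. Purely illustrative, not HPVM code. */
#include <stdio.h>

#define N_STAGES 6
#define K 3
static const char DEVICES[] = "SVG";

int main(void) {
  long total = 1;
  for (int i = 0; i < N_STAGES; ++i) total *= K;   /* K^N = 3^6 = 729 */

  char mapping[N_STAGES + 1] = {0};
  for (long code = 0; code < total; ++code) {
    long c = code;
    for (int stage = 0; stage < N_STAGES; ++stage) {
      mapping[stage] = DEVICES[c % K];   /* base-K digit selects the device */
      c /= K;
    }
    /* An autotuner would compile and run the HPVM program with this mapping
     * and record frames per second; here we just print the string. */
    printf("%s\n", mapping);
  }
  printf("total mappings: %ld\n", total);
  return 0;
}
```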
What we're showing here is the performance you get in frames per second, that's the y-axis, so higher is better, for different mappings of this pipeline. All of these mappings come from the same code. The point here is not that some mappings do much better than others, although that is sort of the precondition for the point. The point is that you can have dramatically different performance with different mappings, but because you can get all of these mappings from the same HPVM program, an optimizer, like an auto tuner or something else, can optimize the code to try to get as good performance as possible, so that flexibility in scheduling can be really powerful.

I don't have time to show you the dynamic scheduling results, but I'll briefly summarize them. The main thing we showed is that if you have a pipeline like this running on a combination of CPU and GPU and the execution on the GPU gets interrupted, then without the dynamic scheduling capability the performance would completely drop off a cliff; you would basically not be able to make almost any forward progress at all. But with the dynamic scheduling capability, you can gracefully degrade: some of the nodes that would have mapped to the GPU can instead be mapped to the CPU and execute there using the vector instructions, to get at least reasonable performance.

So just to summarize: HPVM is able to get performance that's reasonably comparable to hand-tuned code for the different heterogeneous hardware targets from a single HPVM program, the flexibility of static scheduling gives you a lot of freedom to optimize for performance, and similarly the flexibility of dynamic scheduling can enable you to tolerate dynamic changes in workload or, for example, deadlines if you're working in a real-time system or something like that. One of the things we're doing in the project with IBM, or I shouldn't say doing yet, but we're going to do soon, is this: IBM has been developing a scheduler called STOMP for real-time scheduling of an application running on this kind of domain-specific SoC, and what we are planning to do is to integrate the HPVM dynamic scheduler with STOMP in order to take advantage of this kind of flexibility of being able to move computations from one processing element to another.
So we've released HPVM open source, which includes the IR and a verification pass. HPVM is an extension of LLVM, as I said, and right now it is using LLVM 9. We have a front end that lowers HPVM-C, the C extension for HPVM, to the data flow graphs, and the open source release only includes the NVIDIA GPU back end and Intel or AMD hosts. We do plan to release AVX, but there's some limitation with the SPIR drivers, with the Intel OpenCL drivers, that we weren't able to test thoroughly enough to release yet, and hopefully soon in the future the FPGA back end as well. There's also documentation for the IR and for the C programming interface, an installation guide, and a much better test infrastructure than what we've been using so far internally, which includes regression tests and unit tests, but also the Parboil benchmarks, the edge detection pipeline, and a camera pipeline for image processing that we've been using in the IBM project as a sort of pipe-cleaner application. All of those come with the release.

The main changes we've made to the code for this release: one was to port to LLVM 9; we were on a much older version for quite a long time, and that was really the bulk of the work, making it more up to date. We've also developed a new back end from LLVM to OpenCL. Some of you might remember there used to be a C back end for LLVM, which had been abandoned for a while; it turned out the Julia project at MIT had revived it, and it's in much better shape now and more robust. We took that C back end, extended it to OpenCL, and we use that to compile to PTX now instead of using the NVPTX back end from before, because we found a significant incompatibility between the NVPTX back end and the current NVIDIA drivers. And there's a better testing framework, like I mentioned.

There are a couple of things we're doing right now and a couple of things that we are hoping to start on very soon. The first is the HPVM-to-FPGA code generation work. This is a bit of a research project at this point, but the goal is to enable hardware-agnostic programming of FPGAs. Traditionally, as you all know, FPGAs have been used widely for embedded systems and for prototyping hardware designs, but in all those cases you typically have a hardware designer on the team who understands hardware and can tune the hardware design really well. Things are starting to change in the FPGA world, because you're starting to get FPGAs available in places like AWS and also in Azure, where ordinary software teams can try to use FPGAs to accelerate their applications, but many software teams today don't have that kind of hardware expertise in house. So there's a much greater need these days for more hardware-agnostic programming, and even today's HLS tools don't come close to that: even though they're much simpler to use than Verilog or VHDL, you still have a lot of pragmas and a lot of hardware knowledge that you have to bring to bear in order to tune and optimize code for an FPGA. So the goal in this project is to use compiler transformations and optimizations, starting with HPVM as the abstraction of parallelism, in order to generate much better code for the FPGA. Now, we are not likely to match RTL, but if you can come within a factor of 2 or a factor of 4 of hand-coded RTL with almost no hardware knowledge, we think that's going to be pretty
useful for many teams. So that's the goal we're aiming for. The compiler people among you will probably remember that John Backus used to say that if we could come within a factor of 2 of hand-written assembly, our FORTRAN compiler would be a success. How many of you know that quote from John Backus? This was in the 50s; they were trying to prove that you could write code in high-level programming languages and actually compete against assembly code. So this is a similar kind of excuse we have: we want to come within some factor.

The second major goal we have right now is approximate computing, which is the idea that for many applications at the edge, where energy efficiency is important, you also have quite a significant amount of flexibility to reduce accuracy in order to get better energy efficiency and better performance. That's true in many application areas like machine learning and image processing and a lot of others. The problem in practice is that there are lots of potential approximation techniques, or what we call accuracy-aware optimizations, but using them requires understanding them in detail, and when there are multiple ones, combining them is very, very hard. So what we're trying to do is to automate that process to a large extent, to make it actually accessible to ordinary programmers, and that's been the major focus of the work in the IBM project. We are also starting to develop a new DSL front end in order to achieve interoperability between DSLs for edge applications.

Another step we've been looking at fairly carefully is what benefits there might be to integrating with MLIR, because MLIR has some significant advantages for the kind of work that we're doing, and I'll just say a couple of words about that, although I know I have only a few minutes left. For those of you who are not familiar with it, MLIR stands for multi-level IR. It's essentially a framework for defining multiple different compiler IRs, allowing them to interoperate, and making it much easier to implement IRs like this. The emphasis in MLIR has been on machine learning or tensor-based applications, so there's a heavy focus on high-dimensional tensors and on polyhedral transformations on multi-dimensional loops and so on. In order to get this interoperability, MLIR defines different dialects, and there's quite a significant number of dialects at this point. Examples are things like the affine dialect for loop transforms and arrays, the LLVM dialect, which is used for code generation, the linear algebra dialect for high-performance computing programs, the GPU dialect for compiling to PTX, and so on. One thing we're looking at is whether it would make sense to make HPVM another dialect within MLIR. I think the benefit for HPVM would be that we would get access to a polyhedral framework, which we don't have directly right now, and we might be able to get new front ends in the future, although I don't think there's anything really available today that's open source, because the TensorFlow one, I don't think, is open source. MLIR may also benefit from back ends that it doesn't have; for example, I don't think there's an FPGA back end today, and we're also starting to develop a neural network back end for Intel's Myriad VPU, and potentially also from the accuracy-aware optimizations that I was mentioning a moment ago.

So that's all I had to say, really. The project aims to develop compilers for achieving both programmability and
performance for heterogeneous parallel systems, and in particular to make it easier to build compilers for a variety of different programming languages, including domain-specific languages and more general-purpose languages, and target them to these kinds of heterogeneous systems. In particular, we can use a common representation to do three different things: a virtual ISA, a compiler internal representation, and a runtime scheduler. That's sort of the philosophy behind the project, and with that I'll stop and take questions.

That's true, and I think in your presentation I'm not entirely sure if I saw anything about, once you decide to map to a particular accelerator, whether the IR still looks like a pretty generic IR. Right, right, so actually that is the reason for this slide, there we go. The IR starts out being hardware agnostic, because the front end is lowering it to an HPVM representation that's not necessarily specific to any hardware, but to do any optimizations and code generation you really have to be cognizant of the hardware. So once you are targeting a particular GPU, for example, this is very, very GPU specific, and it's no longer agnostic at all of the particular GPU. So you're absolutely right: you're tuned to the particular GPU in order to do that. I think we have a question here and then one here.

Quantum computing? I should talk to you offline, I would love to talk to you offline. We do have a quantum computing person, and this would be interesting.

That's a good question. So the question is: what is the benefit of using LLVM here as opposed to building the infrastructure from scratch? I think that's the question. There are a couple of important benefits, actually. One is that there's a lot of investment in back ends for individual hardware targets in an infrastructure like LLVM; GCC is the same, there are a lot of different hardware back ends, and one of the important things we want to do is to reuse these back ends. If we built our own infrastructure we wouldn't be able to do that, so that's one thing. The second is that there are a lot of LLVM front ends, so, for example, for the C extension we're basically using Clang, and we're using LLVM passes to translate C with HPVM extensions into the HPVM IR internally, and that's what BuildDFG does here. So the front ends also become much easier.

So the HPVM representation captures that either explicitly or implicitly, depending on whether it's a logical copy or whether it's just shared memory, so we do account for it. For the GPU, for example, we can actually do tiling for scratchpad in order to optimize for local memory versus global memory, so in that sense, yes, we do account for it. I think there's quite a significant additional piece of work we could do to have a better model of the overall memory hierarchy of the whole system. Right now there's not really a good target memory model: we have logical memory copies and we have shared-memory loads and stores, but the architecture is not really modeled beyond that. There are other projects that do a much better job, like the Legion project, and I think something that has a better target memory model could do a better job of optimization across the whole program. Okay, I'm happy to take more questions.

Yeah, so I guess it is compiled into some LLVM code which emulates or implements this nested loop, so do I understand correctly that once I compile this program further into GPU code, some clever GPU optimizer has to infer
that this was a matrix multiplication and pick a very fast and efficient kernel to do this matrix multiplication? So that's not what we do in HPVM, in the sense that we're not trying to recognize that it's a matrix multiplication kernel, which means that we may not be able to do as well as targeting, let's say, a tuned cuDNN operation or something else that does matrix multiplication hand tuned for a particular GPU. We won't be able to directly target that; instead, we would just do standard compiler optimizations like tiling and other optimizations to get as good performance as possible. In practice, what we've been doing for Keras, and in particular what we think is the right way to do this for any particularly important domain like tensors, is to extend HPVM further with tensor intrinsics. So we've added tensor intrinsics to the HPVM representation, and tensor intrinsics are now a higher-level piece of information that we can target to something like cuDNN or a hardware accelerator or something else without having to reverse engineer what the kernel is. Another way to do it, instead of adding intrinsics in this way, is that another team on our project, the Harvard folks, David Brooks and his people, are doing some work to use tracing: they automatically detect, by extracting a dynamic trace of execution of basic blocks and doing some probabilistic pattern matching, that a trace is similar to the traces of a particular computational kernel. It might be an FFT or a matrix multiply or something else, and that tells you that this is a matrix multiply, and so now we can directly target an accelerator in that way. So we're going to basically take input from their tool to figure out which accelerator to target; that's the integration with Harvard that we're doing in the IBM project.

Well, if this piece is already a bottleneck when I run it on CUDA, can you annotate it there so you can give specific hints for it? So, you know, those were PowerPoint slides; we don't have an intelligent scheduler. The numbers I showed you were the potential, if you could build a very sophisticated scheduler. In practice, what we do is to actually add attributes, just like you're describing, to say map this node, this DFG node, to this hardware device, or to this set of hardware devices as a set of choices, and then we use that to guide the compiler. So in practice, yes, you can certainly put in attributes to say which nodes should be mapped to which hardware. And if you're doing this in C but want to use a particular CUDA library function to accelerate it, sure, you can basically make those be the attributes instead.
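To make that idea of per-node mapping attributes concrete, here is a small hedged sketch; the hpvm.h header, the __hpvm__hint call, and the target constant are approximations of how such a hint could be expressed in HPVM-C style code, not an exact reproduction of the released API.

```c
/* Illustrative sketch of pinning a DFG node to a device class with a
 * per-node hint. The header, the __hpvm__hint call, and the target
 * constant are approximations of the HPVM-C interface. */
#include <stddef.h>
#include "hpvm.h"   /* hypothetical HPVM-C header */

void fft_leaf(float *in, size_t bytesIn, float *out, size_t bytesOut) {
  __hpvm__hint(GPU_TARGET);   /* ask the compiler to map this node to the GPU */
  long i = __hpvm__getNodeInstanceID_x(__hpvm__getNode());
  out[i] = in[i];             /* placeholder body */
}
```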
Yeah, sorry, I didn't see you earlier back there, go ahead. So I didn't say all, at least I hope I didn't. What I intended to say is that parallel applications that are suitable for heterogeneous parallel systems, for accelerators in heterogeneous parallel systems, tend to be data parallel or some combination of data parallel and streaming or pipelined. These are not arbitrary threaded, concurrent parallel computations, and in particular I think there are classes of parallelism that are not well suited for HPVM, or vice versa, and there are lots of multi-threaded parallel computations that I wouldn't want to try and compile with this.

Sure, yeah, it's a good question. I don't think there's an ideal way to do it, because there's an infinite set of possible parallel hardware designs and heterogeneous systems. I think that this will only happen by accumulation of evidence, and in practice what we hope to be able to show is that as you compile to systems with a few different combinations of accelerators, you'll be able to get very good performance, for some definition of very good; we're basically talking about coming within some factor of hand-tuned code. That's, I think, the best practical way to provide evidence to justify that claim. I'll be honest with you: in practice, what we found is that you start doing this research, and we did that, and there are many interesting applications of it, and the students are moving on, so we're now moving to a phase where we're not trying to prove this anymore; we are now trying to use this for interesting and practical projects. In the IBM project, for example, we're trying to use this as a way to design SoCs, or make it easier to design and program SoCs. So I think there are many different applications or goals you can have, and the approximate computing work we're doing is a way to make edge applications much more energy efficient; for agricultural robots or autonomous vehicles, energy efficiency is a major goal. I'm happy to stay and keep answering questions or talk, but as the announcement said, people are welcome to leave if they would like to.

Well, I think the reason for your skepticism, honestly, and I have talked to some, not lots, but some FPGA programmers and experts, is that, at least when talking to them, and I'm assuming it's the same with you, they are coming from a background of having hand tuned, or of comparing mentally against, some hand-tuned baseline, and we are not trying to achieve that; we are not claiming that we will compete with RTL. So you're saying that basically it'll be so far off any reasonable performance that it won't be useful? Yeah, so honestly, I don't have the answer to that, because we are too early in that process. What we've done so far, for example, is that image processing camera pipeline: we've compiled it, it runs reasonably fast, and, for example, the initiation interval was very bad when we started, and with some compiler optimizations we got it down to a well-pipelined, single-cycle pipeline, and it fits on that Altera chip. Is that enough? I don't think so. For more complex applications, the jury is definitely still out, and I think in practice it's going to take significantly more compiler transformations to make this possible. Until we can do that and see how it works out, I don't know, but this is literally a research goal and an aim so far.

Yep, yeah, if you want to leave, go ahead. Prefacing the question: every node is independent, you said, right? So one of the things which I did not show in the intrinsics is that we use a runtime memory tracker to track the current location of every allocated memory object. HPVM allocates a memory object, and, for example, if you compute on one device and then you need to access it on another device, we know that it's on the first one; if you compute on one device and the next DFG node computes on the same device, we know that they're both on the same device. The runtime tracker keeps track of that. So the runtime memory tracker, it's not very complicated, but it's essential to be able to do exactly the optimization you're talking about. Yes, thank you.
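The runtime memory tracker described in that last answer can be pictured roughly as follows: a table from allocations to the device holding the current valid copy, consulted before each node accesses the data. This is a simplified illustration of the idea written for this writeup, not the actual HPVM runtime code.

```c
/* Simplified illustration of the runtime memory tracker idea: for each
 * tracked allocation, remember which device currently holds the valid
 * copy, and copy only when a node mapped to a different device needs the
 * data. Not the actual HPVM runtime, just the concept. */
#include <stdio.h>
#include <stddef.h>

enum Device { DEV_HOST, DEV_GPU, DEV_FPGA };

struct Tracked {
  void       *ptr;       /* host pointer used as the key        */
  size_t      size;
  enum Device location;  /* device holding the up-to-date copy  */
};

#define MAX_TRACKED 1024
static struct Tracked table[MAX_TRACKED];
static int n_tracked = 0;

static struct Tracked *lookup(void *ptr) {
  for (int i = 0; i < n_tracked; ++i)
    if (table[i].ptr == ptr) return &table[i];
  return NULL;
}

/* Register a new allocation; the valid copy starts out on the host. */
static void track(void *ptr, size_t size) {
  if (n_tracked < MAX_TRACKED)
    table[n_tracked++] = (struct Tracked){ ptr, size, DEV_HOST };
}

/* Stand-in for the runtime's real data-movement primitive. */
static void copy_between_devices(void *ptr, size_t size,
                                 enum Device from, enum Device to) {
  printf("copy %zu bytes of %p: device %d -> device %d\n", size, ptr, from, to);
}

/* Called before a node mapped to `dev` accesses `ptr`: copy only if the
 * current valid copy lives somewhere else. */
static void request_on_device(void *ptr, enum Device dev) {
  struct Tracked *t = lookup(ptr);
  if (t && t->location != dev) {
    copy_between_devices(t->ptr, t->size, t->location, dev);
    t->location = dev;   /* consecutive nodes on the same device: no copy */
  }
}
```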