I welcome Thanos. I'm not even going to attempt to pronounce his surname; I'm sure you can do that for me. Thank you very much.

Thank you very much. Hello, everyone. I'm Thanos, my surname is Stratikopoulos, and I'm from Manchester. Today I have the opportunity to present our open-source framework, TornadoVM, a programming framework that allows programmers to accelerate their Java applications on heterogeneous devices such as GPUs, multi-core CPUs, and FPGAs.

This is the agenda for today's talk. I will start with a little bit of the motivation for our project, then introduce the internals of TornadoVM, then highlight a key feature of our system, which is dynamic application reconfiguration. After that I will show some use cases of how TornadoVM has been used by applications to extract performance, and finally the current state and future directions.

So let's start with the motivation. Why should we care about CPUs, GPUs, and FPGAs? The answer is because they are available. Even small systems like our smartphones have multi-core CPUs with GPUs, so why not utilize them? Why not exploit all the available hardware that we have in our systems? In data centers we have recently seen FPGAs being deployed in the cloud; they have started to become available in AWS instances.

Starting with the CPU on the left side: an Ice Lake microarchitecture with eight cores and an integrated GPU can achieve, including the GPU, up to one teraflop of performance. That is good, and it is well suited for control-flow execution, so for branchy code and low-latency requirements. If applications have a lot of data that can be processed in parallel, they can utilize a GPU, which has high memory throughput and up to three thousand cores available to process data. And lately there is the FPGA. The nice thing about this chip is that it is programmable, so the same device can be reconfigured and tailored to the needs of the developer. It is intended for pipeline parallelism and low latency, but it comes at the cost of programmability, because FPGAs are traditionally programmed in a hardware description language.

So despite all this diversity in the hardware, which appears on the right part of the slide, the question is how a programmer can harness it, especially from high-level languages like C, C++, or even Java. The answer is by using a programming model, because that is where the whole magic, the abstraction, comes from. For heterogeneous systems there are programming models such as OpenCL and CUDA, and these programming models abstract the execution. They follow the principles of an execution model in which the accelerators, the CPU, the GPU, and the FPGA, can be used in an abstract form. The execution is the following: first you copy the data from the main memory of the system into the memory of your device; then you execute the accelerated computation; and then you copy the results back out to main memory. In this way the CPU, the GPU, and the FPGA all look alike: they are just accelerators that process data.

So then the question is: OK, with C, C++, OpenCL, and CUDA we can target all the devices available on our systems, but what about managed languages? What about Java, JavaScript, Python? What about languages that were designed by nature to write once and run everywhere?
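The copy-in, execute, copy-out model just described can be sketched in plain Java. This is an editor's illustration, not TornadoVM or OpenCL code; the `Accelerator` interface and the toy "CPU device" are invented here purely to show how the three-step protocol makes every device look alike.

```java
// A minimal sketch (not TornadoVM code) of the accelerator abstraction the talk
// describes: every device follows the same protocol -- copy in, execute, copy out.
interface Accelerator {
    void copyIn(float[] hostData, float[] deviceBuffer);   // host -> device memory
    void execute(float[] deviceBuffer);                    // run the kernel on the device
    void copyOut(float[] deviceBuffer, float[] hostData);  // device -> host memory
}

public class ExecutionModel {
    // A toy "CPU device": the copies are plain array copies and the kernel
    // doubles each element. A GPU or FPGA would implement the same interface.
    public static Accelerator cpu = new Accelerator() {
        public void copyIn(float[] host, float[] dev)  { System.arraycopy(host, 0, dev, 0, host.length); }
        public void execute(float[] dev)               { for (int i = 0; i < dev.length; i++) dev[i] *= 2; }
        public void copyOut(float[] dev, float[] host) { System.arraycopy(dev, 0, host, 0, dev.length); }
    };

    // The uniform three-step execution: identical for any accelerator.
    public static void run(Accelerator device, float[] data) {
        float[] buffer = new float[data.length]; // stands in for device memory
        device.copyIn(data, buffer);
        device.execute(buffer);
        device.copyOut(buffer, data);
    }

    public static void main(String[] args) {
        float[] data = {1f, 2f, 3f};
        run(cpu, data);
        System.out.println(java.util.Arrays.toString(data)); // [2.0, 4.0, 6.0]
    }
}
```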
Well, the current frameworks, the current JVMs, emit code mostly for x86 processors. So currently there is no framework that allows Java to transparently generate code dynamically for any hardware device, like an FPGA or a GPU, in a way that is transparent to the user. This is the main motivation of our work: we envision a system that will transparently allow these languages to exploit all the available hardware on the platform.

Let's have a look at the internals of TornadoVM. On this slide I will present the software stack in a top-down order, so let's start with the API. Tornado currently doesn't detect parallelism, so it doesn't know which part of a program can be parallelized. It relies on the programmer to specify that a method would be a good candidate for acceleration on a GPU, and this is done through the API that we expose. Our API is a task-based API: a task is a representation of a method that could be offloaded to the FPGA or the GPU, and we can have a group of tasks, a group of methods, that will be offloaded and executed on the hardware in sequence. The tasks are then forwarded to the runtime, in which we have an optimizer that can optimize the execution.
For example, if we have two tasks, and the second task consumes data that comes from the first task, then that data does not need to be copied out of the GPU. In that case we can optimize the data transfers and save energy. The runtime then emits our bytecodes, the Tornado bytecodes: simple bytecodes that orchestrate the execution. The bytecodes are initially executed on an interpreter, and then forwarded for lazy compilation to the JIT compiler, which is the Graal compiler, extended to apply specializations for the target device, for execution on the FPGA or the GPU. We have different types of specialization depending on the device that we target, and this is essential because, although OpenCL code can be portable across devices, the performance is not portable at all, especially when you take code that is meant for GPUs and target hardware like an FPGA, which is ideal for pipelined execution. The compiler then emits the specialized code, and the code is forwarded to the device drivers, where the final compiler sits: for example the NVIDIA compiler, or for the FPGA a high-level synthesis compiler, which will compile the OpenCL into the final binary that will be offloaded and executed on the device. Our system is modular, and we currently support NVIDIA and AMD GPUs.
We support x86 multi-core CPUs from Intel and AMD, and we also support Intel and Xilinx FPGAs.

This is now an example of how a user can use TornadoVM, how they can specify that code could be parallelized on the hardware, on a GPU for example. We have a class Compute with one method, mxm, which computes the matrix multiplication of two arrays A and B and stores the result in the array C. The way the programmer parallelizes code with TornadoVM is by using the @Parallel annotation, which is an annotation exposed to the programmer to indicate that these loops could be parallelized. This is a hint, and that is the only modification to the method. With this, Tornado is able to parallelize the loops, apply specializations for the hardware devices, execute the code, and get performance essentially for free.

The only other change the programmer needs to make is to be compliant with the API that we expose. They need to create a task schedule, which is a group of tasks; in this particular case it is named s0, and it has one task in this example, because we have one method: t0, then the name of the method, and then the parameters of the method. In our task-schedule interface we have streamOut, in which the user specifies the variable that will hold the result of the computation, and then we call execute. Once it is compiled, the programmer can execute the code on the GPU by just running tornado followed by the class name; tornado is an alias for java plus all the JVM parameters that it uses.

Let's now have a look at dynamic application reconfiguration, which I think is a very nice feature to have in a system. Dynamic application reconfiguration is essentially live task migration: the tasks, the methods, can be dynamically migrated from one device to another, and this is really cool. Let's have a look at how our framework is built to support this functionality.

At the top we have the task schedules, which are the groups of methods to be offloaded and accelerated on the hardware. Tornado then forks one thread per device, so for example one for the multi-core CPU, one for the integrated GPU, one for the external GPU, and one for the FPGA, plus a thread for HotSpot, which will JIT-compile the code in OpenJDK. Each thread compiles the code if it is not already compiled, and once compiled, the code is stored in the code cache so that the compilation cost is avoided the second time. Then the code is offloaded and executed on the hardware, and we wait to see when it finishes: we have a barrier at the end at which all the threads are joined.

After that, we are able to apply what we call policies. With these policies we decide what we want to do. Do we want, for example, the first thread to compile and execute to be the only thread that executes, killing all the rest? That is the latency policy, which is intended for applications that are very latency-critical. Another one is end-to-end, which includes the time for compilation and execution. And the other one is peak performance, which is the policy that counts only the data transfers, in and out, and the execution.
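The fork-one-thread-per-device mechanism with a winner-picking policy can be mimicked in plain Java. This is an editor's sketch of the idea, not TornadoVM internals: ExecutorService.invokeAny returns the result of the first task to complete successfully and cancels the rest, which is essentially the latency policy just described. The device names and timings here are made up.

```java
// A plain-Java sketch (not TornadoVM code) of the policy mechanism: one thread
// per "device" runs the task, and the latency policy keeps the first finisher.
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PolicySketch {

    // Each simulated "device" reports its name after a made-up
    // compile-plus-execute delay in milliseconds.
    public static Callable<String> device(String name, long millis) {
        return () -> { Thread.sleep(millis); return name; };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        // invokeAny returns the first successful result and cancels the rest --
        // the essence of the latency policy.
        String winner = pool.invokeAny(List.of(
                device("multi-core CPU", 80),
                device("GPU", 20),
                device("FPGA", 200)));
        System.out.println("fastest device: " + winner); // fastest device: GPU
        pool.shutdown();
    }
}
```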
Let's now see some performance results for this dynamic reconfiguration feature. In this figure we present four systems: TornadoVM, which decides with dynamic reconfiguration where the execution should be migrated, and then the CPU, the GPU, and the FPGA individually. We have two benchmarks, two applications: one is DFT and the second one is NBody, and we compare two different policies: end-to-end, which includes the JIT compilation in the measured time, and peak performance, which counts only the execution and the data transfers.

Let me first explain the axes: the x-axis shows the data sizes and the y-axis shows the performance against HotSpot. The interesting part of these results is that for small data sizes the best performance is achieved in HotSpot. It doesn't make sense to copy the data out over PCI Express to the device and execute a small computation there, because the data is not significantly large. As the data sizes increase, we can see that the execution on the GPU or the FPGA can become significantly faster than HotSpot, and then it makes sense to migrate the execution there.

Another interesting fact, looking at peak performance for example, is what we can see would happen if the GPU was not present.
If the pink spots, the GPU results, were not present, then for large data sizes the execution would be migrated to the FPGAs, and this could be significant because it could bring significant energy savings, significant energy efficiency, to the system. The maximum performance that we got was up to 4500x against the sequential Java code, and that was on an NVIDIA 1060.

Let's have a look at how Tornado has been applied in real applications. First I have to say that Tornado is maintained under the umbrella of a European Horizon 2020 project, E2Data, which has as its objective to create an end-to-end solution for big data frameworks that want to target heterogeneous computing nodes.

This is the example of accelerating Apache Flink, which is a big data framework. In this case the clients are Java developers who create operators in Java. They forward the operators to the job manager and eventually the task manager, which will distribute the operators across the available computing nodes in the system. In these distributed, heterogeneous nodes, each node can have a GPU or be configured with different hardware capabilities, and this is the goal of this use case.

The second case is machine learning acceleration. This use case has been developed by EXUS, which is the coordinator of the E2Data project. The main problem here is that patients go to the hospital, they are admitted and hospitalized, and then they leave the hospital, but there is a chance that they may be readmitted, depending on their profile, the disease, their condition, and other characteristics, other features. The idea here is to create a machine learning model which will accurately predict how likely it is for a patient to return to the hospital.
EXUS found that by deploying TornadoVM they can achieve up to 14 times higher performance for a data set which contains data for two million patients.

The last case that I want to present is deep learning acceleration. In this case we took Deep Netts, which is a deep learning framework written in Java. Deep Netts currently doesn't have support for GPU acceleration, and we know that deep learning has the potential to be parallelized, because it has many neurons that can be processed in parallel. I would like to emphasize here that the currently available solutions for deep learning use pre-compiled kernels, static binaries that you deploy, for example from TensorFlow, with bindings for Java and Python. There is no current framework that can dynamically generate code for the devices; they are stuck with static compilers.

On the right side I have an example of how we accelerated a part of Deep Netts, the backward-propagation method. This is the original code of Deep Netts, and these are the changes that we made: we added the @Parallel annotation, then we created the task schedule for this particular method, with one task, and we specified the input and output of the method that go to the hardware. With this we achieved up to eight times higher performance for large data sets.

Let's have a look at the current state of the project and the future directions that we have. Tornado is currently available on GitHub.
It is open source, so feel free to try our examples and go through the documentation. We also have Docker images available for NVIDIA GPUs and integrated GPUs. I would like to emphasize as well that we have tested it with IDEs, so you can debug your code in Java from the IDE instead of going through the vendor tools and the hardware debuggers, and all that painful procedure of developing for FPGAs, for example.

So what's next in our work in progress? We are becoming compatible with OpenJDK 11, we are doing optimizations for FPGA and GPU execution, we currently run on AWS instances that have CPUs, GPUs, and FPGAs, and we are also working on an NVIDIA PTX backend, a CUDA backend.

This is our team, composed of academic staff and PhD students, and of course we are looking for collaborations, so feel free to give us feedback and to talk to us. I'm here with my colleague Florin, and we would be glad to have a discussion about our project.

As takeaways, I would just like to emphasize that our work is not meant to replace HotSpot. We want to emphasize that these hardware capabilities exist, so we may want to leverage them for large data sets: it may be worthwhile to offload one part of our program to the FPGA, another part to the CPU, and so on. Thanks for your attention; we would be glad to discuss our project and get some ideas and feedback. Thank you very much.

Q: Any questions? We've got two minutes. So you basically schedule the algorithm to one of the hardware stacks that you have, right?

A: Yes. We don't only schedule, we generate the code.

Q: You generate the code, yes. So my question would be: what kind of workloads have you tested this on? Suppose there are multiple algorithms running in parallel. Which one would you optimize, which one would you run? How would you solve such a problem?

A: It depends on the characteristics of the application.
So It's not a fixed solution. So for example, it's not a specific answer that this will go there So GPUs are not intended for pipeline Executions where FGA can give you more performance improvements there So I think it's a trade-off depends on the characteristics of the applications So at first we profile the code and then we analyze it But thank you one more question very quick one You said you were using the growl compiler to jit compile We're using truffle to feed the your bytecode into that or were you feeding it directly? Good question. So so far we Don't do that, but isn't in the future work in our plans to do that, you know to become compatible with any truffle language Thank you very much