Hello, everyone. Thank you for coming to our slot today with Karm. My name is Christos. We will talk about our joint work with Red Hat in the AERO project. As you see, I have a couple of jobs; the main one for this talk is that I am the technical coordinator of the AERO project. So, Karm is with me.

Hello, I'm Karm. I work as a quality engineer at Red Hat on making Quarkus compilable to native image. Over to you.

Thank you. In this presentation we are going to start from European policy and work our way down to compilers. It's going to be a rough ride, but stay with us.

So what is this all about? The whole story of how the European Commission sees the sovereignty of the European Union with regard to chip manufacturing started, I would say, more or less with the pandemic, and the fact that the majority of the chip manufacturing in the world is concentrated in Taiwan. To address this, both the US and the EU put forward legislation to create sovereignty. What does that mean? The European Commission passed the European Chips Act, a 50-billion-euro budget project for the Union to be able to design, manufacture, fabricate, and procure processors made in the EU. It's a big task, and it requires a lot of investment, both from individual countries and from the Commission.

Not long after this was voted, which was, let's say, a couple of months ago, the European Union was already starting the prep work in parallel: projects to build both the hardware and the software. Part of this collection of hardware and software projects is ours, which is called AERO. The goal of AERO is to optimize the software stack of a typical cloud deployment (Docker, Kubernetes, different runtimes and operating systems) so that it is ready when the hardware of these EU processors goes to market. Essentially, imagine projects building the hardware and the software in parallel; at some point in the near future, hopefully, the outcomes of these projects will help companies within the EU migrate from Amazon or Azure to these EU cloud services.

In the AERO project we are many partners, and the idea is that we take a collection of software frameworks that exist in current cloud deployments and optimize them. This hardware ecosystem is going to be very heterogeneous: cores with accelerators inside the SoCs, and accelerators like GPUs and FPGAs connected around these cloud servers. We try to optimize compilers, runtimes, and different frameworks for managing cloud operations. Of course, it's a small project; we cannot cover all the software, but it's a good start to have a first implementation that makes it sensible for somebody to use these EU cloud services.

So what is this hardware? We don't know. I mean, we know, but not exactly. We know that there are many projects creating different designs, under an umbrella called the European Processor Initiative: designs of cores, of accelerators, of interconnects, of packages. These designs are funneled to other projects like ours, where we take those test beds and start bringing up the software.
If you go to the European Processor Initiative website, you will see many different streams: chips for HPC, chips for automotive, for IoT; some of them are ARM-based, some of them are RISC-V-based. It depends on the project; they experiment with different designs. Nevertheless, no matter which design somebody chooses, at some point we will have to run software on it and prove its performance. We are not involved in the hardware design; we are, let's say, the consumers of these hardware designs, where we bring up the software.

Our target is a hybrid of ARM and RISC-V. The first commercially available processor from the EPI project comes from a French company called SiPearl, and it is an ARM-based processor onto which RISC-V accelerators are attached. Then you have PCI-Express-connected GPUs from Intel and NVIDIA, and of course we have FPGAs to experiment with more, let's say, research-oriented stuff compared to current upstream software.

I don't have time to talk about all the software we target, so I will narrow the discussion down to managed programming models and runtimes, which are the University of Manchester's and Red Hat's, let's say, expertise. What you see here is the whole stack we try to cover. Regarding the runtimes, we target managed programming languages on the JVM. So although my talk says Java, in reality it's the JVM: anything that runs on top of the JVM will benefit from this work. We target Java, frameworks for microservices, and TornadoVM for accelerating Java (or JVM) applications on GPUs and FPGAs. And of course there is the other stream for native programming languages, like SYCL, DPC++, and oneAPI, which is handled by Codeplay and Intel in this project.

Now I'm going one layer down the abstraction, and I will narrow the discussion to Java, or rather the JVM. What do we do in this project? Namely, we try to optimize two main frameworks. Of course, there are the OpenJDK distributions, which now have RISC-V backports from Alibaba and already have ARM support, and Red Hat is doing great work supporting the ARM builds. These are all more or less upstream, and people can download and use them. Now we take it a step further, at least in our project, and experiment with this hybrid of ARM, RISC-V, and accelerators: how this would look for the developers and for the hardware. We focus on Quarkus, which will be Karm's part of the talk, and on TornadoVM.

TornadoVM is a framework you probably haven't heard of. It's a framework from the University of Manchester where we try to, let's say, bring higher performance to the JVM through heterogeneous execution. What is TornadoVM? In a sentence, it is a JVM plugin. Although it's called a VM, it's not a new VM; TornadoVM by itself doesn't run anything. It needs a host JVM. So you take, let's say, Amazon Corretto, or Mandrel from Red Hat, or OpenJDK, you download it, you plug in TornadoVM, and it gives you a lightweight API which you can use to automatically accelerate code on GPUs and FPGAs. I will show you a little bit of the how later. Essentially, it is an add-on for any JVM that supports JVMCI, which you can use to automatically accelerate code on accelerators. It has, let's say, two main features that we advertise. The first one is, of course, the API, so you don't have to do a lot of work.
If you have done GPU programming, you know that currently you have to use JNI calls and CUDA kernels and do the memory management manually, copying data from the Java heap to the actual accelerator. TornadoVM solves all these problems. TornadoVM does not expose any hardware to the developer: we are strong believers in the original Java idea that you write once and run everywhere, and for us this "everywhere" is not only multi-core CPUs but also GPUs and any other hardware platform that exists.

And then we have automatic specialization. If we run, for example, OpenJDK on ARM or Intel, the compiler, whether C2 or the Graal compiler, has a specific compilation chain, and underneath there are the intrinsics, so each vendor puts in its own specialization. This is what TornadoVM does as well. Code compiled for a GPU is going to be different, and code compiled for Intel GPUs is going to be different from code compiled for NVIDIA GPUs. We detect the hardware and automatically specialize the code to run better on each platform.

How do we do it? Essentially, we plug into an existing JVM, in this case OpenJDK. Developers use our API, and then we take the bytecodes, go through the Graal compiler, work on the IR, and then lower it either to OpenCL, or to PTX for CUDA, or to SPIR-V for Level Zero. Each of those frameworks can target different kinds of devices. At the top right you see the different distributions and the different hardware vendors that we support.

Again, I would like to spend at least six hours talking about TornadoVM, but I cannot, so I will try to give a small idea of the model we use. In any GPU programming model we essentially have to do two or three things, depending on how you look at it. First, we copy the data from the CPU memory to the accelerator memory. Then we run the code there, the kernel, in CUDA or OpenCL or whatever. Then we copy the data back from the GPU's memory to the Java heap. This execution model is really simple to comprehend: I have to do three things, no problem. But in the JVM world, these three things require a lot of code, a lot of code that you have to write manually.

We solve these problems by automating everything the developer shouldn't have to care about. How? By having essentially two types of code in the TornadoVM API. The first is the host code, the orchestrator that does the playmaking: which data goes where, how, and all these optimizations. The second is the kernel: the compiler generates the kernel code that corresponds to the Java method we want to accelerate. We have the task graph, the structure in which we compose different tasks. In this case we reference a method of a class, like a lambda or a method reference. We don't change the Java code; we just pass it there, together with some annotations, and the compiler picks it up, compiles it, and runs it completely transparently to Java. There is no JNI or any manual work to be done.

We have two complementary APIs for development, called the loop parallel API and the kernel API. The loop parallel API is more for the case where you have a for loop, say a very heavy one you want to accelerate: you just put an annotation on it, and then we do the rest automatically.
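To make this concrete, here is a minimal sketch of the loop parallel API, following TornadoVM's published examples; exact class and method names vary between TornadoVM versions (older releases used a TaskSchedule class instead of TaskGraph), so treat this as illustrative rather than definitive:

```java
import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.enums.DataTransferMode;

public class VectorAdd {

    // Plain Java method; @Parallel marks the loop that TornadoVM may
    // parallelize and compile to OpenCL/PTX/SPIR-V for the target device.
    public static void add(float[] a, float[] b, float[] c) {
        for (@Parallel int i = 0; i < c.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        float[] a = new float[1024];
        float[] b = new float[1024];
        float[] c = new float[1024];

        // Host-side "playmaker": declares which data moves to the device,
        // which method runs as a task, and which results come back.
        TaskGraph graph = new TaskGraph("s0")
                .transferToDevice(DataTransferMode.FIRST_EXECUTION, a, b)
                .task("t0", VectorAdd::add, a, b, c)
                .transferToHost(DataTransferMode.EVERY_EXECUTION, c);

        new TornadoExecutionPlan(graph.snapshot()).execute();
    }
}
```

Note that the method body stays plain Java, so the same class still runs sequentially on any ordinary JVM if no TornadoVM or accelerator is present.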
But if you are a power user, if you come from CUDA and you want to put in your barriers, your local memory, all that GPU stuff, you can use the more advanced kernel API.

So I make everything sound perfect, right? TornadoVM solves all problems. Does it? No. Why? Because we have to know when to use it. Not all applications need the raw power of a GPU; only some of them do. Example use cases we have are computer vision, ray tracing, machine learning, and face detection. When there is a lot of compute, a lot of parallel computation and data to process, then I think it makes sense to consider GPU acceleration, and TornadoVM is, in our opinion, one of the easiest ways to achieve that.

I would like to conclude by hopefully showing the ray tracing demo running. We just ported it from Linux to an ARM MacBook, so I cannot guarantee that it will run. Okay. So what do we have here? This is a scene written fully in Java, no secrets here, that does ray tracing. It renders in real time on the CPU, on a single thread, at one or two FPS. If I try to zoom, it's very choppy, because the CPU is struggling at full load. And if I zoom in now, you can see the shadows and the reflections of the light source on each ball. So this is not real time; this is useless. You may ask why we wrote it; I'll show you in a second.

Let's change the implementation. Let's go from pure single-threaded Java to another Java implementation that uses parallel streams. Now I'm at around 10 FPS, because it's a 10-core machine and I can scale out on it. Now it's easier for me to zoom in; before, I was zooming in but you couldn't see it, because it was moving like a turtle.

So let's now go GPU, where the four magics happen. The first magic: it didn't crash. The second magic: the same Java code that was running single-threaded was taken by TornadoVM, compiled to OpenCL, and run on the M1 Pro GPU, which is a really powerful one, for the record. Now we are at 60 FPS, and it's actually real time. You can zoom in and out, and you can also change the shadow and reflection bounces; the more we raise them, the heavier it becomes. So again, this is pure Java code, automatically compiled to OpenCL and running on the GPU.

The third miracle: now all the computation happens on the GPU, so the CPU is sitting idle. So let's run a physics engine on the CPU while the rendering runs on the GPU. All these balls you see bouncing: that code is simply executing on the CPU, while the GPU does the rendering.

The fourth magic is that I didn't have to stop the application. I can switch the Java code between running on the CPU and running on the GPU, and make combinations, while the application is running. This, I would say, is the strength of TornadoVM; it's called dynamic reconfiguration, and this is where the "VM" comes in. TornadoVM internally has its own bytecodes and can recompile the code for GPU or CPU, the same way OpenJDK or any JVM moves code between C1 and C2 without stopping the application. We follow the same ideas, but this time for heterogeneity.

And I think my time is up. Thank you very much, and I pass the microphone to Karm to talk about the beautiful Quarkus. You see? Another miracle: we managed to switch the displays.
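Before the Quarkus half of the talk, here is a companion sketch of the kernel API style mentioned above, again following TornadoVM's published examples (names are version-dependent): the developer works with explicit thread indices, much like in CUDA or OpenCL.

```java
import uk.ac.manchester.tornado.api.KernelContext;

public class VectorAddKernel {

    // Kernel-style version: one logical thread per element, with explicit
    // indexing via KernelContext, similar to get_global_id(0) in OpenCL.
    public static void add(KernelContext context, float[] a, float[] b, float[] c) {
        int i = context.globalIdx;
        if (i < c.length) {
            c[i] = a[i] + b[i];
        }
    }
}
```

Power users can also allocate local (on-chip) memory and synchronize with barriers through the same KernelContext object; the thread grid itself is supplied separately, via TornadoVM's WorkerGrid and GridScheduler classes, when the task is launched.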
And whether my setup survives it. One, one, one. So, I will be talking about Quarkus, which is a Java framework. It's part of the AERO project. It's a suite of libraries tailored to be cloud-native, and in our context that means being very small, both in footprint and in resource consumption. I'll be dropping some buzzwords, and I don't know who is in the audience. If I say Hibernate, does it ring any bells here? Okay, a couple of you. Spring? Some Java libraries. So this is Java on the server side: building Java applications, most of them server-side.

GraalVM is a custom JDK with a custom JIT and the capability to compile various languages into native executables. And Mandrel is what our team focuses on: a distribution of GraalVM that is made smaller because it focuses only on Java; it doesn't deal with other languages. Its main differentiator is that it uses the Temurin JDK as its base and adds the native-image tool to it. So while you are using Mandrel, you are using the Temurin JDK you would like to use, without any additional patches.

native-image is the tool we will be talking about today. It compiles the suite of libraries in Quarkus, plus your application, into a native executable. So you can have a Java application that uses several databases, has database drivers, talks to Elasticsearch, has a lot of dependencies, and native-image will chew through all of that, construct some kind of closed world with all your dependencies, and compile it into a native executable. All resources, all additional files, everything is going to be baked into a single executable.

This closed-world assumption is an important cornerstone of the whole machine. Sometimes you need to specify that you are going to do something at runtime that is not apparent from your source code, but Quarkus helps you with that and does the heavy lifting for you. For instance, if a library, let's say Elasticsearch, does something at runtime that is not apparent at build time, there is a Quarkus extension you depend on, and it prepares things for the compilation, so your native executable won't be surprised at runtime by a missing, let's say, class.

I will jump right into trying it out on an ARM server. Hopefully we are now connected to an Ampere Altra 80-core ARM server. I've got Jenkins running there, and we will build a Quarkus application. I have a JDK downloaded, but not our native-image compiler, because that is going to be used from a container image. While it does its thing, I will continue with the slides and then come back to it.

Quarkus is a huge project, a huge ecosystem of extensions. While your end application is trimmed to the bare minimum, with as small a footprint as possible, the range of available libraries is really huge, and many of them package some native code in their JAR files. Here is an example of the libraries in core Quarkus that have native dependencies; some of them don't currently produce ARM binaries. Those are usually loaded via the Java Native Interface and have some Java fallback. So there is a lot of testing involved to make sure that Quarkus is gradually more and more ready for ARM, and we use various integration test suites for that.
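Returning for a moment to the closed-world point above, here is a concrete illustration using Quarkus's @RegisterForReflection annotation; the DTO class itself is hypothetical:

```java
import io.quarkus.runtime.annotations.RegisterForReflection;

// Hypothetical DTO used only via reflection at runtime, e.g. by a JSON
// mapper calling Class.forName(). The closed-world analysis cannot see
// that from the source, so without this annotation (or a Quarkus
// extension registering it for you) the class could be dropped from the
// native executable and the lookup would fail at runtime.
@RegisterForReflection
public class OrderDto {
    public String id;
    public double total;
}
```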
Some of those test suites are tiny specialized apps or reproducers targeted at particular features. Some are huge suites of, let's say, Quarkus integration tests. And some are artificial applications that jam a lot of stuff on top of each other to really stress the compiler: that it can handle a lot of generated entities in Hibernate or things like that, and that it won't blow up.

You can build Mandrel yourself, and that's not a joke; it's not one of those obscure projects you cannot compile when you download them. The build scripts are written in Java, using JBang, and you can just run them: you give them a JAVA_HOME pointing to a Temurin JDK and a GraalVM GitHub repo checkout, and they compile the native-image compiler for you. For the distribution, we've got a public-facing Jenkins.

And these are our precious loved ones, the bare-metal servers we currently work with the most. They were photographed on my desk, but right now they are safely in a data center. David looks like he doesn't trust me, but it's really done; they are not on the desk anymore. They were donated by ARM to accelerate this effort, and it's part of the AERO work we are doing. They've got the same Neoverse N1 architecture as the current AERO target spec, and they are quite beefy machines, with 80 cores.

The compilation I started is done and Quarkus is running, so I will just scroll back to see what took place. I downloaded a Quarkus demo project, unpacked it, and just ran Maven; that's all that happened with this Quarkus application. It compiled the Java bits to bytecode, and then it realized that I'm trying to do a native-image build and that I don't have any native-image or GraalVM/Mandrel compiler on the path on this system. So it resorted to checking whether I have Docker or Podman installed. It found Podman on the server, so it used Podman to pull a Mandrel builder image, which we regularly update and push to a publicly accessible container registry; in this case, the image was already present on the system.

And this horrendous blob of text was constructed automatically by Quarkus to drive the compilation of your application into a native executable. So that was generated by Quarkus: it figured out what your application needs in order to be properly compiled to a native executable, and that process started. The reason I kind of sneakily jumped elsewhere was that the compilation is by no means instantaneous; it takes some time. It analyzes the classes in the closed world; it analyzes whether there is any JNI access. And it finally produced the executable with machine code that also contains baked-in resources. So for instance, if you have some properties files in your JAR files or something like that, all of that is going to be packed into one single executable. It took 40 seconds to build, and Quarkus is running now; I can access the default web page of this particular demo application.

The same binary, literally the same one I built there, can also run on my phone. I don't know how to connect it to the projector, but it runs there; the phone runs Debian, and Quarkus starts on it in some 50 milliseconds.

To assess the state of things and to gradually make them better, we are collecting metrics about build time and runtime. Build time is what I showed here.
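For reference, the demo project built above is essentially the Quarkus getting-started skeleton, whose default web page is backed by a trivial REST endpoint along these lines (a sketch: the exact generated class and the jakarta vs. javax package prefix depend on the Quarkus version):

```java
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;

// The whole application, this endpoint included, is what native-image
// baked into the single executable in the 40-second build shown above.
@Path("/hello")
public class GreetingResource {

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    public String hello() {
        return "Hello from Quarkus";
    }
}
```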
So, build time: how long it takes to compile the application. The bigger the application and your closed world, the longer it takes, or could take. That's one of the things we look at. And also the runtime: how the application behaves when it runs. We've got a collector tool for that, written in Java using Quarkus. The reason I mention that is that I don't have any Java installed on that server: I build the application in GitHub Actions and just push the executable to the server. So there is a huge, rich Java application talking to a database running on the server, but the server doesn't have any kind of JVM installed on it.

This is how we assess the build-time metrics, and this is an example of some of the ones collected. The most important would be how long it took to compile an application, and the target architecture also matters. Runtime metrics are more interesting. Here is a rough comparison of the same application, which uses a lot of MicroProfile libraries: on HotSpot it takes much more memory to run, while as a native image it's much more memory-efficient. The reason is that it doesn't have to keep a lot of metadata around, because there is no just-in-time compilation and no deoptimization; everything is fixed to what's compiled into the binary, so a lot of stuff is simply not needed at runtime. It also starts quicker. That's not only thanks to the native compilation; it's also thanks to Quarkus, which pre-initializes a lot of stuff at build time and bakes it into the image heap.

I ran through that really quickly, but we agreed to leave five minutes at the end for questions. So we can go back, both of us, to anything that caught your eye, or that you find weird or suspicious and would like to heckle us about or ask about. So shoot. That's silence.

The question was whether TornadoVM uses only public JDK APIs, and whether there is anything to propose to the whole JDK ecosystem. The answer is yes to both questions. At the moment we are using normal Java APIs, but with two caveats. First, the host JVM has to be JVMCI-compatible, so it has to support that interface. Second, as soon as you go to GPUs, the Java spec doesn't apply anymore, because of parallelism: as soon as you go to parallel computation, you cannot guarantee consistency or ordering any longer. But as TornadoVM matures, we would like to propose some changes to the community that would benefit TornadoVM, specifically around native data types and Panama integration.

If it weren't for that one early quitter, you could hear a pin drop in this room. So people are either stunned by how awesome it was, or they didn't understand, or I don't know. Thank you very much. Thank you.