Welcome to my talk, Beyond CUDA: GPU Accelerated Python on Cross-Vendor Graphics Cards with Vulkan and Kompute. My name is Alejandro Saucedo. I am Engineering Director at Seldon Technologies, Chief Scientist at the Institute for Ethical AI, and Member-at-Large at the ACM.

Today we're going to be delving into a very interesting set of topics, primarily around the parallel and GPU computing ecosystem. We're going to talk about what the Vulkan SDK is and how you can use the Kompute framework to build GPU accelerated Python applications, together with a set of hands-on examples, as well as some references you can delve into if you're interested in more.

To start with, we want to cover the motivations: why parallel processing? I'm going to be referencing a research survey that I recommend checking out in its own right, as it collects insights from 227 papers in the parallel deep learning space. It provides a much more intuitive perspective on the trend of GPU adoption, not just for scientific computing but also for general purpose computing. One key observation is how much emphasis falls on functions and paradigms in these computational areas that can be abstracted into highly parallelizable steps; and others that may not themselves be highly parallelizable can often be reduced into equivalent structures that can then be processed with specialized hardware like GPUs, or even more specialized hardware like TPUs.

There are also concepts that are continuously evolving. A simple one you may have come across is micro-batching: instead of processing a single data point, you process several at the same time, so that, intuitively speaking, the computation happens within a single parallel clock iteration rather than submitting each data point separately. There are also very interesting and innovative ways of breaking computation down so it can be processed in parallel and then re-aggregated. And the interesting thing about this paper is not only how the trend is moving towards parallel processing on GPUs and specialized GPU-like hardware, but also that it's expanding into distributed processing that leverages this parallel-capable hardware. So it's a really interesting area.

Now, moving into the ecosystem and the motivations for why we even have a new framework, the Vulkan SDK, which we're going to introduce in a bit: there is a sheer range and heterogeneity when it comes to GPUs and parallel-capable processing hardware like TPUs. It involves multiple different players, multiple different architectures, different drivers, and consequently different frameworks available to take advantage of these parallel capabilities, as well as, for example, the increasingly powerful mobile components that we now carry in our pockets as smartphones. So there is a huge amount of need in this space due to the heterogeneity of hardware across different vendors. And that is one of the key motivations for why Vulkan came to be. Vulkan is a cross-industry initiative that brings in several of the leading industry players to create this open standard.
And I think that's one of the more exciting parts: it's open source and an open standard, focusing not just on interoperability but also on performance, and this is reflected in the interface exposed by the Vulkan SDK.

Now, in regards to the Vulkan C++ SDK, there are several advantages and disadvantages, as with anything. Some of the very strong advantages: it has a very low-level interface with rich access to components, an explicit and verbose C-style API at its core that gives you no language sophistication or abstractions, just pure, direct access not only to the Vulkan SDK but to the hardware underneath, which is very important for the optimizations that are often necessary. There is a broad range of industry-leading players contributing to the standard, the SDKs, and the tooling, which has been very encouraging to see. And there is an emphasis on interoperability and high compatibility across different platforms and suppliers: mobile, AMD, NVIDIA, Qualcomm, et cetera. These are very strong advantages.

But each of these advantages cuts both ways. Being very low level with rich access to components means there is a lot of complexity, a very rich interface that needs to be interacted with. Similarly, with the C-style API, even though it gives you what-you-see-is-what-you-get at the hardware level, that means a lot of domain-specific knowledge is needed to build the foundational layers required before you can start building application components. The broad range of top players, even though it's a great thing, means many opinions and many voices pulling in different directions, even if everyone has the best intentions and is pushing for what they think is best. And high compatibility across multiple platforms means the SDK has to deal with not just a rich interface but a rich set of flexible backends that can interact with the sheer number of different hardware targets underneath.

Now let's see what the architecture of the Vulkan SDK looks like. The Vulkan application is the overarching component. From an application, you can spin up instances. These instances are what allow you to talk to your physical hardware: physical devices, the C++ components that refer to the actual graphics cards in your computer. From a physical device you can then create what are referred to as logical devices, essentially windows or views that allow you to interact with that physical device. You can have multiple logical devices for one physical device, multiple physical devices for an instance, and multiple instances for an application. This is where it starts getting a bit complex. But to keep it simple: with a logical device, the way you interact with the graphics card is through a queue, and to this queue you submit commands, the instructions that need to be executed. Ultimately, this is how you submit instructions to the GPU.
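To make that hierarchy concrete, here is a minimal sketch in Python using the community `vulkan` bindings (pip install vulkan), which mirror the C API one-to-one. This is my own illustrative example rather than code from the talk, and it omits all error handling and cleanup:

```python
import vulkan as vk

# Application -> Instance
app_info = vk.VkApplicationInfo(
    sType=vk.VK_STRUCTURE_TYPE_APPLICATION_INFO,
    pApplicationName="kompute-example",
    applicationVersion=vk.VK_MAKE_VERSION(1, 0, 0),
    pEngineName="none",
    engineVersion=vk.VK_MAKE_VERSION(1, 0, 0),
    apiVersion=vk.VK_API_VERSION_1_0,
)
instance = vk.vkCreateInstance(
    vk.VkInstanceCreateInfo(
        sType=vk.VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,
        pApplicationInfo=app_info,
    ),
    None,
)

# Instance -> PhysicalDevice (the actual graphics card)
physical_device = vk.vkEnumeratePhysicalDevices(instance)[0]

# Find a queue family on that card which supports compute workloads
queue_families = vk.vkGetPhysicalDeviceQueueFamilyProperties(physical_device)
compute_family = next(
    i for i, props in enumerate(queue_families)
    if props.queueFlags & vk.VK_QUEUE_COMPUTE_BIT
)

# PhysicalDevice -> (logical) Device -> Queue
queue_info = vk.VkDeviceQueueCreateInfo(
    sType=vk.VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO,
    queueFamilyIndex=compute_family,
    queueCount=1,
    pQueuePriorities=[1.0],
)
device = vk.vkCreateDevice(
    physical_device,
    vk.VkDeviceCreateInfo(
        sType=vk.VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
        queueCreateInfoCount=1,
        pQueueCreateInfos=[queue_info],
    ),
    None,
)
queue = vk.vkGetDeviceQueue(device, compute_family, 0)
# Commands are then recorded into command buffers and submitted to this queue.
```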
Once you want to run more complex workloads, that's where the pipeline comes in. There's the compute pipeline and the graphics pipeline, but we're only going to be talking about the compute pipeline. This is all just to provide an intuition; we have another talk you'll be able to reference, as well as documentation that explains all of this in detail. This is just to give you an intuition before we delve into the deeper parts.

With a pipeline, you're able to say: I want to run a specific set of instructions, a piece of code or algorithm that is referred to as a shader, or a shader module. A shader basically looks like a piece of C code that just so happens to run on the GPU itself. The shader in the pipeline will also interact with data, and this data is referenced through the concept of descriptor sets. Again, you're not going to have to know all of these details, but it's still important to get an intuition for what's happening under the hood. Descriptor sets are like a container that says: I'm going to be using this GPU-visible data, so that when you run a set of instructions you can reference these resources.

And you can see that, because we're interacting through this queue, the CPU sends instructions to a GPU that has its own memory address space. Even though there are shared address spaces, meaning the CPU may have access to see some of the data in GPU memory, there are separate address spaces for different computational areas in the GPU hardware, which means your code, your C++ or your Python, will not be able to see that memory space directly. So you interact with your GPU through this SDK almost as if you were interacting with a remote service, sending requests for the service to execute as if it were an API. Even though that's not strictly correct, because this is your own machine, it's a good intuitive way to see how you interact with the GPU, given that it happens through this queue with asynchronous command buffers. And of course there are primitives that allow you to wait until execution has finished, which we may talk about later on.

Now, building the foundational code required to run even a simple program in Vulkan takes from 500 to 2,000 lines of C++ code. And this is one of the motivations for Kompute, the Kompute framework. Kompute enables developers to get started with the Vulkan SDK in dozens instead of thousands of lines of code. The key thing to emphasize here is the core principle: to augment the Vulkan interface instead of abstracting or hiding it. It has a bring-your-own-Vulkan interface, which plays nicely with existing Vulkan applications; if you already have a Vulkan application that renders graphics, you can pass those Vulkan components in. And it has a non-Vulkan naming convention, basically to avoid ambiguity, as there are libraries that may have classes called Buffer, and you'd be left asking: is this Buffer from Vulkan or from this other library? Things like that.
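As a taste of what such a shader looks like, here is a minimal GLSL compute shader held as a Python string, the way compute code is commonly embedded. This is a generic illustrative example, not the exact shader from the talk; it multiplies two buffers elementwise, with each GPU invocation handling one index:

```python
# Illustrative GLSL compute shader: note the C-like syntax the talk mentions.
# Each binding corresponds to one GPU-visible buffer in a descriptor set.
MULTIPLY_SHADER = """
#version 450
layout (local_size_x = 1) in;
layout (set = 0, binding = 0) buffer bufA   { float a[]; };
layout (set = 0, binding = 1) buffer bufB   { float b[]; };
layout (set = 0, binding = 2) buffer bufOut { float o[]; };

void main() {
    // One invocation per element: the global invocation id is the index.
    uint i = gl_GlobalInvocationID.x;
    o[i] = a[i] * b[i];
}
"""
```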
Now, in regards to other features: it has a C++ interface, but also the Python interface that we're going to be using today. It has explicit CPU and GPU memory ownership, which is important if you're using non-Kompute Vulkan components, that is, if you're already using raw Vulkan in another area of your code. It gives you granular access to GPU queues, which is very important for optimizations; we have material that explores that in more detail. There is a single header file for C++ development, and a PyPI module for easy installation on the Python side. And it has integration with mobile apps through the Android NDK, as well as with game engines such as Godot, which is also part of the FOSDEM conference, and there will be a few links about that.

So what does the Kompute architecture look like? It's conceptually quite simple. Everything starts with the Kompute Manager. The Manager is the component that oversees and manages all the relevant memory created through your interaction with the Kompute application. The Manager handles the device and the queue, which we talked about; you don't need to go into much depth there, but it's still important. You then have sequences, and sequences are basically single or batched operations to run on the GPU. You can have multiple sequences, and each sequence can have one or many operations. Each operation performs a specific action. An operation can have one or multiple tensors, and tensors are abstractions over GPU and CPU memory, as well as the workflows for moving that memory around. Optionally, an operation can also have what in the Kompute world is referred to as an algorithm. An algorithm abstracts the concepts of the Vulkan pipeline, the descriptor sets, and the specific shader code, so you can say: I want to run this code on the GPU, with these data structures, in this particular way. We'll see what that looks like in more practical terms, but that's basically it; there's not much more to it. And there's an almost one-to-one mapping between the Kompute components and the Vulkan components, which is explicitly to reduce ambiguity.

Now let's see what that looks like. In Python, you would first create a simple Manager. You would then create a set of tensors; you can pass NumPy arrays or Python lists. In this case it's just two tensors that are going to be used in a multiplication, plus the output tensor where we're going to store the results. You would normally initialize the tensors explicitly, but there's a set of helper functions that take your CPU host-memory list and copy it into GPU-only memory, so all of this is handled for you. But again, it's not handled through magic: every single step can be accessed, and you can call those steps yourself if you wish to, which is very important for several optimizations. You then define the shader code. This is the code that you're going to be running on the GPU; in this case it's just a simple multiplication, written using the PyShader decorator.
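Pulling those steps together, the end-to-end example looks roughly like this. This sketch follows the Kompute Python API as it existed around the time of this talk (the `kp` module together with the pyshader library); function names such as `eval_tensor_create_def` and `eval_algo_data_def` changed in later releases, so treat the exact calls as assumptions and check the current documentation:

```python
import kp
from pyshader import python2shader, f32, ivec3, Array

# 1. Create the Kompute Manager (selects GPU device 0 by default)
mgr = kp.Manager()

# 2. Create Kompute Tensors from Python lists (NumPy arrays also work)
tensor_in_a = kp.Tensor([2, 2, 2])
tensor_in_b = kp.Tensor([1, 2, 3])
tensor_out = kp.Tensor([0, 0, 0])

# 3. Helper that allocates GPU memory and copies the host data into it
mgr.eval_tensor_create_def([tensor_in_a, tensor_in_b, tensor_out])

# 4. The shader: plain Python compiled to SPIR-V by the pyshader decorator
@python2shader
def compute_shader_multiply(index=("input", "GlobalInvocationId", ivec3),
                            data_1=("buffer", 0, Array(f32)),
                            data_2=("buffer", 1, Array(f32)),
                            data_out=("buffer", 2, Array(f32))):
    i = index.x
    data_out[i] = data_1[i] * data_2[i]

# 5. Run the shader synchronously against the three tensors
mgr.eval_algo_data_def([tensor_in_a, tensor_in_b, tensor_out],
                       compute_shader_multiply.to_spirv())

# 6. Copy the result from GPU memory back so the CPU can see it
mgr.eval_tensor_sync_local_def([tensor_out])

print(tensor_out.data())  # expected: [2.0, 4.0, 6.0]
```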
In this case, we have the first buffer and the second buffer; we're going to run a multiplication and store the result in the output. That's basically it; this is going to run on the GPU. And then we're going to actually run it through the Manager: we say we want to run this synchronously, using these three tensors and this shader, and we just pass that in. It runs synchronously. We could also run it asynchronously, and there are a lot of optimizations we can do, which we're not going to delve into just yet. Once it's finished running, we copy the data back, making sure the output is now visible on the CPU, and then we can print it. And you can see that our output is 2, 4, 6. That's basically it; that's all there is to it, and this is all you need to do to trigger all of the crazy stuff that happens underneath. But again, all of that underlying machinery is really interesting, and it's very rich and very relevant once you start doing optimizations. So ultimately: not to hide, but to augment. That's the key thing.

So now let's delve into some of those optimizations I've been mentioning. What we did just now was run a single command, a single operation, through the Manager: the CPU was running, we submitted the operation, we waited, we came back, we submitted the second one, and we came back again. What we can do instead is reuse sequences, which means we can pre-record commands: we record a batch of operations and then run them as one submission, so the CPU dispatches once, the operations run on the GPU, and then we wait for the results to come back. We can also run asynchronous dispatches, or submissions, of the commands, which means we don't have to wait for the GPU to finish: we submit, the CPU continues doing other things while the GPU does its work, and then we can submit something else asynchronously and carry on. There's also an await function that allows you to wait for things to finish; you'll see a rough sketch of this pattern below.

And finally, something we're not covering in this presentation, but which is also very interesting: you can leverage GPU hardware concurrency to submit multiple batches of operations into different GPU queues, which can then potentially run in parallel. This depends on the hardware properties of your GPU and its queue families. For example, on my NVIDIA GTX 1650, I have the ability to run hardware-concurrent batches of GPU load if I submit them to one compute-family queue and one graphics-family queue. We're not going to delve into that, but if you're interested, there's a lot of relevant content in the documentation, as well as in our other talk.
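As a quick sketch of the asynchronous dispatch pattern just described, building on the example above. The names `eval_async_algo_data_def` and `eval_await_def` are my best recollection of the v0.5-era API and may differ in your version, so treat them as assumptions:

```python
# Submit the same multiplication asynchronously: the call returns immediately,
# leaving the CPU free while the GPU executes the command buffer.
mgr.eval_async_algo_data_def([tensor_in_a, tensor_in_b, tensor_out],
                             compute_shader_multiply.to_spirv())

# ... do other useful CPU work here while the GPU is busy ...

# Block until the GPU submission has finished, then fetch the results.
mgr.eval_await_def()
mgr.eval_tensor_sync_local_def([tensor_out])
```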
But today, we're going to be covering something that delves into the world of machine learning. And what better thing to cover than the hello world of machine learning: logistic regression. We'll take a specific data point and classify it as either true or false, a simple binary classification, and we're going to let the machine do the learning on the GPU. So what are we going to be doing?

In terms of intuition, we're going to have input data that looks like two numbers, which go through our machine learning model to produce a prediction, which ultimately should match what we expect. Our training data is going to look like this: when we see zero and zero, we expect zero; when we see zero and one, we expect one; when we see one and one, we expect one. And we have a bunch of training data points along those lines. Extremely simple, I know, but this is to make sure the intuition comes through, as opposed to the machine learning itself; this is not a machine learning talk. What we're doing is training a machine learning model: learning the parameters that ensure that every time we see these inputs, we produce the respective outputs. There is a more in-depth blog post that covers the underlying functions and the way everything is broken down; we're going to skim through some of it, but if you're curious you can delve into that.

We're going to be trying to find the parameters of this function. The input is that specific pair, x1 and x2, that you saw. And because we want to leverage the hardware's parallel capabilities, we're going to submit multiple data points as micro-batches: instead of running them one by one, we're going to run five at the same time, so the GPU processes five and then comes back to us. We're going to be learning two parameters, W and B, and this is the function that calculates the prediction. We're not going to delve too deeply into the maths; the blog post covers it in more detail, and there are thousands of talks about logistic regression in Python, so you're more than welcome to check those out. The key thing here is that this will be the shader code we're going to write, and even though we won't cover it in much detail, we're still going to look at what's required to write it.

And on the Kompute side, which is what's going to be running this shader, we're going to have to create a set of tensors that represent our input data, our parameters, our predictions, the training data, et cetera. We're going to initialize those tensors on a sequence, initialize the algorithm, and record it all; that's just the initialization. And then we're going to iterate and let the machine do the learning: we're going to run multiple iterations over that data set, updating the parameters every time, running micro-batches in parallel on the GPU. Once we've iterated 100 times, we will have learned those parameters.
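For reference, these are the standard logistic regression equations that the shader implements; this is the textbook formulation rather than something lifted verbatim from the slides. For a micro-batch of $m$ data points $x^{(i)} = (x_1^{(i)}, x_2^{(i)})$ with labels $y^{(i)}$:

```latex
\hat{y}^{(i)} = \sigma\!\left(w^{\top} x^{(i)} + b\right)
             = \frac{1}{1 + e^{-(w^{\top} x^{(i)} + b)}}

\mathcal{L}(w, b) = -\frac{1}{m} \sum_{i=1}^{m}
    \left[ y^{(i)} \log \hat{y}^{(i)}
         + \left(1 - y^{(i)}\right) \log\!\left(1 - \hat{y}^{(i)}\right) \right]

\frac{\partial \mathcal{L}}{\partial w}
    = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right) x^{(i)},
\qquad
\frac{\partial \mathcal{L}}{\partial b}
    = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)
```

Each GPU invocation computes the per-data-point terms of these sums in parallel (that's the micro-batch), and the host aggregates them and applies the gradient descent update $w \leftarrow w - \alpha \, \partial\mathcal{L}/\partial w$, with $\alpha$ the learning rate.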
So that's basically what we're going to be doing at the level of the high-level logic; this is just the intuition. The key thing here is to see what's happening on the Kompute side and on the shader side. The shader looks like a more complex version of what we saw earlier. We have all of the inputs: the x values, x1 and x2, each of them an array because we're using micro-batches; the expected outputs; the weights coming in; and the weights that come out after being calculated. Remember that the parameters we're learning are W and B, so those are the things we want to get out. We're also outputting the loss, so we can reuse it where relevant, and passing in the number of data points. So what we're going to be doing is taking the specific input weights, which are continuously being updated (these are the parameters we update on each execution, so we have to pass them in every time) and calculating the function we just saw. In the blog post I break down, in very minute detail, how we go through each of these steps and calculate the partial derivatives of each of them, as well as the respective loss. But here the key thing is that we're ultimately able to calculate the updated parameters.

Now, on the Kompute side, we first create all of the specific tensors. As we saw, we have some training data: zero and zero should give zero, zero and one should give one, one and one should give one, et cetera. Then we set up our weights: we start with a simple initialization, which we're then going to iterate away from, and we similarly start our other parameter, the bias, at zero. The number of data points is going to be the actual size of these arrays, so we start with five. We store all of that in a variable called parameters so we can reference them. We then initialize everything by creating our Manager and initializing the tensors. This initializes them explicitly, whereas before we did it implicitly with utility functions; here we're saying: initialize all of the parameters on the GPU so they're accessible in GPU memory.

Then we create and record the operations in the sequence. What are the operations? First, we sync the data to the device for the two parameter tensors, because remember, the parameters are updated every iteration, so we put them into GPU device memory. We then record the algorithm we just wrote, the logistic regression shader you saw, so the execution itself is recorded. And then we record a sync to local: for all of the weights, the parameters, and the loss to be copied back to the host so that Python can see them. Finally, we iterate 100 times, on every iteration running the sequence, which replays all of the things we just recorded, and then updating all of the weights.
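Here's a condensed sketch of that host-side flow, under the same assumptions as before: the v0.5-era `kp` API, where the sequence-recording method names and `set_data` are my best recollection and should be verified against your version. The training data and learning rate are illustrative, and `compute_shader` stands for the logistic regression shader discussed above (defined in the blog post, not here):

```python
import kp
import numpy as np

# Illustrative training data as a micro-batch of five points (x1, x2 -> y)
tensor_x_i = kp.Tensor([0, 0, 1, 1, 0])
tensor_x_j = kp.Tensor([0, 1, 0, 1, 1])
tensor_y   = kp.Tensor([0, 1, 1, 1, 1])

# Parameters to learn (w, b), plus per-point gradient and loss outputs
tensor_w_in    = kp.Tensor([0.001, 0.001])
tensor_w_out_i = kp.Tensor([0, 0, 0, 0, 0])
tensor_w_out_j = kp.Tensor([0, 0, 0, 0, 0])
tensor_b_in    = kp.Tensor([0])
tensor_b_out   = kp.Tensor([0, 0, 0, 0, 0])
tensor_l_out   = kp.Tensor([0, 0, 0, 0, 0])
tensor_m       = kp.Tensor([5])  # number of data points

params = [tensor_x_i, tensor_x_j, tensor_y,
          tensor_w_in, tensor_w_out_i, tensor_w_out_j,
          tensor_b_in, tensor_b_out, tensor_l_out, tensor_m]

mgr = kp.Manager()
mgr.eval_tensor_create_def(params)  # allocate everything in GPU memory

# Record the per-iteration operations once, then replay them every epoch
seq = mgr.create_sequence()
seq.begin()
seq.record_tensor_sync_device([tensor_w_in, tensor_b_in])      # push params
seq.record_algo_data(params, compute_shader.to_spirv())        # the LR shader
seq.record_tensor_sync_local([tensor_w_out_i, tensor_w_out_j,  # pull gradients
                              tensor_b_out, tensor_l_out])
seq.end()

learning_rate = 0.1
for _ in range(100):
    seq.eval()  # replay the pre-recorded micro-batch on the GPU
    # Aggregate per-point gradients and apply the gradient descent update
    w_i = tensor_w_in.data()[0] - learning_rate * np.sum(tensor_w_out_i.data())
    w_j = tensor_w_in.data()[1] - learning_rate * np.sum(tensor_w_out_j.data())
    b   = tensor_b_in.data()[0] - learning_rate * np.sum(tensor_b_out.data())
    tensor_w_in.set_data([w_i, w_j])  # set_data is assumed; verify per version
    tensor_b_in.set_data([b])

print(tensor_w_in.data(), tensor_b_in.data())  # learned weights and bias
```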
I think there's an indentation missing on this slide, but what is happening here is that we're updating the parameters by the specific learning rate, which is just how fast we want the parameters to be updated on each iteration. Again, the key thing is to see all of the features you're able to use with Kompute, leveraging a simple machine learning use case as an example that, granted, we're skimming through; apologies to anyone watching and thinking, well, that's not 100% correct. In the blog post I break it down in much more minute detail. This is just to show how you're able to interact with the GPU and optimize in different areas: by pre-recording components, by using the sequence, et cetera. So in this case we're just iterating. Once you're finished, you can print the calculated parameters, which in this case are the learned weights and the bias; that's ultimately what we end up with, and with those you can calculate the outputs for new inputs.

And just to emphasize, we covered this example at a high level, but as I mentioned, we have blog posts that cover it, one in Python and one in C++, and break it down in minute detail. We have other tutorials and examples that cover how to use the C++ interface, as opposed to the Python one, for integrating with your Android apps, as well as with game engines like the Godot engine, which we'd recommend checking out.

And more than anything, what I recommend is to get involved. If you go to github.com/EthicalML/vulkan-kompute, you'll be able to check out the open issues. You can take one of the issues labeled "good first issue". And there's also issue number 52, which is open for general discussion, so if you have ideas for improvements or questions, you can post them there; we've had some really interesting suggestions. Some of the key things on the roadmap: one of the main motivations for building this framework is to integrate it as a backend of an existing scientific computing framework, potentially one being used for mobile machine learning or other types of use cases, so if someone is running a scientific computing library, we'd be open to exploring that. Also, creating more default operations, something like a fast Fourier transform or a parallel sum reduction; it would be really cool to have out-of-the-box operations like that, perhaps written in C++ but also exposed in Python. And then also adding examples: if you try this with a new shader, a new algorithm, or a new type of machine learning model, we'd love for you to contribute it upstream and add it to the repo, because I think that would be very cool.

So with that said, I think that's everything we had to cover today. Thank you very much for joining this talk on Beyond CUDA: GPU Accelerated Python on Cross-Vendor Graphics Cards with Vulkan and Kompute. We're looking forward to exploring and hearing your thoughts, ideas, and suggestions, and if you have any questions, please feel free to reach out. Thank you very much.