Welcome to my talk, Beyond CUDA: GPU Accelerated Python on Cross-Vendor Graphics Cards with Vulkan and Kompute. My name is Alejandro Saucedo. I am Engineering Director at Seldon Technologies, Chief Scientist at the Institute for Ethical AI, and Member-at-Large at the ACM. Today we're going to delve into a very interesting set of topics, primarily around the parallel and GPU computing ecosystem. We're going to talk about what the Vulkan SDK is and how you can use the Kompute framework to build GPU accelerated Python applications, together with a set of hands-on examples and some references you can delve into if you're interested in going further.

To start with, we want to cover the motivations for parallel processing, and I'm going to reference a research survey that I recommend checking out in itself: it collects insights from 227 papers in the parallel deep learning space, which provides a much more intuitive perspective on this trend of GPU adoption, not just for scientific computing but also for general purpose computing. One really interesting observation is the emphasis on functions and paradigms in these computational areas that can be abstracted into highly parallelizable steps; others that are not in themselves highly parallelizable can often be reduced into equivalent structures that can then be processed with specialized hardware like GPUs, or even more specialized accelerators. There are also concepts that are continuously evolving.
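To make the idea of "reducing a computation into an equivalent, parallelizable structure" concrete, here is a minimal CPU-only sketch in plain Python (not tied to any GPU API): a sequential sum rewritten as a pairwise reduction tree, which is exactly the shape that maps well onto parallel hardware.

```python
# A sequential sum is inherently serial: each step depends on the previous one.
def sequential_sum(xs):
    total = 0
    for x in xs:
        total += x
    return total

# The same computation reduced to an equivalent tree structure: at every
# round, adjacent pairs are combined. Each round's additions are independent,
# so on a GPU they could all run in parallel (log2(n) rounds instead of n steps).
def pairwise_sum(xs):
    xs = list(xs)
    while len(xs) > 1:
        if len(xs) % 2:          # pad odd-length rounds with a neutral element
            xs.append(0)
        xs = [xs[i] + xs[i + 1] for i in range(0, len(xs), 2)]
    return xs[0]

print(sequential_sum(range(8)))  # 28
print(pairwise_sum(range(8)))    # 28
```

Both functions compute the same result; only the dependency structure changes, and it is that structural change that lets specialized hardware exploit the parallelism.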
A simple one that you may have come across is micro-batching: instead of processing a single data point and submitting each one separately, you process several at the same time, so that intuitively the computation happens within a single parallel clock iteration. There are also very interesting and innovative ways of breaking computation down so that it can be processed in parallel and then re-aggregated. And the interesting thing about this paper is not only how the trend is moving towards parallel processing on GPUs and specialized GPU-like hardware, but also that it's expanding into distributed processing that leverages this parallel-capable hardware, so it's a really interesting area.

Now, delving into the ecosystem and the motivations for why we even have a new framework, the Vulkan SDK, which we'll introduce in a bit: there is a sheer range and heterogeneity of GPU and parallel-capable processing hardware, involving multiple players, multiple architectures, different drivers, and consequently different frameworks available to take advantage of these parallel capabilities, as well as, for example, the increasingly powerful mobile components that we now carry in our pockets as smartphones. There is a huge amount of need in this space due to the heterogeneity of hardware across vendors, and that is one of the key motivations for why Vulkan came to be.
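As a CPU-only illustration of the micro-batching idea (plain NumPy, nothing Kompute-specific; the weights and inputs here are illustrative assumptions): one vectorized operation over a batch replaces many separate per-item submissions.

```python
import numpy as np

def predict_one(x, w, b):
    # One submission per data point: n separate round trips to the device.
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

def predict_batch(X, w, b):
    # One submission covering the whole micro-batch: the rows of X are
    # independent, so parallel hardware can process them all at once.
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

w, b = np.array([0.5, -0.25]), 0.1
X = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, 0.5]])

one_by_one = np.array([predict_one(x, w, b) for x in X])
batched = predict_batch(X, w, b)
assert np.allclose(one_by_one, batched)  # identical results, one dispatch instead of three
```

The results are identical; what changes is the number of submissions, which is exactly what matters when each submission is a trip through a GPU queue.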
Vulkan is a cross-industry initiative that brings together several of the leading industry players to create an open source, open standard (and I think that's one of the more exciting parts) that focuses not just on interoperability but also on performance, and this is reflected in the interface exposed by the Vulkan SDK.

In regards to the Vulkan C++ SDK, there are several advantages and disadvantages, as with anything. Among the strong advantages: it has a very low-level interface with rich access to components, an explicit and verbose C-style API at its core that provides no language sophistication or abstractions, just pure, direct access not only into the Vulkan SDK but into the hardware underneath, which is very important for the optimizations that become necessary. A broad range of industry-leading players contribute to the standard, the SDKs, and the tooling, which has been very encouraging to see, and there is an emphasis on interoperability and high compatibility across different platforms and suppliers: mobile, AMD, NVIDIA, Qualcomm, and so on.

Those same strengths come with disadvantages. The very low-level interface with rich access to components means there is a lot of complexity and a very rich interface to interact with. Similarly, the C-style API, even though it gives you what-you-see-is-what-you-get access at the hardware level, requires a lot of domain-specific knowledge just to build the foundational layers before you can start building application components. The broad range of top players, even though it's a great thing, means many opinions and many voices pulling in different directions, even when everyone has the best in mind. And high compatibility across multiple platforms means the SDK needs not just a rich interface but a rich set of flexible backends to interact with the sheer number of different pieces of hardware underneath.

Now let's see what the architecture of the Vulkan SDK looks like. The Vulkan application is the overarching component; from an application you can spin up instances, and these instances are what allow you to talk to your physical hardware. A physical device is the C++ component that refers to the actual graphics card in your computer, and you can then create what are referred to as logical devices, windows or views that allow you to interact with that physical device. You can have multiple logical devices for one physical device, multiple physical devices per instance, and multiple instances per application, so this is where it starts getting a bit complex. To keep it simple: with a logical device, the way you interact with the graphics card is through a queue, and to this queue you submit commands, the instructions that need to be executed; ultimately this is how you submit instructions to the GPU. Once you want to run more complex components, that's where the pipeline comes in; there's the compute pipeline and the graphics pipeline, but we're only going to talk about the compute pipeline. This is all just to provide an intuition; we actually have another talk that you'll be able to reference and check out, as well as documentation that
explains all of this in detail; this is just to give you an intuition before we delve into the deeper depths of everything. With the pipeline, you're able to say: I want to run a specific set of instructions, as per a piece of code or an algorithm that is often referred to as a shader, or shader module. This shader is basically going to look like a piece of C code that just so happens to run on the GPU itself. The shader in the pipeline also interacts with data, and this data is referenced through the concept of descriptor sets. Again, you're not going to have to know all of these different things, but it's still important to get an intuition of what's happening under the hood: a descriptor set is like a container that says, I'm going to be using this GPU-visible data, so that when you run a set of instructions it can reference those resources.

And you see that, because we're interacting with this queue, the CPU sends requests, sends instructions, to the GPU, which has its own memory address space. Even though there are shared address spaces, which means the CPU may have access to see some of the data in GPU memory, there are different address spaces for different computational areas in the actual hardware of the GPU, which means your C++ or your Python code will not be able to see that memory space. That means you interact with your GPU through this SDK akin to how you would interact with a remote service, sending requests for the service to execute as if it were an API. Even though this is not strictly correct, because it's your own machine, it is a good intuitive way to see how you're interacting with that GPU, given that it is through this queue, with asynchronous command buffers that are executed; and of course you then have all the mechanisms that allow you to await until execution has finished, which we may talk about later on.

Building the foundational code required to run even a simple program in Vulkan takes from 500 to 2,000 lines of C++ code, and this is one of the motivations for Kompute, the Kompute framework. Kompute enables developers to get started interacting with the Vulkan SDK with dozens instead of thousands of lines of code. The key thing to emphasize is the core principle: to augment the Vulkan interface instead of abstracting or hiding it. It has bring-your-own-Vulkan-interface support, which plays nicely with already existing Vulkan applications; so if you already have a Vulkan application that renders graphics, you'd be able to pass in those Vulkan components. And it has a non-Vulkan naming convention, which we're going to talk about; it's basically to avoid ambiguity, as there are libraries that may have classes called, say, Buffer, and then: is this Buffer from Vulkan or from this other application? Things like that.

In regards to other features: it has a C++ interface, but also the Python interface that we're going to be using today. It has explicit CPU and GPU memory ownership, which is important when you're using non-Kompute Vulkan components, that is, if you're already using raw Vulkan in another area of your code. It gives you granular access to GPU queues, which is very important for optimizations, and we have material that explores that in more detail. There is a single header file for C++ development and a PyPI package for easy installation on the Python side, and it has integration with mobile apps through the Android NDK as well as game engines such as Godot, which is also part of the
FOSDEM conference, and there will be a few links about that.

So what does the Kompute architecture look like? It's relatively, or at least conceptually, simple. Everything starts with the Kompute Manager, the component that oversees and manages all the explicitly relevant memory created through your interaction with the Kompute application; the manager handles the device and the queue, which we talked about, and you don't need to go into much depth on those now, but it's still important. You then have Sequences, which are single operations or batches of operations to run on the GPU; you can have multiple sequences, each sequence can have one or many operations, and each operation performs a specific action. An operation can have one or multiple Tensors, and tensors are abstractions over GPU and CPU memory as well as the workflows for moving that memory around. Optionally, an operation can also have what in the Kompute world is referred to as an Algorithm; an algorithm abstracts the Vulkan pipeline, the descriptor sets, and the specific shader code, so you can say: I want to run this code on the GPU, with these data structures, in this particular way. We'll see what that looks like in more practical terms, but that's basically it; there's not much more to it, and there is an almost one-to-one mapping between the Kompute components and the Vulkan components, explicitly to reduce ambiguity.

Now let's see what that looks like. In Python, you would first create a simple manager. You would then create a set of tensors; you can pass NumPy arrays or Python lists. In this case it's just two tensors that are going to be used in a multiplication, plus the output tensor where we're going to save the results. You normally would initialize the tensors explicitly, but there's a set of helper functions that take your CPU host-memory list and copy it into GPU-only memory; all of this is handled for you, but not through magic: every single step can be accessed, and you can call those things yourself if you wish, which is very important for several optimizations. You then define the shader code, the code that is going to run on the GPU; in this case it's just a simple multiplication, written using the pyshader decorator. We take the first buffer and the second buffer, run a multiplication, and store the result in the output. That's it; this is going to run on the GPU. Then we run it through the manager: we say we want to run this synchronously, using these three tensors and this shader. We can also run it asynchronously, and we can do a lot of optimizations which we're not going to delve into here. Once it has finished running, we copy the data back, making sure the output is now visible to the CPU, and print it, and we can see that our output is [2, 4, 6]. That's all there is to it, and this is all you need in order to do all of the crazy stuff that happens underneath; but again, all of that underlying machinery is really interesting, very rich, and very relevant once you start doing optimizations. Ultimately: not to hide, but to augment; that's the key thing.

Now let's delve into some of those optimizations I've been mentioning. What we did
just now was to run a single command, or operation, through the manager: the CPU was running, we submitted the operation, we waited, and then we came back; we then submitted the second one, and came back again. What we can also do is reuse sequences, which means we can pre-record commands: we record a batch of operations and then run them, so those operations already exist on the GPU side and the CPU just waits until execution comes back. We can also run asynchronous dispatches, or submissions, of the commands, which means we don't have to wait for the GPU to finish: we submit, the CPU continues doing other things while the GPU works, and then we can submit something else asynchronously; there's also an await function that lets you wait for execution to finish. Finally, something we're not covering in this presentation but which is also very interesting: you can leverage the GPU hardware's concurrent queues to submit multiple batches of operations to different GPU queues, which can then potentially run in parallel. This depends on the hardware properties of your GPU and its queue families; for example, on my NVIDIA GTX 1650 I have the ability to run hardware-concurrent batches of GPU load if I submit them to one compute-family queue and one graphics-family queue. We're not going to delve into that, but if you're interested there's a lot of relevant content in the documentation as well as in our other talk.

Today we're going to cover something from the world of machine learning, and what better thing to cover than the hello world of machine learning: logistic regression. We take a specific data point and classify it as either true or false; in this case it's just a binary classification, and we're going to let the machine do the learning on the GPU. In terms of intuition: we have input data, which looks like two numbers, that goes through our machine learning model, which performs a prediction, and that prediction should ultimately match what we expect. The training data we're going to use is: when we see (0, 0) we expect 0; when we see (0, 1) we expect 1; when we see (1, 1) we expect 1; and we have a bunch of training data points that look like this. Extremely simple, I know, but it's just to make sure the intuition comes through, as opposed to the machine learning itself; this is not a machine learning talk. What we want to do is train a machine learning model: learn the parameters that ensure that every time we see these inputs, we produce the respective outputs. There is a more in-depth blog post that covers the underlying functions and the way everything is broken down; we're going to skim through some of that, and if you're curious you can delve into it, but we're still going to talk about what's actually happening. We're going to find the parameters of this function, whose input is that specific (x1, x2) pair you just saw. And because we want to leverage the hardware's parallel capabilities, we're going to submit multiple inputs as micro-batches: instead of running them one by one, we'll run five at the same time, so the GPU runs five and then comes back to us. That's how we're going to be able to
do this. We're going to learn the two parameters, w and b, and this is the function that calculates the prediction. We're not going to delve into too much depth; there's a blog post that covers this in a bit more detail, and there are thousands of talks on logistic regression in Python, so you're more than welcome to check those out. The key thing here is that this will be the shader code that we're going to write; even though we won't cover it in much detail, we're still going to look at what's required to write it, and on the Kompute side, what's going to run it. We'll have to create a bunch of tensors that represent our input data, our parameters, our predictions, the training data, and so on. We'll initialize those tensors on a sequence, initialize the algorithm, and record it, which is just the setup; then we iterate and let the machine do the learning. We'll run multiple iterations over that data set, updating the parameters every time, running micro-batches that execute on the GPU in parallel; once we've iterated 100 times, we'll have learned those parameters. That's the high-level logic, and it's just an intuition; the key thing is to see what's happening on the Kompute side and on the shader side.

The shader looks like a much more complex version of what we saw earlier. We have all the inputs: xi, which is x1 and x2, each of them an array because we're using micro-batches; we have the expected outputs; we have the weights coming in; and we have the weights going out, as calculated. Remember that the parameters we're learning are w and b, so those are the things we want to take out; we're also taking the loss out, to be able to reuse it where relevant, along with the number of data points to use. What we're doing here is taking the specific input weights, which are continuously being updated (these are the parameters we update on each execution, so we have to pass them in every time), and then calculating the function we just saw. In the blog post I break down in minute detail how we go through each of these steps and calculate the partial derivatives of each of them, as well as the respective loss; but the key thing is that we're ultimately able to calculate the parameters.

On the Kompute side, we first create all the specific tensors. As we saw, we have some training data: (0, 0) equals 0, (0, 1) should be 1, (1, 1) should be 1, and so on. Then we set up our weights, starting from a simple initialization that we're going to iterate away from; similarly, our other parameter, the bias, starts at zero. The number of data points is the actual size of these inputs, so we start with five. We store all of that in a variable called parameters so that we can reference them. We then create our manager and initialize the tensors. What this
does is initialize them explicitly, whereas before we did it implicitly with a utility function; here we're saying: initialize all of the parameters on the GPU so they're accessible in GPU memory. Then we create and record the operations in the sequence. What are the operations? First we sync the data to the device for these two tensors, because remember, the parameters are going to be updated every iteration, so we put them into GPU device memory. Then we record the algorithm we just wrote, the logistic regression shader you saw, so we record that execution. And then we record a sync to local, so that the weights, the parameters, and the loss get copied back to the host where Python can see them. Finally, we iterate 100 times, each iteration running the sequence, which is everything we just recorded, and then updating the weights. I think there's an indentation missing here, but what's happening is that we update the parameters by the learning rate, which is just how fast we want the parameters to be updated on each iteration. Again, the key thing is to see all of the features you're able to use with Kompute, leveraging a simple machine learning use case as an example; I'm conscious we're skimming through it, so apologies to those watching who may be thinking, well, that's not 100% correct, but in the blog post I break it down in much more minute detail. This is just to show how you can interact with the GPU and optimize in different areas, for example by pre-recording components using the sequence.

So in this case we just iterate, and once it's finished, we can print the calculated parameters: the weights and the bias, which is ultimately what we ended up with. And just to emphasize: we covered this as a high-level example, but as I mentioned, we have blog posts that cover it, one in Python and one in C++, breaking it down in minute detail, and we have other tutorials and examples that cover how to use this, including the C++ interface as opposed to the Python one, for integrating with your Android apps as well as with game engines like the Godot engine, which we would recommend checking out.

More than anything, what I recommend is to get involved. If you go to github.com/EthicalML/vulkan-kompute you can check out the open issues; you can take one of the issues labeled good-first-issue, and there's also issue number 52, which is open for general discussion, so if you have ideas for improvements or questions you can post them there, and we've had some really interesting suggestions. Some of the key things on the roadmap: one of the main motivations for building this framework is to integrate it as a backend of an existing scientific computing framework, one that is potentially even being used for mobile machine learning or other types of use cases, so we're definitely interested in that, and if someone is running a scientific computing library, we're open to exploring it. Also on the roadmap is creating more default operations, something like a fast Fourier transform or a parallel sum reduction; out-of-the-box operations like that would be really cool to have, perhaps even written in C++ but also
exposed as Python. And then also adding examples: if you try this with a new shader, a new algorithm, or a new type of machine learning model, we would love for you to contribute it upstream and add it to the repo, because I think that would be very cool.

With that said, I think that's everything we had to cover today. Thank you very much for joining this talk on Beyond CUDA: GPU Accelerated Python on Cross-Vendor Graphics Cards with Vulkan and Kompute. I'm really looking forward to exploring and hearing your thoughts, ideas, and suggestions, and if you have any questions, please feel free to reach out. Thank you very much.
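As a closing illustration, here is a CPU-only reference in plain Python of the training loop described in the talk. This is not the Kompute API: the shader math is reproduced with ordinary floats, the fourth and fifth data points and the learning rate are illustrative assumptions, and the iteration count is raised from the talk's 100 for a clean result.

```python
import math

# Training data in the shape used in the talk (an OR-like function):
# (0,0) -> 0, (0,1) -> 1, (1,0) -> 1, (1,1) -> 1; five points total.
xs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (1.0, 1.0)]
ys = [0.0, 1.0, 1.0, 1.0, 1.0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# The parameters we are learning, as in the talk: two weights and a bias.
w1, w2, b = 0.0, 0.0, 0.0
learning_rate = 1.0  # how fast the parameters move on each iteration

for _ in range(1000):
    # On the GPU, each element of the micro-batch would be handled by its
    # own shader invocation in parallel; on the CPU we simply loop.
    dw1 = dw2 = db = 0.0
    for (x1, x2), y in zip(xs, ys):
        y_hat = sigmoid(w1 * x1 + w2 * x2 + b)
        err = y_hat - y            # derivative of the loss w.r.t. the logit
        dw1 += err * x1
        dw2 += err * x2
        db += err
    n = len(xs)
    w1 -= learning_rate * dw1 / n  # the host-side parameter update done
    w2 -= learning_rate * dw2 / n  # after each evaluation of the sequence
    b -= learning_rate * db / n

predictions = [round(sigmoid(w1 * x1 + w2 * x2 + b)) for x1, x2 in xs]
print(predictions)  # the learned parameters reproduce the labels: [0, 1, 1, 1, 1]
```

The structure mirrors the talk: the inner loop is what the recorded shader computes per micro-batch element, and the parameter update is what happens on the host after the sequence's sync-to-local step makes the gradients visible to Python again.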