Hi everyone, my name is William. This talk is CUDA in Your Python, so I'm going to talk a little bit about how we can start programming on the GPU from the safety of a Python program, in a language we're familiar with. I want to start off with some bad news: Moore's Law is dead. This is kind of a tough one. So what was Moore's Law? Basically the idea that the number of transistors you can fit on an integrated circuit would double every two years. Initially it was every one year, actually; he had to revise that. And this is Gordon Moore, namesake of the law. He based it on this data, the graph on the left side. It's a log scale, so the linear trend means the count is doubling. You can see this held through the 50s, 60s, 70s, and with actual data it went up all the way to 2016; the relationship pretty much holds, doubling and doubling, up until 2016. But honestly, we're starting to hit a plateau. A lot of people like to write about it; there are headlines popping up saying Moore's Law is dead, all these things. And you might ask, is this really true? And I'll say, I trust this guy. This guy says, "I guess I see Moore's Law dying here in the next decade or so. But that's not surprising." And who is this guy? This is Gordon Moore. He said that in 2015. So if he said it, I think we can start to buy into the idea that over the next five years or so we're going to see a plateau in getting more and more transistors onto these chips.

So this leads us to the question: why am I up here talking to you about GPUs, and how can they help us combat this problem we're facing? A little bit about the history of the GPU. It's a graphics processing unit, originally developed for gaming. The typical workload involves a lot of arithmetic on a lot of pixels, as you might imagine, or a lot of objects in a frame, to do rendering and shading. And it's specialized for matrix operations, because a representation of a scene in a game could be a 2D matrix of pixels, or even a 3D matrix, and then you're performing transformations on it, operations like that. So that's the background of why people started manufacturing these devices in the first place. To understand the differences, and how we might get some benefit from computation on the GPU, we can look at the specs of two top-tier consumer-grade chips, a GPU versus a CPU. On one side we have NVIDIA's 2080 Ti, which came out pretty recently, and on the other side Intel's i9-9900K. Looking at the specs, on the one hand the GPU has a ton more cores: over 4,000 CUDA cores across 68 streaming multiprocessors, versus the Intel CPU's 8 cores and up to 16 hyperthreads. That's still very small in comparison to the GPU. But on the other hand, if you look at the base clock and the boost clock, the CPU is achieving roughly three times the clock speed of the GPU. So if we look at this diagram of the architecture, we can start to understand why these things are the way they are, and how the GPU can help us with certain workloads. So basically, you have the CPU architecture.
You can see a large amount of it is dedicated to control and caching. These are two important pieces of how a CPU operates: you want it to interpret your if statements and your while loops, and to cache memory close to it so it doesn't have to go out to RAM. And relatively few of the transistors are allocated to arithmetic units, which are the green parts. In the GPU diagram, relatively few of the transistors are dedicated to control and caching, so more and more can be dedicated to arithmetic operations, which is why, for simple arithmetic that doesn't involve a lot of control flow, the GPU can end up running a lot faster. All of those factors have led to the rise of what's called GPGPU, the idea that we can do general-purpose computing on a GPU and not just use it for specialized purposes like video game graphics. This quote is from a paper published by NVIDIA in 2014, and it was basically saying that in the past we thought these devices were only good for gaming and graphics, but now we can start to think of them as what they really are: parallel processors. And a bunch of the GPU companies started making their own models of how to program these things. CUDA, which is what I'm going to talk about today, was developed by NVIDIA, but AMD had APP, and OpenCL is a growing open standard. These are different ways people have come up with to turn a device with certain desirable properties into something we can run general-purpose workloads on.

So this gets into: why am I interested in this? How did I come to be giving this talk? I'll start with my work. I work as an engineer on the data team at Compass. Compass is a real estate technology platform, basically bringing together real estate agents with engineers who are building technology to empower them. What I specifically work on is bringing in listing data from a bunch of different geographies, and then we transform it; we perform a lot of conversions and normalization. So we use a lot of common tools you might see in data pipelines and data processing, including Spark, Kafka, and Airflow. But I also like to keep my ear to the ground in terms of what's going on in the industry around this tooling, and you start to read some interesting things. This is RAPIDS AI, which is partnered with NVIDIA, and their concept is that GPUs might start to be used for the entire data processing workflow, not just for, say, model training, which is where they've already started to be used. So you might actually be able to do your pre-processing, even visualization, using the power of GPUs. That's something in development. We might have GPU databases: just this year, Uber built a GPU-powered database for doing analytics. So even in terms of what I work on professionally, there are a couple of advancements that might be putting GPUs into use for these kinds of data processing workloads. But it also relates to my hobbies, which include deep learning. I got hooked on it through the fast.ai deep learning course and started doing some competitions on Kaggle, and that's what got me thinking about these hardware questions. This is my computer at home that I put together.
Basically, after leaving a GPU running on AWS for one weekend, you get a pretty big bill and you're like, okay, maybe I should start to think about building one myself. Maybe some people have been there before. So this is what got me thinking about GPUs and CUDA on a more general level, not just related to deep learning. Specifically, I made a small pull request to PyTorch last year, and that's part of what inspired this talk: it was moving one function from their Python code into the C++ implementation. Just seeing how that library was put together, and how they were able to merge a Python API with C++ and also CUDA programming, was really fascinating to me, and I wanted to dive in and learn more. So that was part of the origin of this talk. Which finally brings me to the question everybody probably wanted answered: how can you start programming on the GPU? For an example, I've started out with NumPy. This would run on the CPU, and those of you who are familiar with NumPy will see we're creating two random vectors of 10 million numbers and just adding them together, right? This is something you could run, and NumPy is pretty good at this; when I benchmarked it, the time wasn't too bad. So let's see what the equivalent code would look like on the GPU. And, well, that's not much of a change, right? I'll go back to the other slide. And there's the other slide. And you're like, whoa. This is from a library called CuPy, basically, and it tries to mirror the NumPy API, and it lets you start to take advantage of GPU processing in a pretty straightforward way. Like I said, one more time: here's NumPy, here's CuPy. But when you benchmark it, you can see that just that switch gives you about a 30x speedup. And that's accounting for things like the fact that the GPU executes asynchronously, so you want to make sure you synchronize it. You're seeing about a 30x speedup, again, just from this change. And that's part of why I think this is so cool, because how many people out there use NumPy for something, right? It's a pretty common tool in Python. So if there's a drop-in replacement that would let you get these kinds of speedups, that might really help your workflow.

And that brings me to the outline of this talk. I'm going to talk about different approaches to doing CUDA programming from your Python application. One, which I just showed you, is a drop-in replacement. Two is basically taking CUDA strings and compiling them in a Python program. And the third one, which is the most complex, is actually building it as a C++ extension to Python, which Python allows you to do. And as with a lot of things in programming and software development, these are increasing levels of complexity, but what comes along with that complexity and initial setup is additional flexibility. It might unlock features of the CUDA platform that aren't necessarily available in, say, a drop-in replacement, but that you might be able to access by rolling your own. So, to talk about the drop-in replacement: the library I showed was, like I said, CuPy, which is built as a drop-in replacement for NumPy. It was originally developed for a deep learning framework called Chainer.
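To make that concrete, here's a rough sketch of what that comparison might look like. This is my own reconstruction, not the exact code from the slides; the 10-million-element vectors and the explicit synchronize call are the points from the talk, and the warm-up run is just good benchmarking practice:

```python
import time

import numpy as np
import cupy as cp

n = 10_000_000

# CPU version: two random vectors of 10 million floats, added with NumPy.
a_cpu = np.random.rand(n).astype(np.float32)
b_cpu = np.random.rand(n).astype(np.float32)

start = time.perf_counter()
c_cpu = a_cpu + b_cpu
cpu_time = time.perf_counter() - start

# GPU version: the same code with np swapped for cp.
a_gpu = cp.random.rand(n).astype(cp.float32)
b_gpu = cp.random.rand(n).astype(cp.float32)

# Warm-up: the first call may include kernel compilation, so run it once untimed.
_ = a_gpu + b_gpu
cp.cuda.Stream.null.synchronize()

start = time.perf_counter()
c_gpu = a_gpu + b_gpu
# CuPy launches work asynchronously, so wait for the GPU to finish before
# stopping the timer -- otherwise the GPU time looks unrealistically small.
cp.cuda.Stream.null.synchronize()
gpu_time = time.perf_counter() - start

print(f"NumPy: {cpu_time:.4f}s  CuPy: {gpu_time:.4f}s")
```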
It supports a lot of NumPy features, some of which are pretty complex, like the whole indexing system. I'm pretty sure they have it working just like NumPy, and NumPy does a lot of crazy things with that. It supports a bunch of different data types, broadcasting. But there are a couple of gotchas. So those of you who raised your hands, who might want to think about dropping this into your program, these are things to watch out for. First, it can't use data types that are strings or objects. This makes sense if you think about the diagram I showed you: there's a ton of arithmetic units on the GPU, so they're going to be working on numbers, but they're not so good for these other kinds of data. The other thing is array creation. In NumPy you can call numpy.array on a list and it'll turn it into an array; you can't do that in CuPy. And the last one, which really might trip you up: if you sum a NumPy array, it returns a scalar, just one value, but in CuPy it returns a zero-dimensional array, which is going to behave a little differently. All that is to say, these libraries are good and I would definitely encourage you to try them out, but keep in mind that there are certain things that might trip you up; it's not quite going to be a total drop-in replacement.

So the second way to do it, like I said, is that you can compile CUDA strings in your Python application. Before we talk about that, we have to talk about the CUDA API, because compared to the drop-in approach, you're actually going to have to write some CUDA code. This is a diagram of the basic building blocks of how CUDA programming is done. At the top you've got a grid, you've got blocks inside that, and then you've got threads inside your blocks. To break it down a little: threads are the things that actually execute CUDA kernels, and they have a thread index, which is used to specify which part of the data that thread is meant to work on. This is from the body of a particular CUDA function, and you can see that if you're executing over a two-dimensional matrix, you can have a thread index in the x and y axes, and that lets you say this thread is meant to take this element of the matrix and add it. You can imagine that this simplifies the logic: if you've got a thousand threads, and if you've ever tried to write parallel code, there's a lot of housekeeping that goes into keeping track of these thread indexes and which thread touches which part of the matrix, and a lot of that is built into the CUDA paradigm for you. Then blocks. A block is the next level up, a group of threads. The important thing is that blocks are required to be able to execute independently, but threads within a block can share data; there's block-level shared memory. So if you do need to do some synchronization between your threads, that is possible at the block level. And just like threads, a block also has dimensions and an index, so you can use the block index as well as the size of the block if you need to do more complex computations of which specific bit of the matrix a thread is supposed to operate on. A grid is nothing too fancy: it's a group of blocks. So going back to the diagram, we have these different levels and layers of being able to achieve parallelism.
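Here's a small illustration of those gotchas. This is my own example, not from the slides, and exact behavior can vary between CuPy versions (newer releases have closed some of these gaps), so treat it as a sketch:

```python
import numpy as np
import cupy as cp

# Gotcha 1: string / object dtypes only exist on the NumPy side.
np.array(["a", "b"])                 # fine in NumPy
# cp.array(["a", "b"])               # raises -- CuPy arrays are numeric only

# Gotcha 2: creating an array straight from a Python list works in NumPy,
# but (at least in the CuPy version from the time of this talk) not in CuPy.
np.array([1, 2, 3])                  # fine in NumPy
# cp.array([1, 2, 3])                # may fail; going through NumPy first is safe:
cp.asarray(np.array([1, 2, 3]))

# Gotcha 3: reductions return a scalar in NumPy but a 0-d array in CuPy.
s_np = np.arange(10).sum()           # NumPy scalar
s_cp = cp.arange(10).sum()           # zero-dimensional cupy.ndarray
print(type(s_np), type(s_cp))
print(int(s_cp))                     # .item() or int() pulls the value back to the host
```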
So from the bottom to the top, we've got threads, which execute the CUDA code, organized into blocks, which can themselves be arranged in a two-dimensional or three-dimensional space. You see the (0, 0) there: that's the index, so that block is indexed at (0, 0), then (1, 0), and the grid itself is also a two-dimensional group of blocks. So those are threads, blocks, and grids. The other piece of it, when you get around to actually executing CUDA code, is kernels. A kernel is basically C or C++ code with a bit of extra syntax added. Specifically, there's this __global__ identifier that you use to mark your kernel function, and then there's an angle-bracket syntax that lets you specify the grid size and the block size. So if we go back here, this is obviously a parameter you have control over: you can say how many blocks go in my grid and how many threads go in my block, and that's one way to tune the performance. By using that syntax, you can play around with it and see what's going to be most effective. So this is one example of what a kernel might look like. The __global__ identifier at the top is saying this is the kernel, this is what's going to execute on the GPU, whereas the main function is what executes on the CPU. And that's the relationship in a CUDA program: you have your host, which is the CPU, and your device, which is the term for the GPU. Your program starts executing on the CPU, and as soon as it gets to one of these kernels that's going to be executed on the device, it calls it, communicates, and does the data transfer from the CPU to the GPU to be able to run it. And you can see here at the top that the kernel is making use of the terms I was talking about, block index, block dimension, and thread index, to figure out, in this case where it's adding two two-dimensional matrices, which piece of the matrix the thread is actually meant to work on. So that's the high-level overview of the CUDA API.

So when we get to PyCUDA: this was built by a researcher and it's used for a lot of scientific and research projects. There's even a research paper about the library, which is not necessarily something you see for all the code you find on GitHub; I was a little surprised. So what does it do? This is example code that gets to the main thing it gives you as a programmer, which is that you can take the CUDA kernel code, so if we go back here, this is what a kernel looks like, and this __global__ function is what's going to execute on the device, on the GPU. You can pull that out and, within your Python program, supply it as a string to this SourceModule object. Then there's this get_function call, and that basically compiles it as a GPU kernel and pulls it into your Python program so you can call it over your objects. So this is one way that lets you start to write CUDA code without having to step out of a Python environment, because all the objects and whatnot are still going to be just like in a Python program. And one of the nice things is that in PyCUDA you get automatic memory management: once something goes out of scope, or once you delete it, it actually frees the allocated memory on the GPU.
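To make that concrete, here's a minimal PyCUDA sketch of my own (not the code from the slides): the kernel is a CUDA C string handed to SourceModule, get_function pulls it back into Python, and the explicit allocation and copy steps are the kind of housekeeping that the higher-level helpers I'll mention next can take care of for you:

```python
import numpy as np
import pycuda.autoinit               # sets up a CUDA context on import
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# The kernel is plain CUDA C handed over as a Python string; __global__ marks
# the function that runs on the device, and each thread uses its block and
# thread index to pick the one element it is responsible for.
mod = SourceModule("""
__global__ void add(float *out, const float *a, const float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i] + b[i];
}
""")
add = mod.get_function("add")

n = 1 << 20
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.empty_like(a)

# The manual way: allocate device memory, copy inputs over, launch, copy back.
a_gpu = drv.mem_alloc(a.nbytes)
b_gpu = drv.mem_alloc(b.nbytes)
out_gpu = drv.mem_alloc(out.nbytes)
drv.memcpy_htod(a_gpu, a)
drv.memcpy_htod(b_gpu, b)

threads_per_block = 256
blocks = n // threads_per_block      # plays the role of the <<<grid, block>>> syntax
add(out_gpu, a_gpu, b_gpu, np.int32(n),
    block=(threads_per_block, 1, 1), grid=(blocks, 1))

drv.memcpy_dtoh(out, out_gpu)
assert np.allclose(out, a + b)
```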
And when we get to C extensions, the most manual way to do it, you'll see that that can actually save you a lot of hassle. It also provides these In, Out, and InOut classes to describe your arrays or matrices, and they handle memory transfer between the CPU and GPU. Otherwise there are a lot of steps you'd have to go through: say you have a NumPy array on your CPU and you want to double all the elements, these are all the steps you'd have to do to accomplish that, and there are a lot of them, in terms of allocating memory and moving data across. So PyCUDA has these higher-level abstractions where In basically means this array is meant to go to the GPU, and it handles that for you. InOut is a little bit more: you're saying this NumPy array is meant to go to the GPU, be processed in your CUDA kernel, and then come back out. As long as you're okay with that happening automatically, PyCUDA can handle it behind the scenes. The last thing, which I think is really useful, is automatic error checking. Because some CUDA operations execute asynchronously, collecting and surfacing errors can be really challenging. This is from the documentation: "If an asynchronous error occurs, it will be reported by some subsequent unrelated runtime function call." That sounds pretty confusing to me; okay, if an error happens in one thing, basically the next function you call is going to error. PyCUDA handles that for you too and raises errors as specific Python exceptions, in the paradigm you might be more familiar with. And finally, another interesting bit is that you can do metaprogramming. When I was talking about threads, blocks, and grids, I said you have to tune those parameters, like how many threads in a block and how many blocks in a grid, just like you might with parallel programming on a CPU. It's often done with heuristics, but PyCUDA basically says, forget heuristics, we can determine how to set these parameters empirically, by actually running things. This is one example, which is using Jinja templating. Like I said, because the module is a string in Python, it follows that any way you can create a Python string, you can pass it in and compile it. It's a clever, kind of crazy way of doing it, but you can pass in different parameters: you can see it's parameterizing over the float type, and you can pass in a different thread block size. It first renders your template into a string and then just-in-time compiles it into a PyCUDA kernel; there's a rough sketch of that idea below. That's really cool, because especially as you're playing around with this, like this week when I was trying to compile different things and see how they ran, this makes the loop a lot faster: you can just iterate through different configurations in Python and see the results empirically.

Okay, so then the last approach, which like I said is the most manual but might give you the most flexibility, is CUDA as a C extension. First, let's talk about Python C extensions. Because the Python interpreter is C-based, you can extend it with C and C++ programs; you can add new modules of your own design.
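Before getting into the C extension route, here is a rough sketch of that templating idea. It's my own example with a hypothetical "scale" kernel, not the one from the slides: the value type and block size are rendered into the kernel string with Jinja, and each configuration is compiled and timed with CUDA events so you can pick the fastest one empirically:

```python
import numpy as np
import jinja2
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Kernel source as a Jinja template: the value type and block size are
# filled in from Python before the string ever reaches the CUDA compiler.
kernel_template = jinja2.Template("""
__global__ void scale({{ value_type }} *data, {{ value_type }} factor)
{
    int i = blockIdx.x * {{ block_size }} + threadIdx.x;
    data[i] *= factor;
}
""")

def build_scale_kernel(value_type="float", block_size=256):
    source = kernel_template.render(value_type=value_type, block_size=block_size)
    return SourceModule(source).get_function("scale")

data = np.random.rand(1 << 20).astype(np.float32)

# Try a few block sizes, just-in-time compile each variant, and time it.
for block_size in (64, 128, 256, 512):
    scale = build_scale_kernel(block_size=block_size)
    gpu_data = drv.to_device(data)
    start, end = drv.Event(), drv.Event()
    start.record()
    scale(gpu_data, np.float32(2.0),
          block=(block_size, 1, 1), grid=(data.size // block_size, 1))
    end.record()
    end.synchronize()
    print(block_size, "->", start.time_till(end), "ms")
```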
And that's used in a lot of programs that want to achieve better performance even on the CPU. For example, NumPy links into C code under the hood to achieve the kind of performance it does. So this is already a paradigm used in high-performance computing settings. The question, then, is: okay, I know I can get my C or C++ program into Python as an extension, so how do I get my CUDA program into C or C++? CUDA comes in different forms, but what I've been talking about is referred to as CUDA C, which is basically C with some special syntax. NVIDIA provides the NVCC compiler, which takes your CUDA C source code and does a couple of things. One, it turns the kernels into assembly or binary for the operations on the GPU. It takes the special syntax and replaces it with runtime calls. And finally, you can also have it compile the host (CPU) code. NVCC can do all of that itself, so there's just one compilation step that generates the GPU output as well as the code you need to run on the CPU to launch the kernel. On the Python side it can get very complex. This is a very good GitHub repo; it cheats a little bit and uses Cython, but there are a couple of options you can use when you're going between C++ and Python. Cython is one of them, and SWIG is another; there's an example of it in that repo as well. Once you find some way of creating your extension, you need to use setuptools to compile and link it all together. This code is also from that repository: in your setup.py you create an Extension, and you can specify the sources as well as how to compile them. You can see here that we're pulling in the CUDA libraries, and using NVCC as well as GCC to compile everything together and package it up, and that will then link into our Python program. (There's a rough sketch of that kind of wiring in a moment.) This part gets very complex; if you're interested, I'd encourage you to check out that code, because they put together some clever tricks to actually get all of this to come together.

So why would you want to do that? I just said it's really complex. One reason is manual memory management, which you might see as a downside: we're using Python, why do I want to manage my own memory? This is the constructor of that GPU adder class, and you see mallocs, and you're like, whoa. And here's the destructor, so you have to call free, and heaven forbid you don't. So why would you want this? I think there is a benefit, because there are certain manual memory management features in CUDA, some advanced things. Mapped memory is, I think, the most interesting: you can map memory between your host, the CPU again, and the device, and access it without having to do an explicit data transfer. That can be pretty cool. Basically, there are a couple of features that might only be accessible once you get down to the level of doing this kind of stuff. The other thing is, and I know this might be heresy at a Python conference, but you do get a compiler. And I would say the nice thing about that is that you're writing in a language that's unfamiliar to you, right?
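Just as an illustration, here's a stripped-down sketch of what that setup.py wiring might look like. This is not the code from that repository (which uses Cython and fancier tricks); the module name, source files, and CUDA install path here are all hypothetical, and the idea is simply to compile the .cu file with nvcc and let setuptools link the result in:

```python
# setup.py -- rough sketch only; module name, file names, and CUDA path are hypothetical.
import subprocess
from setuptools import setup, Extension
from setuptools.command.build_ext import build_ext

CUDA_HOME = "/usr/local/cuda"

class BuildWithNvcc(build_ext):
    """Compile .cu sources with nvcc first, then let setuptools link the objects in."""
    def build_extensions(self):
        for ext in self.extensions:
            cu_sources = [s for s in ext.sources if s.endswith(".cu")]
            ext.sources = [s for s in ext.sources if not s.endswith(".cu")]
            for cu in cu_sources:
                obj = cu.replace(".cu", ".o")
                # -Xcompiler -fPIC: position-independent code, needed for a shared library
                subprocess.check_call(["nvcc", "-c", cu, "-o", obj, "-Xcompiler", "-fPIC"])
                ext.extra_objects.append(obj)
        super().build_extensions()

ext = Extension(
    "gpu_adder",                              # hypothetical extension module name
    sources=["wrapper.cpp", "gpu_adder.cu"],  # C++ wrapper plus the CUDA kernel file
    include_dirs=[CUDA_HOME + "/include"],
    library_dirs=[CUDA_HOME + "/lib64"],
    libraries=["cudart"],                     # link against the CUDA runtime
)

setup(name="gpu_adder", ext_modules=[ext], cmdclass={"build_ext": BuildWithNvcc})
```

Once it builds, `import gpu_adder` works like any other Python module.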
And at least for me, I don't do a ton of CUDA programming; even when I use it, I'm mostly in Python. So being able to have the compiler tell you this went wrong on these lines is helpful: when I was tweaking this C extension stuff, I actually felt kind of grateful for the compiler. That can be nice. So to conclude, I want to talk a little bit about accessing a GPU, since you might be interested in playing around with some of this stuff yourself. One, Google Colab is really awesome: it's a browser-based notebook interface, it's free, and it gives you access to a GPU. For some of the more custom stuff, I don't think you'll be able to link extensions on it, but you can definitely install CuPy and play around with that, and PyCUDA you should also be able to install. And that's free. Even cloud GPU instances are starting to become accessible. On AWS you can get one for less than a dollar an hour if you just want to spend two or three or four hours playing around, or there's Google Cloud, which prices it a little differently: that's per GPU, and I think you also need an instance to attach it to, but it should still be under a dollar an hour, if this stuff interests you and you want to play around with it.

So the last question I want to talk about is: where do you go next? You've sat through this talk, I've subjected you to it, so what can you do with it? One, I would say you could start to think about how you can apply CUDA and GPU programming to your workflow. Like I said, that's a really active area of development with a lot of things coming out, especially if your work has anything to do with data pipelines and processing, or obviously machine learning and deep learning, where it's already being done. So that's something to start thinking about. The other thing is, now you have access to something like 4,000 cores; what do you do with them? You can start learning more about parallel programming and parallel algorithms, and how to make use of all that most effectively, because you can do cooler things with it than just adding 10 million numbers together. And then, if this stuff excites you, there's also a whole bunch of other kinds of devices to start thinking about. People talk about the XPU concept: Google has TPUs, which are built specifically for deep learning, and people are starting to come up with all sorts of devices. You can also go a different direction and look at FPGAs, which is basically building hardware to execute certain algorithms. So there's a whole device side of it too: getting beyond the CPU paradigm and figuring out, can I come up with a specialized device that's better than one single general smart thing? That's what I would suggest if you found this interesting. Thank you so much. There's my Twitter; I'll definitely post the slides and related code on there, so if you have any more questions, reach out. Thank you so much.