Let's get started; I guess you can all hear me. My talk is about writing extensions and bindings for GPUs in Python. What do I do? I'm a data scientist, or research scientist, at Rakuten, working in a research organization within Rakuten called RIT, the Rakuten Institute of Technology. Let me go over the goals of this talk. First, we'll show how to extend Python using native C/C++ code, and then we'll show how you can write GPU CUDA code, wrap it, and call it from your Python code.

So why did I have to solve this problem in my job, and what was the motivation? At Rakuten I work on a fashion recommendation system, and there I had the task of computing distances between pairs of feature vectors, that is, pairs of fashion items. If you do a naive pairwise distance comparison, it invariably takes too much time. There is the FAISS library from the Facebook team, which does pairwise distance comparison on GPUs and is pretty fast, but there are certain issues with it: for one, the feature-vector size it expects is 128 or larger, and we had a resource constraint on the memory we could spend storing our feature vectors. So we had to devise an algorithm ourselves that used less memory. Basically, we wanted to implement pairwise distance computation in CUDA with significantly less memory, and we also wanted it to be really fast. Initially I tried doing it in Python, but it was too slow, as I said. Then I tried libraries like PyTorch and SciPy, but I kept finding that if I wanted more speed, I had to write CUDA code; there was no getting away from that fact.

One of the things I wanted to show here is a comparison on the k-means algorithm. For the comparison I have two data sets, one with 1 million data points and one with only 100 data points, and this is the speed of various technologies: plain Python, C++, C++ using Eigen (Eigen is a template-based library for matrix manipulation; you can think of it as an STL for matrix work, and it's pretty fast), CUDA, scikit-learn, and SciPy. Scikit-learn and SciPy are pretty standard Python libraries; they're ubiquitous in scientific computing. If you look at n = 100 points, you can see that Python takes around 0.014 seconds, and compared to that, C/C++ is quite a bit lower. But if you look at CUDA, the time is actually higher. One thing CUDA is not really good at is running on a very small data set, and the main reason is that the bottleneck when you develop CUDA applications is the data transfer time between CPU memory and GPU memory. That's why it takes so long there.
One simple thing you can easily see from this chart is that C++ is much better than Python, even compared to optimized libraries like SciPy and scikit-learn, which internally end up calling C/C++ code anyway. But now look at n equal to a million data points for the k-means algorithm (k was set the same across these experiments). The nice thing about this chart is that you can't even see the CUDA bar, because CUDA takes milliseconds to compute this, while Python takes more than seven seconds. And think about what that means: for pairwise distances you have to compute a million-by-million matrix, and you have to recompute it again and again, because the feature vectors keep changing; the uploaders change the images all the time by adding things, so we have to recompute the feature vectors repeatedly. Doing the pairwise computation over and over would be prohibitively expensive, and that's where you can see the need for C++ or CUDA in any bottleneck.

Here is a list of the pros and cons of writing extensions. The first pro is definitely speed. Another is that you can use optimized native C and C++ library functions; most BLAS implementations are written in C or C++, and you can use them directly. You also have no GIL: most of the speed issues in Python come from the GIL, and without it you can parallelize your code pretty easily. And you have more control over memory. The reasons not to write extensions are pretty much the reasons you don't write C and C++ in the first place: development is hard, and there are memory-safety issues, unless you are a 10x programmer.

Now let me go over how you write a simple CPU extension. First you write the native code; you can think of it as the kernel code, written in C++ (a .cpp file). What I've done here is the addition of two vectors A and B, putting the result in C; it's just a for loop (a rough sketch of it is shown just below this part). You could do this in pure Python as well, but it would be too slow, because loops in Python are really slow, and wherever we can vectorize, we should vectorize. Then how do you wrap this? I'll talk a bit later about the various technologies available for wrapping C code so it can be called from Python. Here I've shown an example using Cython; in my GitHub repository there are other examples using Cython, pybind11, and ctypes, but due to space limitations I'm showing just this one. So how do you write it? In Cython you do "cimport numpy as np", which gives you direct access to NumPy data types, so you can accept them in the Python-facing function and pass them on to your C function. Then you define cpdef adder_cpu, which takes arguments typed as doubles; essentially, each one is just a pointer to a NumPy memory buffer.
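As a rough sketch of the native loop just described (the function name adder_cpu and the exact signature are my guesses here, not necessarily the ones from the talk's repository), it could look something like this:

```cpp
// Minimal sketch of the native "kernel": element-wise addition of two
// vectors, just a sequential for loop on the CPU. The Cython cpdef wrapper
// described above would receive NumPy arrays and hand these raw double
// pointers through to this function.
extern "C" void adder_cpu(const double* a, const double* b, double* c, int n) {
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];  // one addition per element
    }
}
```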
That NumPy buffer is what actually gets passed into adder_cpu, which stores it in plain C data types and hands it on to the native code. Then how do you set it up? For distributing it, it's pretty simple: in setup.py you just add one Extension entry listing the sources (the wrapper and the adder_cpu source), and it builds seamlessly when you run python setup.py.

After this, let me go over what a GPU is and how you write code for one. I like to think of a GPU as a CPU with thousands of cores: if your CPU has 4 or 16 cores, a GPU has on the order of a thousand times more, and that's why you can parallelize your computation easily. It's also called SIMD, single instruction, multiple data: the same instruction is run over multiple pieces of data. But there is obviously no free lunch. GPUs are not great at control flow: if there are many if/else branches in your code, something called warp divergence happens, and the GPU will be very slow there. GPUs are very good at single-instruction-multiple-data work because they have many arithmetic units, so they can do arithmetic really fast. Another thing they are really short on is DRAM, the memory a GPU has: a GPU mostly has something like 16 GB to 32 GB of RAM in the expensive cases, so you are mostly constrained by the amount of data you can fit on your GPU rather than by how much computation you can do.

Let me go over some terminology before showing how the code is written. If you read any GPU blogs on nvidia.com, they typically talk about SMs: GPUs are made up of SMs, streaming multiprocessors. When you launch work on the GPU, you launch a grid; the grid is divided into blocks, and there is a limit to how many blocks you can have. Think of a block as a computational unit, a mini processor in some sense, and the blocks get scheduled onto the SMs. Blocks can be 2D or 1D; here I've shown a grid that contains many 2D blocks, indexed accordingly as (0,0), (1,0), (2,0), then (0,1), (1,1), and so on. A block is then made up of threads, and threads are the smallest unit of processing on a GPU. Conceptually that's the picture, although there is also something called warps: warps are groups of 32 threads that execute together, and they mostly matter when you are doing really hardcore performance optimization; otherwise warps are not that important. Threads are roughly like your CPU threads: a thread runs a single program on a single data point. Threads can also be 2D or 1D. So a CUDA grid is made up of blocks, and blocks are made up of threads; the smallest unit is a thread, which is indexed, say, (0,0) inside block (0,0) of the grid. That's how you index the threads, as in the sketch below.
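To make the grid/block/thread indexing concrete, here is a tiny CUDA sketch; the launch dimensions and the kernel name are arbitrary, purely for illustration:

```cuda
#include <cstdio>

// Each thread prints which block of the grid it lives in and which thread
// of that block it is, matching the (x, y) indexing described above.
__global__ void show_indices() {
    printf("block (%d,%d), thread (%d,%d)\n",
           blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y);
}

int main() {
    dim3 grid(3, 2);   // a 3 x 2 grid of blocks
    dim3 block(4, 4);  // 4 x 4 threads per block
    show_indices<<<grid, block>>>();
    cudaDeviceSynchronize();  // wait for the kernel so the output is flushed
    return 0;
}
```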
So you need to get the thread index. Let me quote a comment by a person called Mark Newman, who explains it nicely: a GPU runs grids, which are in turn made up of blocks, and blocks are made up of threads. You can then say that threads are somewhat equivalent to cores on a CPU: if you have six cores in your CPU but 20,000 threads on your GPU, you essentially have 20,000 cores on your GPU. Synchronization is done at the block level on GPUs, and so is cooperation. What do I mean by cooperation? It essentially means that thread (0,0) can share its data with thread (3,2). That's how cooperation works on GPUs: threads in a block have a shared memory that they can write to and read from, perform computations on, and write back to. That's how synchronization is done; they share the same shared memory. Shared memory is a very important concept in GPUs: when you are writing performance-optimized code, for example a matrix transpose, you will definitely be using shared memory.

We'll look at some code shortly. But suppose you do not want to write this code directly in the native language and would prefer to write it in Python; there are some alternatives for that. One of them is Numba. Numba is open source and under pretty active development. It has both CPU and GPU support, and it's pretty fast: you can vectorize your code on CPUs or write it for GPUs, which is really great, and it has access to libraries like cuFFT and the other CUDA libraries, which become very important once you start doing real work. Then PyCUDA has the full CUDA API available, and it encourages safe code because objects have a lifetime attached to them; this is the concept called RAII, resource acquisition is initialization (I don't always remember the expansion), which is very prevalent with C and C++ users: when an object is no longer needed, it is actually deleted. Then you have CuPy. CuPy is NumPy for GPUs: it tries to replicate every NumPy function on the GPU, and it interoperates with Numba.

One of the issues with all these libraries (I tried Numba and CuPy in particular) is that they are harder to debug. And if you want to write something that cannot be expressed in Numba or CuPy, you end up writing CUDA kernels anyway and importing them into CuPy as strings, which is really akin to writing custom kernels in CUDA. For Numba, the thing is that if you want to write performance-optimized code, you cannot rely on Numba alone; you have to actually understand the CUDA architecture and write CUDA code. There is no other way, as I found out.

So let's go over a similar example, this time as CUDA code. Here again I'm trying to add two vectors and store the result. What I need here is the thread index; as I said, the thread is the basic unit of computation on a GPU. Suppose A is a vector of size 256: then I can launch one block of 256 threads, so every thread is responsible for one data point. Thread 0 is responsible for A[0], thread 2 is responsible for A[2], and so on. Once you get the thread index, you can use that index to pick out the element of A, as in the kernel sketch below.
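A minimal sketch of that kernel, assuming the same vector-addition example (the name adder_gpu is mine, not necessarily the one from the talk's repository):

```cuda
// Each thread computes its global index from its block and thread indices
// and adds one pair of elements. The bounds check keeps the extra threads
// in the last block from running past the end of the arrays.
__global__ void adder_gpu(const double* a, const double* b, double* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
```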
Then each thread adds its pair of elements, so the computation is done in parallel. It's pretty simple code. This is the kernel function that actually runs on the device, which is why it's marked as global (__global__). But how do you actually run it? There is something called a driver function that calls the kernel above (a rough sketch of it is shown at the end of this section). First you need to allocate memory on the device; device memory is your GPU memory. Here we first allocate memory on the device using cudaMalloc for dev_a, dev_b, and dev_c, which will hold the data and the results on the GPU. Then we copy the bytes from host memory, the double array a, to device memory, dev_a; that's the cudaMemcpy with host-to-device. Then we set the number of threads to 256, because we want 256 threads per block; your array can be much longer than 256. That's where this formula comes into the picture: blocks = (n - 1) / threads + 1. What it does is divide your data so that every element is covered: if n were 512, you would need two blocks rather than one. Then you do the kernel invocation with the launch syntax; basically you tell the GPU how many blocks and threads you require, the GPU assigns them to you, and you pass in the pointers to where the data is stored. Afterwards you copy the answer back from dev_c to c; that's the device-to-host copy.

The wrapper is very similar to the wrapper in the CPU case, because we are again dealing with pointers: we just take the pointers from the NumPy memory buffers and pass them, by reference, to this invoking driver code. The important part, which I was not able to fully show here, is the setup.py. I struggled a lot with setup.py for GPUs, because you need to add NVCC as a compiler, customizing the compile step so that it understands that a .cu file should be compiled with NVCC. And if you just do that, it won't work; you need some extra hacks, which are in my GitHub code, so you can have a full look there. What this is saying is that the sources are my CUDA code and my wrapper, just as in the CPU case. I need to add certain libraries, like the CUDA lib64 path and the cudart runtime, and similarly the runtime arguments. Essentially, if you have ever used the make command or compiled C or C++ by hand, you pass the library paths with -L, the libraries to link with -l, and the include paths with -I; this is similar to that. You also need to pass some other things to NVCC, certain flags like which architecture you want to compile for, and compiler options such as the optimizations you want.
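Here is a rough sketch of the driver just described, assuming the adder_gpu kernel shown earlier; error checking is left out to keep it short, and the function name adder_driver is illustrative:

```cuda
#include <cuda_runtime.h>

// Host-side driver: allocate device memory, copy the inputs over, launch the
// kernel with enough 256-thread blocks to cover n, and copy the result back.
void adder_driver(const double* a, const double* b, double* c, int n) {
    double *dev_a, *dev_b, *dev_c;
    size_t bytes = n * sizeof(double);

    cudaMalloc((void**)&dev_a, bytes);          // allocate on the GPU
    cudaMalloc((void**)&dev_b, bytes);
    cudaMalloc((void**)&dev_c, bytes);

    cudaMemcpy(dev_a, a, bytes, cudaMemcpyHostToDevice);  // host -> device
    cudaMemcpy(dev_b, b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n - 1) / threads + 1;         // enough blocks to cover n
    adder_gpu<<<blocks, threads>>>(dev_a, dev_b, dev_c, n);

    cudaMemcpy(c, dev_c, bytes, cudaMemcpyDeviceToHost);  // device -> host

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
}
```

A Python-side wrapper would then just hand the NumPy buffer pointers into this driver, as the talk describes.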
So in conclusion, what I learned from this is: if you need to do rapid prototyping and don't want to write custom CUDA kernels, you should use Numba and CuPy. If not, it's better to write the CUDA code yourself and debug it, because when you are writing CUDA code you have to debug a lot to get good performance: one of the important things in getting good performance is figuring out how many threads to use, and likewise, if it's a complex piece of code, how you will use shared memory and how you will structure the parallel implementation. For more, you can look at the NVIDIA blog post where they go over transposing a matrix in many forms and how it's done.

For bindings, my conclusions were generally pretty simple. I preferred Cython for most of my work when deploying to production; I found the performance to be much better, but there is a steep learning curve because Cython is essentially a new language, with its own syntax for functions and its own syntax for types. If you are familiar with C, it's not that hard to pick up: you have cdef, you have cpdef, and then you have double[:], which is basically the array notation in Cython. pybind11 is the easiest to set up, and it's the only real option when you are working with C++, since it has native support for STL containers; but I found the performance of passing memory buffers was not as good, and certain issues are pretty tricky to solve in pybind11. Mainly, if you go to their docs you can find the known issues they have when dealing with data types from STL buffers. ctypes is great for simple cases; it's pure Python. I did not explore it much, but I've seen many libraries using ctypes, because it just passes pointers.

My conclusion for CUDA is that it has a steep learning curve but great performance returns in production. If you have some really performance-critical code to write, you should learn CUDA and do it, because GPUs are a very expensive resource at any company when you go to production, and I don't think most data scientists know how to utilize them properly; that's something we need to learn. I'm still learning; there is a lot to learn in CUDA. For more examples, like my k-means implementation and my pairwise distance implementation, you can check my GitHub; I'll provide the link in the references. So yeah, that's about it. Are there any questions?