So we are starting with the next talk, which is "GPU Accelerated High-Performance Computing Primer". The talk will be given by Ashwin Ashok and G. Raja Suman; both are here. Can we have a round of applause for them? I am particularly interested in this talk because I think it got the highest number of votes when it was on funnel at PyCon. Ashwin Ashok is a student at Vignan and is interested in open source and embedded systems, whereas G. Raja Suman works at TCS and is fond of embedded C and Python. So, enjoy the talk.

Hello, everybody. Before I start anything, I will make one thing clear: if anybody has done CUDA programming or OpenCL programming before, please raise your hands. You can sleep, because there is nothing new I am bringing to the plate here. What we are trying to do is raise awareness, because everybody's attitude is: if I want to process something, I'll just ship it to that really powerful server. But if you buy an NVIDIA GPU or an AMD GPU, you can do a lot of stuff faster than a server can. You have a lot of power in your machine, and I want to make sure everybody knows it. I won't bore you with APIs, I won't bore you with functions. All I want to say right now is: if you want to do stuff faster, and your algorithms are parallel or can be made parallel, you will get amazing speedups.

The other thing I want to say is that most of what I wanted to cover was already covered in the Medusa talk; most of you were there, right? It was a super talk. But I hope some of you observed one thing: when the speaker opened Python 2, can you see? On my system it says GCC, but in the Medusa talk it said LLVM. That will make all the difference in this talk. Did anybody observe that? Okay. That will make all the difference.

So, "GPU Accelerated HPC Primer": I just made that title up to pull in a crowd. The end of an era: by the time the last Pentium chips were coming to market, clock speeds had stopped making a difference. Five years ago I had a roughly 2 GHz processor; I still have a roughly 2 GHz laptop. That didn't change; only the transistor count increased, and that by itself has little to do with performance. I'm just plotting a graph of clock speed against transistor count. Basically, what I'm saying is that even though manufacturers keep putting more transistors into CPUs, the single-core performance gains are only marginal, like 1-2%. If you want to increase the performance of a processor beyond that, you need transistor-level tweaks, and I'm an electronics guy, so I won't bore you with those. But when you do that, when you increase clock frequency, your power dissipation goes up. With great power comes great responsibility; with great clock speed comes great power consumption. Remember that.

So let's go first to "parallel". What is parallel? And a little bit more parallel. I don't have to explain much: if you know multithreading, you know parallel stuff. But this kind of parallelism is a new level of thinking, highly rewarding and challenging. And the hardware is already there; you don't have to buy new hardware. If you play games, it's already in your machine. I shamelessly copied this slide from the Go programming language slides.
I'm just saying: if one guy can do a job in 10 days, 10 guys can do the job in one day. That's all I'm going to talk about today.

So, the race to many-core. Since the Pentium fiasco, where clock speeds stopped increasing, Intel has been going: let me put in another guy. Let three guys do the job. Let four guys do the job. And whatever Intel does, AMD just copies, or maybe it's the other way around; it doesn't matter. This is what changed the world: multi-core CPUs. And it's not really a new concept. I'm an electronics guy, so I deal with this crap every day. Back in the 1970s and 1980s, when these chips first came out, that's when the IEEE floating-point specification appeared, because the 8086 processor couldn't do floating-point math without a floating-point unit. How many of you have had to deal with IA-32 assembly? Awesome, you have. So what systems did back then was use another chip, the 8087, and offload the floating-point part to it. Basically, Intel today is just repeating what it did back in the 1970s and 1980s: it's bringing out Xeon Phi coprocessors, which have 61 cores. More people get the job done in less time, so everybody gets a job.

Here is the world's first open-science supercomputer, Titan. It has 18,688 nodes, and each node has a 16-core Opteron. Now this makes sense: as I was saying, more people, less time. But why the NVIDIA part? You should be asking that: the K20 chip costs a crapload of money. Why put that in? Because that is where the magic takes place. The AMD chip does almost nothing; it only does memory transfers and coordination. In most of the major supercomputers right now, the GPUs do the magic.

So why use GPUs for computing? Please read this slide. One thing: CPUs are hitting the wall. Even if you wait five years and buy a new CPU, it won't matter much; it will still be around 2.3 GHz, just with more cores. Also, the development of parallel algorithms is taking off. Over the last 15 years (I was a small kid 15 years ago, I'm just saying) we've seen a lot of parallel algorithms come up. And extremely powerful hardware is cropping up. It's not really a new idea. I know NVIDIA has a bad name in the open source community, but this is one thing they are doing right: CUDA. Why? Simple. If 4 cores do the job in less time, what happens when you put in 240 cores? So NVIDIA just goes: I'll keep on adding cores. It's a good thing.

And I don't use Apple products, but I have to give this one to Apple: they are the ones who started the OpenCL effort. So, if you want to increase the performance of a program, there are four ways. First, you can put in multiple CPUs. Second, you can use OpenMP: put #pragma omp before the for loop. Third, you can use GPUs with CUDA. Or, like 7 years ago when CUDA wasn't there, people used to fool OpenGL into thinking it was drawing graphics when it was really doing math. That was really complicated stuff.
So what OpenCL did was create a common programming language for all of these. That's OpenCL. Next.

Now, I know you guys are thinking: this is PyCon, and this guy is not talking about Python. Fine, it'll come. And where is the embedded part? We're in the embedded track, right? I'm an electronics guy; I want stuff to get smaller, not bigger. So we stop doing anything with Intel here; we replace Intel with ARM. The CPU is only doing the memory transfers and control stuff anyway, so why do we need Intel for that? ARM can do that. Have any of you seen a board that couples a powerful GPU with an ARM CPU? That thing is called the Jetson kit, and it can beat my laptop with its Intel chip. I'm just saying, those kinds of chips are coming to the market right now: the Tegra chips are hitting the market, and they are insanely fast.

Meanwhile, AMD and Intel keep pushing towards 5 GHz, 6 GHz and beyond. With great clock frequency comes great power consumption; if you want to pay the bills, go ahead. CPU architecture is about latency: how fast can I do a single task? GPU architecture is about throughput: how many tasks can I do? Is everybody getting the difference? CPU: how fast can I finish one task. GPU: how many tasks can I push through. That's why GPUs don't have 16 cores; they have hundreds of SIMD cores. Now this guy will take over.

So let's see what we intend to do with our GPU. Basically, we have our application code, which we may consider as consisting of some sequential code, something where each step depends on a previous result, that kind of sequential stuff. Let the CPU handle that; I'm not going to touch it. But we also have some computationally intensive functions, and what this programming model intends to do is shift those onto the GPU and thus increase the throughput.

And why use Python for GPU programming? It's simple. We know the design philosophy of Python: keep it simple. GPUs, meanwhile, are everything that scripting languages are not: highly parallel. So what we intend to do is shift all the work that is highly parallel onto the GPU and leave the rest with the CPU. The CPU is then largely restricted to control, and, as I said, the CPU cares about how fast a single task finishes whereas the GPU cares about how much work gets done overall. To integrate Python and CUDA for this kind of thing, we can use PyCUDA, and there is also something called PyOpenCL.

Let's see what PyCUDA is. PyCUDA is the baby of Andreas Klöckner. He developed it mainly because he is a great programmer, he was interested in this, and he wanted a much simpler way to get his stuff done. We have a kernel function which we need to run on the GPU, and this kernel is written in C. We would love to write the kernel purely in Python, but sometimes we need to make a compromise, and we'll see why. This is the basic flow of development on the CPU side: you edit, you compile, you link, and then you run.
That's for the CPU stuff, but for the GPU stuff you need to mark which part of the code belongs to the GPU and which part belongs to the CPU. So here we move on to something called a SourceModule. This goes through an LLVM-based compiler and PTX: the kernel gets compiled to PTX and then runs on the GPU. But you see the box over here? That whole box is PyCUDA, and we literally don't do anything; PyCUDA takes care of all of it for us. So again, it follows the design philosophy of Python and makes our task simple, rather than the blatant C workflow where you have to do all those steps yourself. "So Python." Yeah, you could read that; that's a pun kept by my brother.

Let's go through the setup and a demonstration of PyCUDA, since a lot of queries have been pouring in on how to set this up and how to use it. Suppose you are using a laptop with an NVIDIA GPU. You need to install its drivers first, and then you need to import some basic stuff before starting your code: the PyCUDA driver module, some tools, and pycuda.autoinit.

Then we have the heart of GPU programming: the kernel. The kernel is essentially how you say that this part of the program belongs to the GPU and must be executed by the GPU. We use the prefix __global__. The kernel is a C function, and this prefix tells the compiler to generate GPU code, not CPU code, when compiling it. Inside the kernel function we have our code, and it also has a return type and a function name along with a few arguments.

Let's see how this looks. What do we mainly intend to do with threads? Suppose we have some computationally intensive stuff, say a for loop that performs some operation 128 times. What my kernel does is, instead of performing the 128 iterations on a single processor, it creates multiple threads and performs the 128 operations across 128 threads of the GPU, thereby reducing the time taken.

This is a Python conference, but please bear with us: you need to know some of this before diving into PyCUDA. We have the concepts of threads, blocks and grids. A block is essentially a group of threads, and a grid is a group of blocks. This concept is very useful for things like image processing: 2D processing, 3D processing.

So here is a basic example of what our kernel looks like. Consider having a for loop that performs a multiplication operation. What we do here is take those 128 loop iterations and give each one to a single thread, thereby reducing the number of sequential steps. That is how a simple kernel function looks. Later we'll move on to bigger examples in image processing and signal processing.

I asked him to drop certain slides because you guys are already bored; you have seen too much code already, is it not? Let's be frank. I thought, why not show you some demonstrations? I don't want to show you big things; I want to show you small things and show that you can do it. You can write this.
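(For reference, here is a minimal sketch in the spirit of the kernel just described, essentially the classic PyCUDA element-wise multiply example. The kernel name multiply and the 128-element size are illustrative assumptions, not necessarily the speakers' exact code.)

```python
import numpy as np
import pycuda.autoinit              # sets up a CUDA context on import
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# The kernel is C code inside a Python string. __global__ marks it as GPU
# code; each of the 128 threads handles one iteration of the original loop.
mod = SourceModule("""
__global__ void multiply(float *dest, float *a, float *b)
{
    int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")
multiply = mod.get_function("multiply")

a = np.random.randn(128).astype(np.float32)
b = np.random.randn(128).astype(np.float32)
dest = np.zeros_like(a)

# One block of 128 threads; cuda.In/cuda.Out handle the host<->device copies.
multiply(cuda.Out(dest), cuda.In(a), cuda.In(b),
         block=(128, 1, 1), grid=(1, 1))

assert np.allclose(dest, a * b)
```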
You can do this. Can everybody see my code? Let me make the view bigger. This is the simplest PyCUDA program ever. What it does is simply take two arrays and multiply them element-wise; that's all it does. I am taking two arrays of 2^16 elements each and multiplying them. Then I am benchmarking, well, it's not even benchmarking: how much time does the CPU take, and how much time does the GPU take? (That's not a goof-up; my laptop fell.) You can see the CPU time is 0.0007 or so, and the GPU time was 0.11. And we were promising you that the GPU is fast!

What actually happened in the background is that the GPU code has to be compiled, and this compilation takes place at runtime. Someone wanted to know how this compilation happens. NVIDIA isn't telling us everything, but what I know is that there is a language called PTX: just as there is assembly for Intel, there is PTX for the GPU, and there is an LLVM-based toolchain to compile this C99 code into PTX. That's what happens in the background, and that's what is slowing this run down. If I take 2^32 elements, this delay won't matter, because by then the CPU will be lagging far behind. We'll see that.

Let me run this program, add.py. It's basically the same code as before; even if you can't read it, it doesn't matter. All I'm doing is repeating the experiment over and over again, and each time I repeat it, I increase the number of elements by a factor of 2. Got me? I repeat the experiment 28 times, doubling the array each time. So I ran it; the CPU side is what takes long here. Can you guys see the difference? When I had 2^1 elements, 2^2 elements, 2^3 and 2^4 elements and so on, the GPU was actually slower. Anybody got me? The GPU was slower. Why? You see, the delay is consistent, and that delay is the time taken to compile the PTX code. But once I hit numbers like 2^20, 2^21, you decide which is faster. So basically, if your CUDA programs are slow, it is because you're either not keeping the GPU busy enough or you're feeding it a very small problem. It practically takes an array of 2^64 elements to kick the GPU out of its slumber. And this is the graph of GPU time versus CPU time: as the element count increases, the GPU actually performs faster. I think this is a log scale.

Let me show you another program. What should I show? Anybody into fractals? Anybody? Fractals are self-repeating stuff: you take a triangle, and on each side of the triangle you put another triangle, and on each side of that you put another, and if you keep doing it you get a fractal. Fractal antennas are used in electronics. Now, if you render a fractal in a pure Python implementation, it's really slow, as slow as finding the 3040th Fibonacci number; it's that slow. But the GPU is really fast. This fractal has a resolution of 2560 by 2560, and it's just fast. I dare you to get this speed in pure Python. And I promised you self-repeating stuff: I zoom into this part, same thing. Zoom into this part, same thing.
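(The add.py demo itself isn't reproduced in this transcript, so here is a minimal sketch of the doubling benchmark described above, using PyCUDA's gpuarray wrappers; the structure is an assumption on my part, not the speakers' exact code.)

```python
import time
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

# Repeat the experiment, doubling the element count each time.
# Lower the upper bound if your GPU runs out of memory.
for k in range(1, 29):                      # 2^1 .. 2^28 elements
    n = 1 << k
    a = np.random.rand(n).astype(np.float32)
    b = np.random.rand(n).astype(np.float32)

    t0 = time.time()
    a * b                                   # element-wise multiply on the CPU
    cpu_t = time.time() - t0

    t0 = time.time()
    (gpuarray.to_gpu(a) * gpuarray.to_gpu(b)).get()  # transfers included
    gpu_t = time.time() - t0

    print("2^%2d  CPU %.6f s  GPU %.6f s" % (k, cpu_t, gpu_t))
```

For small k the GPU column is dominated by the constant compile-and-transfer overhead, exactly the pattern the speaker points out; only at large k does the GPU pull ahead.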
And I promised you stuff in signal processing and image processing. This is another example: I have a color image on my hard drive, and this converts it from RGB to grayscale. It's a very simple algorithm, but it's fast; you won't really notice how fast, even though the file size is really huge. Can you observe the difference? The other one is grayscale now, right? And this is a simple image processing application in PyCUDA where you just blur the image: a filter gets applied, and so on.

I actually wanted to do so much more, but you guys are bored and tired, and I don't want to tire you any longer. There's a lot you can do in PyCUDA, many, many times faster: faster than Spark, faster than Medusa, faster than anything else you can run. The only thing is, you have to learn how to use PyCUDA, and frankly writing kernels is a pain in the you-know-what. So what Andreas Klöckner is trying to do is create Pythonic wrappers. But I'll tell you the problem with wrappers, and with being lazy, using Monte Carlo.

Anyone heard of Monte Carlo methods? You have? Awesome. In Monte Carlo methods, we simply keep doing the experiment again and again and again until we arrive at a result. In this example, I'm calculating the value of pi. It takes some time. Did it break, anybody? No; I deliberately put this one last. There is something to observe here: the GPU took 19 seconds, and the CPU took something like 0.06 seconds. Why? Why did the GPU fail here? It failed because I was using the wrappers. You see how easy this code is? I didn't write C code like before. But when Python keeps calling into C code underneath, a lot of memory transfer takes place, and a lot of memory transfer will only slow the program down. So if you want to test whether your idea works on the GPU, you can use the wrappers; but if you want real speed, you need to go write the C kernel yourself. So just be careful when writing kernels.

Okay, Q&A session. Any doubts? No doubts? Did we bore you? I'm sure we bored you, but...

Which one, sir? This one? Is it doing ray tracing? No, I'm not doing ray tracing; I'm just calculating the value of pi. Oh, fine, you guys have time to listen to more of my crap. Okay. I didn't really tell you what formula I was using to calculate pi, right? All I did was put a circle inside a square and start throwing stones. Then I counted how many stones fell inside the circle, and the total number of stones. That ratio happens to be pi/4, because the ratio between the areas of the circle and the square is pi/4. Simple. Happy?

There is one last thing I haven't told you, but which I want to show, because you have been very patient with me. People are asking: do I do ray tracing, what kind of stuff do I do with PyCUDA? So this is the simplest 3D example I have. What I'm doing is again trying to calculate the value of pi, this time by putting a sphere inside a cube, and I'm representing the process in graphics.
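(To make the wrapper discussion concrete, here is a minimal sketch of the circle-in-square pi estimator using PyCUDA's gpuarray and curandom wrappers. This is a reconstruction under stated assumptions, not the speakers' demo; note how each operator below is its own kernel launch and the final .get() copies everything back to the host, which is the kind of overhead being blamed for the 19 seconds.)

```python
import numpy as np
import pycuda.autoinit
from pycuda import curandom

n = 1 << 20                   # number of "stones" thrown
x = curandom.rand(n)          # uniform [0, 1) samples, generated on the GPU
y = curandom.rand(n)

r2 = x * x + y * y            # three separate element-wise kernel launches
inside = np.count_nonzero(r2.get() <= 1.0)   # .get() copies back to the host

print("pi is approximately", 4.0 * inside / n)   # circle/square area ratio = pi/4
```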
The source code for this is pretty small; I don't want to scare you with hundreds of lines of code. And I'm relying heavily on the Qt libraries: Qt does most of the drawing work. The other awesome thing is that not one but two GPUs are being used right now. My laptop is an Optimus laptop, so it has an Intel CPU with an Intel GPU, plus an NVIDIA GPU. The NVIDIA GPU simply calculates, the Intel GPU draws stuff on the screen, and the CPU just coordinates the memory transfers. It takes some time to get used to, but once you understand it, it's easy.

So basically this talk is a primer, a starting point. Anybody can do CUDA, and everybody should do CUDA if they want faster code, provided your code allows it. Okay, I'm done. Anybody else? He has a query; you can switch on your mic.

It's on GitHub; we put it on GitHub. And it's on funnel; you can refer to funnel. One misconception I want to clear up, though: we don't put a lot of code on GitHub; a lot of it sits on our hard drives, so don't expect all of that code there.

This is a follow-up to the question you asked after the Spark talk. I'm over here. You were asking him whether it was possible to use GPU computing for Spark. But the whole idea of Spark is that everything is done in memory; you just throw gigabytes of RAM at your problem. That is not really possible with a GPU, right? Because if you are dealing with a GPU-and-CPU setup in a massively parallel framework, your main bottleneck is going to be memory transfers between the CPU and GPU; that is at least my understanding. So how exactly do you think something like Spark can be done using GPUs?

Something like what Andreas Klöckner is doing here with my Monte Carlo stuff. I mean, that version is not optimized; can you see how he's calling functions through wrappers like this? I'm thinking of something similar, if Spark allows it. Spark has ML functions, machine learning functions, and some of them are parallel in nature; autoregressive methods and similar methods are actually parallel in nature. The Spark speaker was talking about parallel stuff, so some of that code might be offloaded to CUDA or OpenCL; OpenCL is much more liberal when it comes to things like this. All we want to say is: keep the stuff the CPU is good at with the CPU, don't throw it at the GPU. But if you find something in your code that the GPU can do, something really parallel that you can isolate, move it onto the GPU. That's what we want to see. A lot of map-reduce stuff can be done on the GPU, and it's really fast; map-reduce is really fast on a GPU.

Any other questions? I was a little confused by your last line, when you said two GPUs were being used. Is that the same for all the demos you ran on the GPU? Maybe; it's a likely scenario, right? We'll see. When I run lspci here, it shows both the Intel and the NVIDIA devices. But when I run lsmod, the NVIDIA driver doesn't show up, and yet the programs still run, so I think both of them are running at the same time. See, NVIDIA doesn't talk about what happens in the background; there's a problem with that.
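(A quick aside on the two-GPU question: PyCUDA can enumerate the CUDA-capable devices it sees, and on an Optimus laptop that will typically be just the NVIDIA chip, since the Intel GPU is not a CUDA device. A small sketch using PyCUDA's driver API:)

```python
import pycuda.driver as cuda

cuda.init()                                   # initialize the driver API; no context needed
print("CUDA devices:", cuda.Device.count())   # Intel integrated GPUs will not be listed
for i in range(cuda.Device.count()):
    print(i, cuda.Device(i).name())
```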
AMD is really liberal with their code, so if you want to know what happens inside AMD's stack, you can find out. But what happens inside NVIDIA's? Nobody outside really knows. Okay, thank you. Questions? Oh, a lot of them; don't be hard on me.

Just one question: are you running Bumblebee? You said you have an Optimus laptop, right? The last time I checked Bumblebee, it was a pain to get NVIDIA and CUDA to work together. Has the situation improved? Not quite. I'll show you; I'm trying to enlarge the screen, but can you read the font over there? I'm using Manjaro, not Ubuntu. If I use Ubuntu, Fedora, or these mainstream distros, well, the problem with parts of the open source community (I mean, I don't have a problem, they do) is that they have egos this big: no, this is the open driver, I want an open system with open drivers. But if you don't know how the hardware works, how can you design the damn driver? Leave drivers to the professionals; they have the hardware documentation that you don't. Yet Ubuntu keeps trying to do "my driver, my display server, my stuff". If you want to do PyCUDA on Linux, please use Arch or Manjaro or Slackware or other non-mainstream distros like that.

One last question: in PyCUDA, the examples only contain C code that is being wrapped and executed on the GPU, right? Basically, in every PyCUDA example (I've only looked at the examples, by the way) there is always a sizeable amount of C code. Well, there are some CUDA programs with absolutely no C code in them; the C code is only in the background. So is it possible to avoid writing C code and still make things run on the GPU from Python? You basically don't want to write any C code at all, right? Yeah, you can. See, NVIDIA has teamed up with Continuum Analytics, and they're bringing out NumbaPro and Accelerate. Is it going to be free? No; I got an academic license through my HOD in college, but it's not free. Numba itself doesn't have that functionality; Numba is open source, and it's NumbaPro, the one with GPU acceleration built in, that is not free. And even with NumbaPro's implementation of GPU support, you still need knowledge of threads, blocks and such. Two years ago (my memory is fading, but two years ago) there was this project called Copperhead; it was a layer that really resembled the STL, and Copperhead let you stay in pure Python. That is not being maintained anymore; I don't know why. Theano is not being maintained, or Copperhead? Copperhead is not being maintained. Okay, thanks a lot.

Any other doubts? Put your hands up. That side. Yeah, hi. It was my understanding, last time I checked, that consumer-level GPUs only run single-precision calculations. Is that still true? No. With CUDA (I'm not so sure about OpenCL) they have this compute capability number, right? Compute 1.0 and so on; I believe compute 3.0 supports double precision, but you should check. I have a really crappy GPU; mine is like 4-5 years old, so you check. Okay, so we'll just wrap this up.
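(On the double-precision question: to my knowledge, double-precision support actually arrived earlier, with compute capability 1.3, and rather than guessing you can query your own card. A minimal sketch with PyCUDA:)

```python
import pycuda.driver as cuda

cuda.init()
major, minor = cuda.Device(0).compute_capability()
# Double precision was introduced with compute capability 1.3; anything at
# or above that handles doubles, while older cards are single-precision only.
supported = (major, minor) >= (1, 3)
print("compute %d.%d -> double precision %s" % (major, minor,
                                                "yes" if supported else "no"))
```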
So give them a round of applause. I guess you people are interested in this, so if you want to catch them, you can catch them afterwards, because we have another session here right now. And a pretty interesting one.