Why use GPUs in neural networks? There is a hardware and a software aspect to this answer, and so this is going to be a two-part series. In this video, we're going to take a look at what the GPU hardware provides that makes neural net training much faster. And in the next video, we'll look at the software and algorithmic changes we make to speed up processing on GPUs.

What is 5 times 6 times 7? 210. 210. What is 8 times 3 times 5? 120. 120. What is 7 times 12 times 3? 252. 252. The CPU is blazing fast, and the GPU is almost as fast but slightly trails behind. What is matrix A times matrix B? This. CPU, your answer? This. Looks like the CPU took its own sweet time.

GPUs are used in neural nets because of this ability to perform matrix multiplication so fast. But why does this happen? Why are CPUs great with scalar multiplications while GPUs are better with matrix multiplications? Three main reasons: GPUs have a larger memory bandwidth, they use parallelization, and they have faster memory access than CPUs. We're going to delve into these three points in detail and then show some PyTorch code in action that really demonstrates the difference in speed. Let's get started.

So here's a block diagram. We have system memory; this is the main memory, which contains the matrices we need to multiply. We have a CPU and a GPU, and each of these also has its own memory, which I'll just label as "memory" without the details for now. Let's make an analogy here: the CPU is a Ferrari, and the GPU is a truck.

Case one: performing scalar multiplication. The CPU can get this data with a fetch operation to memory. The amount of data it can fetch is pretty tiny, but it's large enough to bring in a few floating point numbers within a couple of trips, and these fetches happen blazing fast, like a Ferrari. A GPU's fetch operation, on the other hand, can transmit a lot more data, but it is not optimized for speed, so it's like a truck: it can take more time to fetch and process small chunks of data. That's why, for scalar multiplications, GPUs can lag behind CPUs.

Case two: multiplying two large matrices. The CPU has a super fast fetch operation, but each fetch can only transmit a tiny amount of data to and from memory, so it can take thousands of trips to fetch all the required data. While one batch is being fetched, the data already in the CPU's memory is processed, freeing it up for the next inputs. This is where GPUs work much better: in a single trip, a GPU can fetch a much larger amount of data from RAM into its own memory, so it doesn't need to make nearly as many trips. In more technical terms, GPUs have a higher memory bandwidth than CPUs. Memory bandwidth is the rate at which data can be moved to and from memory, or in our analogy, how much data fits in a single trip, and it is one of the main reasons GPUs have an edge over CPUs for large matrix multiplication.

But even if we deliver a lot of packages, that is, a lot of data, in one trip, the GPU's processors remain idle for a while. After all, they're too fast and the truck is just too slow. So instead of having a single truck, we can have a fleet of trucks. This way, the GPU's processors never need to wait around; they always have data to work with. This notion of using a fleet of trucks is called parallelization. Coupling large memory bandwidth with parallelization reduces any time a GPU would spend waiting. So basically, we fetch a lot of data and we do it fast. Matrix multiplication happens to parallelize especially well, as the short sketch below shows.
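To make the parallelization point concrete, here's a minimal sketch, not from the video, of why matrix multiplication suits a fleet of trucks so well: every entry of the output matrix is an independent dot product, so thousands of them can be computed at the same time. The shapes and variable names are my own choices for illustration.

```python
import torch

# A tiny example: C = A @ B with arbitrary small shapes.
A = torch.randn(4, 3)
B = torch.randn(3, 5)

# Each output entry C[i, j] is the dot product of row i of A and column j of B.
# No entry depends on any other entry, so all of these dot products could in
# principle run at the same time. That is the "fleet of trucks".
C = torch.empty(4, 5)
for i in range(4):
    for j in range(5):
        C[i, j] = torch.dot(A[i, :], B[:, j])

# A GPU effectively hands these independent dot products to many processors
# at once; torch.matmul performs the same computation in a single call.
assert torch.allclose(C, A @ B)
```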
But GPUs offer something more: faster memory. I used the term memory very vaguely before, but what I'm really talking about are the caches and the registers. GPUs have a similar cache structure to CPUs, but a GPU's L1 and L2 caches are smaller than a CPU's L1 and L2 caches, and that smaller size means they can be accessed much faster. On top of that, the streaming multiprocessors of a GPU have a bunch of registers, which are super fast, and GPUs have upwards of a thousand times more registers to play with than CPUs. All of these memory enhancements together make computations just blazing fast. It's the combination of these three points that makes GPUs faster than CPUs for matrix multiplication, especially for large matrices. And faster matrix multiplication means that the operations in deep learning run faster too. Cool.

But how do we actually make use of GPUs as programmers? This is where CUDA comes in. From a programmer's view, CUDA provides an API that lets us access the components of a GPU, like the streaming multiprocessors, the caches, and the registers. CUDA already provides a nice high-level abstraction, but deep learning frameworks like PyTorch make life even easier: we don't even need to know about the inner components. We just treat the whole thing as one big abstract unit called a GPU, and we're good to go.

Let's actually see how matrix multiplication with GPUs performs. I'm in Google Colab, an environment that lets you run your code in chunks and see the outputs directly. I have some code cells here that we can walk through; a sketch of these cells is included at the end for reference.

The first cell is a scalar multiplication with the CPU: we're just taking the square of a number. PyTorch uses tensors as its fundamental building block, and you can think of tensors as wrappers around your scalars and matrices, for any number of dimensions. I'm just squaring a 1-by-1 tensor here. We're using a command called %timeit to measure the execution time of this block of code; these are known as magic functions, if you want to look them up for reference. Looks like the CPU took 5.26 microseconds to do this. That's super fast.

In the second cell, we have similar code, but this time we multiply a 10,000 by 10,000 matrix with itself, and this took 11.8 seconds. It probably seems like it's taking longer when you actually run it, but that's only because %timeit executes it three times.

For the rest of the code, you need to set up a GPU. Go to the menu bar and click on Runtime; in the dropdown, go to Change runtime type, and from the Hardware accelerator dropdown, select GPU. This initializes the GPU. We now have to tell PyTorch to use it: we call torch.device to get a reference to the GPU, and the .to(device) call tells PyTorch to store and process the variable Z using the GPU. Let's run this cell. Okay, so that took 35.2 microseconds, which is slower than the 5.26 microseconds we got without the GPU. For tiny scalar work, the GPU really is slower.

Let's now run the large matrix multiplication with the GPU. Okay, wow, that took less than a second, way faster than the 11.8 seconds it took before. Pretty slick, right?

What I explained here is only half the reason GPUs are blazing fast with matrix multiplication. In the next video, we're going to take a look at how we can change the matrix multiplication algorithm itself for faster processing, but that's all I've got for you now. Hope you all enjoyed what you saw. Click one of these cards to see some of my amazing work, and I will see you very soon. Bye-bye.
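As promised, here's a minimal sketch of the four Colab cells from the walkthrough. The transcript doesn't show the actual code, so the variable names (other than z, which the video mentions) are my own stand-ins, and the %timeit lines are IPython/Colab magics that only work inside a notebook. One caveat worth knowing: GPU operations launch asynchronously, so precise benchmarks usually call torch.cuda.synchronize() before reading a timer; %timeit's repeated runs hide most of that here.

```python
import torch

# Cell 1: scalar multiplication on the CPU (squaring a 1x1 tensor).
x = torch.rand(1, 1)
%timeit x * x                  # video result: about 5.26 microseconds

# Cell 2: a 10,000 x 10,000 matrix multiplied with itself, still on the CPU.
y = torch.rand(10_000, 10_000)
%timeit torch.matmul(y, y)     # video result: about 11.8 seconds

# Cells 3 and 4 need a GPU runtime:
# Runtime -> Change runtime type -> Hardware accelerator -> GPU.
device = torch.device("cuda")

# Cell 3: the same scalar multiplication, now on the GPU.
z = torch.rand(1, 1).to(device)    # .to(device) moves the tensor to the GPU
%timeit z * z                  # video result: about 35.2 microseconds (slower!)

# Cell 4: the large matrix multiplication on the GPU.
w = torch.rand(10_000, 10_000).to(device)
%timeit torch.matmul(w, w)     # video result: under a second
```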