Why use GPUs in neural networks? The short answer is that GPUs speed up matrix multiplication, which is the most fundamental operation in neural networks. But how exactly do they speed up this multiplication?

Before we get started, give this video a like and subscribe if you're new here. I make content as frequently as I can between this and my day job, and some of you have reached out to me personally to say that you like my content, which I really appreciate. There will be more to come in the future, so be sure to leave a like, hit subscribe, and hit that bell for notifications when I post. I really do appreciate the support. We're growing slowly, but we'll get there.

So how exactly does a GPU speed up matrix multiplication? On the hardware side, GPUs have three main advantages over CPUs.

The first is high memory bandwidth, that is, the amount of data that can be moved to and from memory per unit of time. Because their bandwidth is high, GPUs can move large chunks of a matrix between their different parts. In the last video, we saw this by comparing a CPU to a small yet fast Ferrari and a GPU to a slow but very high-capacity truck.

The second advantage is parallelization: large matrix chunks can be moved at the same time, which is like using a whole fleet of trucks. While the GPU is processing the current matrix chunks, it is already fetching more chunks from system memory, so the GPU is never idle.

The third advantage is fast memory access. A GPU's registers and caches are individually small, but there are a lot of them, so collectively we can keep a lot of data close to the compute units and access it very quickly.

Architecturally, a GPU has a number of streaming multiprocessors (SMs), which make up the processing part of the GPU. There is a global memory that all the SMs share, and each SM also has its own local memory that other SMs cannot access.

This hardware has a lot of potential, but it's pointless if we don't use it in a clever way. So let's look at the algorithms a GPU uses to multiply large matrices in a way that exploits this hardware to the fullest.

Let's start with a quick review of matrix multiplication. To multiply matrices A and B, the number of columns in A must equal the number of rows in B. The product has the same number of rows as A and the same number of columns as B. The first element is computed by taking the first row of A and the first column of B, multiplying them element-wise, and summing the results. The element in the first row and second column comes from the first row of A and the second column of B, again multiplied element-wise and summed. The other elements follow the same pattern: in general, the entry in row i, column j of the product is the dot product of row i of A with column j of B.

That's how we would do it by hand. But what if we were multiplying two matrices that were 10,000 x 10,000 in dimension? How would a computer do this, and how would it do it efficiently?

The first case is brute force: we multiply every pair of numbers one at a time and sum them up, just as we would by hand. The problem with this approach is that it takes way too long. GPUs take time to fetch data, and if we process one element at a time, we make no use of their high bandwidth or their parallelism.

That brings us to the second case: using the GPU's global memory, which is shared by the streaming multiprocessors. Global memory may not be large enough to hold both large matrices, but it can be large enough to hold the i-th row of matrix A and the j-th column of matrix B. We can fetch these into global memory, so we use more of the bandwidth, we use some parallelism to fetch the data, and processing is faster than in the brute-force case because we are now working out of GPU memory.

But we can do even better with block multiplication. Instead of relying only on global memory, we can use the shared memory inside each streaming multiprocessor. Each SM computes one block of the final matrix, but we can only assign a block to an SM if every element of that sub-matrix block can be computed from the same shared data, and that data is small enough to fit in the SM's memory. Block multiplication makes exactly this possible.

First, the math behind block multiplication. Say we want to multiply two 4x4 matrices. We can split each matrix into 2x2 blocks, and the final product can then be written in terms of products of these blocks. It looks messy, but the main takeaway is that to compute all the elements in a 2x2 block of the result, say C11, we take a sum of products of blocks from the original matrices: C11 = A11*B11 + A12*B21.

This result is handy precisely because of the small memory budget of the multiprocessors. Say one SM is responsible for computing block C11. It loads the sub-matrix blocks A11 and B11 and computes an intermediate sum. It then evicts those blocks from memory, loads A12 and B21, takes their product, and adds it to the intermediate sum, which gives the complete block C11. And while this SM computes C11, the other SMs compute the other blocks of the final matrix in parallel.

This is much faster because we are working out of the memory inside each streaming multiprocessor. We use more of the memory bandwidth to fetch blocks of data into the GPU and into the SM memory, and there is parallelism everywhere: in moving data in from system memory and in the simultaneous processing across the SMs.

Clearly, though, to make block multiplication work there are a few parameters we need to pin down. We need the optimal block size and the number of streaming multiprocessors to use, and we also need to decide which blocks go to which SM. Well, CUDA takes care of determining all of this under the hood, and all we need to do is tell our system to use a GPU, usually with a single line of code. The sketches below make all of this concrete.
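To make the cases above concrete, here is a minimal sketch of the straightforward kernel, written with the Numba library. Neither Numba nor any of the names here come from the video; they are just one convenient way to write CUDA kernels from Python. Each thread computes a single output element by walking one row of A and one column of B, and every read goes to slow global memory:

```python
from numba import cuda

@cuda.jit
def matmul_naive(A, B, C):
    # One thread per output element of C.
    i, j = cuda.grid(2)  # this thread's (row, column) in the output
    if i < C.shape[0] and j < C.shape[1]:
        acc = 0.0
        for k in range(A.shape[1]):
            # Every A[i, k] and B[k, j] is a separate trip to global
            # memory, so this kernel is limited by memory traffic.
            acc += A[i, k] * B[k, j]
        C[i, j] = acc
```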
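Block multiplication is what the classic "tiled" shared-memory kernel implements. Below is a sketch of that pattern, again assuming Numba; TPB, the tile size, is one of the tuning parameters mentioned above, and for simplicity the sketch assumes the matrix dimensions are exact multiples of it. Each thread block stages one tile of A and one tile of B in its SM's shared memory, accumulates the partial product, then moves on to the next pair of tiles, which is exactly the C11 = A11*B11 + A12*B21 procedure described earlier:

```python
from numba import cuda, float32

TPB = 16  # tile (block) size; must be known at compile time

@cuda.jit
def matmul_tiled(A, B, C):
    # Shared memory inside one SM, holding the current tiles of A and B.
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)                       # this thread's element of C
    tx, ty = cuda.threadIdx.x, cuda.threadIdx.y

    acc = 0.0
    for t in range(A.shape[1] // TPB):        # walk tile pairs along the shared dim
        # Each thread loads one element of each tile into shared memory.
        sA[tx, ty] = A[x, t * TPB + ty]
        sB[tx, ty] = B[t * TPB + tx, y]
        cuda.syncthreads()                    # wait until both tiles are loaded

        for k in range(TPB):                  # multiply out of fast shared memory
            acc += sA[tx, k] * sB[k, ty]
        cuda.syncthreads()                    # don't overwrite tiles still in use

    C[x, y] = acc

# Launch: a grid of TPB x TPB thread blocks, one block per output tile,
# which CUDA schedules onto the streaming multiprocessors for us:
#   matmul_tiled[(n // TPB, n // TPB), (TPB, TPB)](d_A, d_B, d_C)
```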
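And the "single line of code" at the end is what this looks like from a deep learning framework. The video doesn't name one, so here is a sketch assuming PyTorch:

```python
import torch

A = torch.randn(10_000, 10_000)
B = torch.randn(10_000, 10_000)

# The single line: move the data to the GPU. The matmul below then
# dispatches to tuned CUDA kernels, with tile sizes, block assignment,
# and SM scheduling all handled under the hood.
A, B = A.to("cuda"), B.to("cuda")
C = A @ B
```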
And that's it. I hope this video was interesting. There is some interesting research on further algorithmic speedups that I'll reference in the description down below. But that's all for now. I hope you guys enjoyed this. Please like and subscribe for more content on machine learning, deep learning, and data science, and I will see you in the next one. Bye bye.