Good afternoon, everyone. Can you hear me at the back? Give me a thumbs up if you can. My name is Sam, and I work with Alfonso Labs. We work on television data, and I specifically work on video AI, applying vision algorithms to television video. I'm going to talk about using CUDA and OpenCV together to analyze video and apply whatever machine learning or vision algorithms you want. We'll be using a very simple example to do that; it's not a very complex problem: detecting scene transitions in a stream of video. That's what we'll use as a demonstration for applying CUDA and OpenCV. We'll talk about video decoding using FFmpeg to decode the videos, and eventually we'll use OpenCV to analyze the video frames.

So the algorithm we are going to talk about is detecting scene transitions. The idea is to detect when one scene changes from one shot to the next. The algorithm is very simple: you compare the pixel values of consecutive frames, and if the average of the difference is higher than some threshold, you declare it a transition. It's a very basic algorithm, and the idea is to use it to compare execution times when you run it entirely on the CPU versus when you offload the computations to a GPU.

The first thing to talk about with video is the video codec, which is basically software that compresses and decompresses digital video. A lot of people confuse video codecs with containers, but they're very different things. If you want to analyze video frames, you need some mechanism to decode them, and that's what video decoders are for. Video decoding basically means decompressing the video into raw frames that you can then use in your analysis.

Now, there are two ways to do that. One is a software decoder, which uses CPU cycles to perform the decompression. The other is a dedicated chipset, dedicated hardware, that handles the decoding. Using hardware is, of course, better, not necessarily because it's faster, but simply because you free up your CPU for doing something more useful. A lot of hardware is available, and NVIDIA has a lot of hardware for video decoding. According to the NVIDIA website, NVIDIA GPUs contain hardware decoders which can provide fully accelerated hardware decoding and encoding. But this is only partially true: you need to spend a lot of money, close to $500, if you want a GPU that allows unlimited encoding sessions at a time. Most of the GPUs support two encoding sessions in parallel and, I think, around 10 decoding sessions at a time. NVIDIA provides a very comprehensive Video Codec SDK, and a lot of people have built things on top of it. FFmpeg has support for using the NVIDIA video codec.

Coming to FFmpeg, a lot of people here will be aware of it. FFmpeg is basically an open source library for doing almost anything with media: transcoding, encoding, decoding. It's written in C. The only catch is that if you want to use it with NVIDIA's codec, you need to build it from source, which is a bit tricky, but it should be pretty straightforward if you follow the docs; NVIDIA provides them.
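For illustration, here is a minimal sketch of the hardware-decode path through OpenCV's cudacodec module, assuming OpenCV was built with CUDA and the NVIDIA Video Codec SDK; the file name is a placeholder, and the exact reader API can vary slightly between OpenCV versions.

```cpp
// Minimal sketch: hardware-accelerated decoding via OpenCV's cudacodec module.
// Assumes OpenCV built with CUDA and NVIDIA video-decode support.
// "input.mp4" is a placeholder file name.
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudacodec.hpp>
#include <iostream>

int main()
{
    // Frames are decoded by the GPU's hardware decoder and stay in GPU memory.
    cv::Ptr<cv::cudacodec::VideoReader> reader =
        cv::cudacodec::createVideoReader(std::string("input.mp4"));

    cv::cuda::GpuMat frame;
    int count = 0;
    while (reader->nextFrame(frame)) {
        ++count;  // ...run per-frame GPU computations here...
    }
    std::cout << "Decoded " << count << " frames on the GPU\n";
    return 0;
}
```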
Coming to the second part, that is CUDA. GPUs basically gained significance once people started using machine learning and found that GPUs can accelerate computations tremendously. That's when NVIDIA came out with the CUDA architecture, which lets you write part of your code for the GPU. So you can have an application running on the CPU that offloads some functions, called kernel functions in CUDA's architecture, to the GPU. CUDA basically provides a C/C++ extension, and you can offload the heavy lifting, the heavy arithmetic, to the GPU. Because GPUs are designed for parallel processing, they will hopefully compute it faster.

OpenCV is a very popular and very widely used free and open source library for computer vision. The OpenCV website claims it has around 14,000 active users. It provides bindings in Python, Java, and C++, and it also has bindings which allow you to use native hardware acceleration, and specifically CUDA. So that's great for us.

There's a slide missing here, but basically it talked about how CUDA helps you accelerate. Imagine you have two vectors with, say, 64 elements each. A CPU would process them sequentially, whereas a GPU would have an individual thread working on each element of both vectors. So essentially a GPU does not speed up any single operation, but it maximizes your throughput. I hope that makes sense.

So, using OpenCV with CUDA. OpenCV, as I said, has CUDA bindings. You again have to build it from source if you want to use it with CUDA. It has a dedicated CUDA namespace with a lot of wrappers. The catch is that if you want to use the NVIDIA decoder for decoding video frames, you need to use it through FFmpeg, so you need FFmpeg built with GPU support to do hardware video decoding with OpenCV.

The scene detection we talked about requires a few steps. The first step is decoding the video to get individual frames, and then you perform computations on each frame. Let's see how that works. You decode, and then you have two consecutive frames in the HSV color space, and you can offload the pixel-by-pixel subtraction to the GPU. Because the GPU performs the computation on every pixel in parallel, this should be faster than just running it on the CPU.

These are some functions you can use: you can perform a matrix subtraction, you can find the absolute sum of all the pixel values of a matrix, and the last step is just averaging across the three channels. Again, a video is basically a collection of images, and each image is essentially a matrix. Because each pixel of a matrix can be operated on individually, you can use a GPU to speed this up: every pixel is independent of the others when you're performing a pixel-by-pixel comparison.
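To make that concrete, here is a minimal sketch of the per-frame comparison using OpenCV's cv::cuda functions, assuming OpenCV was built with CUDA; the input frames are assumed to already be BGR GpuMats in GPU memory, and the threshold value is a placeholder.

```cpp
// Minimal sketch of the scene-transition check on the GPU.
// Assumes OpenCV built with CUDA; the threshold value is a placeholder.
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaarithm.hpp>
#include <opencv2/cudaimgproc.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

bool isSceneTransition(const cv::cuda::GpuMat& prevBgr,
                       const cv::cuda::GpuMat& currBgr,
                       double threshold)
{
    cv::cuda::GpuMat prevHsv, currHsv, diff;

    // Convert both frames to the HSV color space on the GPU.
    cv::cuda::cvtColor(prevBgr, prevHsv, cv::COLOR_BGR2HSV);
    cv::cuda::cvtColor(currBgr, currHsv, cv::COLOR_BGR2HSV);

    // Pixel-by-pixel absolute difference, computed in parallel on the GPU.
    cv::cuda::absdiff(prevHsv, currHsv, diff);

    // Sum the differences per channel on the GPU, then average across the
    // three channels and all pixels.
    std::vector<cv::cuda::GpuMat> channels;
    cv::cuda::split(diff, channels);
    double total = 0.0;
    for (const auto& ch : channels)
        total += cv::cuda::sum(ch)[0];
    double meanDiff = total / (3.0 * diff.rows * diff.cols);

    // The decision (the branch) happens back on the CPU.
    return meanDiff > threshold;
}
```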
Now for some experiments that I did. This uses an NVIDIA 1080 Ti; that's the GPU I used. The numbers come from running the above algorithm on a one-minute MP4 file. If you use only the software decoder, that is, you decode on the CPU and then run the entire algorithm on the CPU, it takes about 18 seconds. But if you offload the decoding to the hardware and also execute your pixel computations on the GPU, it takes just two seconds. So the speed-up is quite high. Thank you. Any questions?

You used a pre-existing file to do these computations, if I understood. Is it also possible to do the same thing on a live stream, from a wired camera or something like that?

Yes, you can do it on a live stream from a camera.

So if I understood correctly, two seconds for the whole minute means live analysis should be no problem performance-wise?

Yes. I think the only catch is that you may not be able to use OpenCV's existing libraries to decode the frames using the NVIDIA hardware. I've not tried it, but you might need to write some wrappers or some code around that to use the NVIDIA hardware. It depends on the camera, probably. If you have one that puts out H.264, that should be OK, I think.

How many frames do you need to decide that the image has changed from one scene to the other?

Again, that's something you can experiment with and find out. But for the purpose of this experiment, I just compared two consecutive frames. The first frame on its own is, of course, not useful, but then the first and second, the second and third, and so on. Another question, please.

In the case where you use FFmpeg with CUDA bindings, do you leave the decoded frame in GPU memory? Or do you get the frame back to main memory and then have to upload it again to the GPU?

Very nice question. If you look at the second configuration, the software decoder with CUDA, that one does the decoding in main memory and then copies the frames to the GPU. So the second one is what you were asking about. And the last one does everything in GPU memory. Thank you, Ramit. Other questions?

Just looking at those stats, it seems like you get more benefit from switching from the software decoder to the hardware decoder than from your actual CUDA algorithm for scene change detection. You're actually missing a combination there: what if you had just done hardware decode without the CUDA computations? Would it go from 18 seconds to four seconds? Do you have any feel for what the division of labor is? Because it seems like your algorithm is pretty cheap no matter what, and the benefit you're showing is largely just the difference between software decoding and hardware decoding, which is obviously going to be faster.

That's possible. But if you compare the first two stats, you see there is a gain even if you only offload the pixel computations to the GPU. I've not tried the hardware decoder without CUDA; I think that would actually be more expensive, because you have to copy data back from the GPU to the CPU to perform the computations, so there would be some overhead involved there. If you're using the GPU for decoding frames, you might as well use it for the pixel computations and save that overhead.

Do you have any idea what the division of labor is, how much of the savings is because of the hardware decoder versus just the fact that you're not using the memory bandwidth going back and forth between GPU memory and main memory?

I think the second stat kind of explains that. There, you have to copy the decoded frames to the GPU, and that's why I think it takes some more time, simply because of that memory transfer overhead. Thank you.
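For reference, here is a minimal sketch of that second configuration, software decoding on the CPU followed by an explicit copy into GPU memory, again assuming OpenCV built with CUDA; the file name is a placeholder.

```cpp
// Minimal sketch of the "software decoder with CUDA" configuration.
// Assumes OpenCV built with CUDA. "input.mp4" is a placeholder file name.
#include <opencv2/core/cuda.hpp>
#include <opencv2/videoio.hpp>

int main()
{
    cv::VideoCapture cap("input.mp4");   // CPU (software) decode
    cv::Mat hostFrame;
    cv::cuda::GpuMat gpuFrame;

    while (cap.read(hostFrame)) {
        // Explicit host-to-device copy: this is the memory transfer overhead
        // discussed above. The hardware-decode path sketched earlier leaves
        // frames in GPU memory and avoids this step.
        gpuFrame.upload(hostFrame);
        // ...run the per-pixel computations on gpuFrame...
    }
    return 0;
}
```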
Yes, I'm coming. With your algorithm, you say you decide there is a transition, a scene change, if the difference between your pixel values is above a certain threshold. But that means you have a branch, and I only know a little about GPUs, but I think you're not supposed to have any branches if you want your code to actually run fast, right? How do you fix that?

You don't run the entire algorithm on the GPU. OpenCV allows you to offload only the pixel computations to the GPU. I'm just calculating the difference between two matrices, two image frames, on the GPU. I get the result back, and the decision making happens on the CPU. So the branching, as you said, is on the CPU. Does that make sense?

Yep. OK, thank you.

We still have plenty of time for more questions.

Sorry, to follow up: don't you get a delay of two seconds when you do this on a complete stream, because it takes two seconds to compute?

I'm sorry?

Don't you get a delay of two seconds between each computation, because you take two seconds to compute for two images?

Two seconds to compute for two images? No, it's for a one-minute file, so it's around 1,800 frames, 1,800 images.

All right, so two seconds is for 1,800 images. More questions?

I'm trying to wrap my head around how I would use this. I use OpenCV and TensorFlow to do similar things, using convolutional neural network models on images. I use OpenCV for the pixel computations, and when I want to use CUDA for the big stuff, I use TensorFlow GPU, because I only know Python. So I was wondering: it seems like this type of approach would cut a lot of my time if I knew how to do it in Python. This example that you showed, you did it in C, right?

Yes, C++.

I imagine if I knew C, I would do it in C. But do you know if there's a wrapper, a way I can make use of GPU computations with OpenCV in Python?

I don't think so. I tried, and I couldn't get it working. I would have loved to do it in Python, too, but I couldn't get it working. I don't think it's possible to do it in Python, but you're free to go ahead and give it a shot. Other questions? Thank you.

Thank you, Sam.