Okay. Welcome everyone to ELC 2022, and thank you for coming to this talk. Today I'll be talking about the V4L2 mem2mem (M2M) framework and how you can use it to write your driver if you have a video processing IP.

Let's start off with what V4L2 M2M is. It's a driver framework for memory-to-memory devices, which are devices that take data from memory, do some processing in hardware, and then write the processed data back into memory. This is slightly different from traditional V4L2 devices, which are either capture or output devices; a V4L2 M2M device typically has both an output node and a capture node. Another thing V4L2 M2M adds is support for multiple contexts: you can have multiple applications, or even a single application with multiple contexts, all sharing the same underlying hardware. That's technically possible for capture and output devices too, but the framework doesn't directly support it there; drivers have to implement it themselves.

Here are a few images showing what I just described. M2M means you have some data in DDR; the M2M device reads it, does some processing, and writes the result out to another piece of memory in DDR. Compare that with a traditional V4L2 output device, where you read data from memory and send it out to output hardware, or a capture device, where the capture hardware generates frames that get written back into memory. One thing to note, because it can be a bit confusing: V4L2's terminology is "output" and "capture", where output might look like input and capture might seem like output. These terms are used that way throughout this presentation and across the V4L2 documentation as well. The M2M device does both: it has an output side as well as a capture side.

So let's take a look at a typical V4L2 application workflow. You have the application layer and the V4L2 kernel API layer, and a traditional V4L2 driver interacts with the V4L2 core and the VB2 core. VB2 is videobuf2, the buffer manager for V4L2. I won't go into full detail, but very briefly: there's a setup stage where you set up formats for your device and allocate some buffers, and then you trigger a state change and streaming begins. That's when the hardware starts producing data, or, for an output device, software provides data to the hardware, which sends it out. You'll usually have queued some buffers into the stack beforehand, and as data becomes available you dequeue buffers back from the kernel to the application, and that cycle continues for as long as you're streaming. That's the typical workflow for any V4L2 device, output or capture. In M2M, you just do that twice: there's an output queue as well as a capture queue, and you keep cycling buffers through both, output buffers to the output queue and capture buffers to the capture queue. And there's an additional core component that you deal with, which is the V4L2 M2M core.
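To make that workflow concrete, here is a minimal sketch (my illustration, not code from the talk) of the classic single-queue sequence at the ioctl level, with error handling omitted; an M2M application simply repeats the same steps on a second queue of type V4L2_BUF_TYPE_VIDEO_OUTPUT, on the same file descriptor.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

int main(void)
{
	int fd = open("/dev/video0", O_RDWR);	/* device path is an example */
	int type = V4L2_BUF_TYPE_VIDEO_CAPTURE;

	/* 1. setup: negotiate the format */
	struct v4l2_format fmt = { .type = type };
	fmt.fmt.pix.width = 640;
	fmt.fmt.pix.height = 480;
	fmt.fmt.pix.pixelformat = V4L2_PIX_FMT_RGB24;
	ioctl(fd, VIDIOC_S_FMT, &fmt);

	/* 2. setup: allocate buffers in the kernel */
	struct v4l2_requestbuffers req = {
		.count = 4, .type = type, .memory = V4L2_MEMORY_MMAP,
	};
	ioctl(fd, VIDIOC_REQBUFS, &req);

	/* 3. state change: queue everything and start streaming */
	for (unsigned int i = 0; i < req.count; i++) {
		struct v4l2_buffer buf = {
			.index = i, .type = type, .memory = V4L2_MEMORY_MMAP,
		};
		ioctl(fd, VIDIOC_QBUF, &buf);
	}
	ioctl(fd, VIDIOC_STREAMON, &type);

	/* 4. steady state: dequeue processed buffers, requeue, repeat */
	for (;;) {
		struct v4l2_buffer buf = { .type = type, .memory = V4L2_MEMORY_MMAP };
		ioctl(fd, VIDIOC_DQBUF, &buf);	/* blocks until data is ready */
		/* ... consume the frame in buffer buf.index ... */
		ioctl(fd, VIDIOC_QBUF, &buf);
	}
}
```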
So the driver interacts not only with the V4L2 core and the VB2 core, but also with the V4L2 M2M core, which is basically a set of helpers that let you do all of this easily.

This slide shows the overall architecture of V4L2 M2M. I mentioned earlier that it's multi-context, and the way multi-context is implemented is very simple: every time you open the file descriptor, you get a context. That's something to watch out for as an application developer; you might open a file multiple times, but with M2M you have to remember that you'll get a new context each time you do so. This is what allows multiple applications, or a single application, to open multiple contexts, and each context can be thought of as a separate job for that particular hardware, with its own set of output and capture buffers.

The way it works is that the application maintains a set of buffers, allocated with the help of the driver, in its own address space. At some point, when it has finished all the format negotiation and is ready, it pushes those buffers to the kernel through the V4L2 API, and VB2 is the component that receives them. Once VB2 has the buffers, it starts making calls into the driver to queue them. Those calls are intercepted by a helper function in the V4L2 M2M core, and the buffers move into something known as a ready queue, shown here in the purple box. Each context has its own ready queues, one for output buffers and one for capture buffers. From there, the V4L2 M2M framework performs an evaluation that determines the context's eligibility to run. For a very simple device, you need at least one output buffer and one capture buffer to complete a job, but that's not always true: there are codecs where the hardware might take two input buffers and produce one output buffer, or vice versa. So eligibility can be determined via a driver callback. When the eligibility criteria are met, the context makes its way into a single job queue, and the framework hands contexts from the job queue to the driver one at a time. The driver takes the job, removes the buffers, gives them to the hardware, and does the processing. Once it's done, it makes a series of "finished" calls, which return the buffers to VB2 and remove the context from the job queue, and the framework is then ready to pick up the next context in the queue.

That's a lot of information, so to explain it better, I have an example. I created an example M2M scaler device; a video scaler is a good example of a video processing IP. It can scale up or scale down. It's a virtual device, emulated with QEMU, and the actual core scaling logic comes from an open source project, the stb_image_resize library. This effort uses Yocto, with Linux kernel 5.15, and builds a device driver for the virtual device. Finally, there's also an application which uses the libcamera C++ library, which has really good M2M helpers, and we'll demonstrate a simple downscaling example with all of this.
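Before getting into the example, here is a rough sketch of the three driver-side hooks the M2M core uses to drive the job queue just described. The scaler_* names are mine, and the two-input-buffer rule in job_ready() is a made-up illustration of a codec-like constraint (the talk's scaler needs just one buffer on each side); the v4l2_m2m_* helpers are the real kernel ones.

```c
#include <media/v4l2-mem2mem.h>

struct scaler_ctx {
	struct v4l2_fh fh;		/* fh.m2m_ctx is the per-open context */
	/* per-context formats, sequence counter, ... */
};

static void scaler_device_run(void *priv)
{
	/* hand the next ready source/destination buffers to the hardware;
	 * the full version is shown further down */
}

static int scaler_job_ready(void *priv)
{
	struct scaler_ctx *ctx = priv;

	/* hypothetical hardware needing two input buffers per output buffer */
	return v4l2_m2m_num_src_bufs_ready(ctx->fh.m2m_ctx) >= 2 &&
	       v4l2_m2m_num_dst_bufs_ready(ctx->fh.m2m_ctx) >= 1;
}

static void scaler_job_abort(void *priv)
{
	/* stop the hardware; the driver must still call v4l2_m2m_job_finish() */
}

static const struct v4l2_m2m_ops scaler_m2m_ops = {
	.device_run = scaler_device_run,
	.job_ready  = scaler_job_ready,
	.job_abort  = scaler_job_abort,
};
```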
Let's start off with the actual M2M device. Every piece of hardware has a datasheet, and there's one here as well, for the virtual device. It's virtual with respect to QEMU, but to the driver it looks real: it has a register map, it has an interrupt, and it can do DMA. That's what the datasheet explains. The way you configure it is through a couple of input and output configuration registers; that's where you program the input width, height, and stride, and the output width, height, and stride. Then there are two DMA registers. The input buffer DMA address tells the hardware the physical address to read the input buffer from, and the output buffer DMA address tells the hardware where to write the finished output to. And then there's one register, at offset 0x18, for control and status.

There's also a programming model described here. Typically you reset the IP, do all the configuration beforehand, setting up all the widths and heights and everything you need, and then you toggle the start bit. That causes the device to start processing. What happens in QEMU is that the device model hands the job to a thread, which calls stb and gets the image scaling done, after which an interrupt is issued. The interrupt status bits live in that same control/status register: while the QEMU device model is doing its work it keeps updating the status, and when it's completely done, the interrupt fires with a status of either done or done-with-error. You want done, meaning the job succeeded and the result can go back to the application.

Now let's take a look at the device driver built for this device. On the right side I have the device tree. Like I said, there's a single IRQ line, and there's a register resource range; it says 1000 here, but we don't actually have that many registers, it's just a round number, so don't read too much into it. And there's a compatible string which says this is a virtual device of type M2M scaler. On the left side we have a platform driver built for this, which matches on that same compatible string, and it has a probe and a remove method.

Let's look further into the driver. In the probe method, one thing you do is allocate a device-private structure out of kernel memory; that's shown over there. You get the platform resource for the MMIO region. In this driver we're also using regmap, because regmap is a really nice API; we use it to program the hardware, and it makes things much simpler. It particularly helps with the packed bitfields we have, like a single register carrying 16 bits each of width and height; with regmap fields, those become very easy to read. Then we acquire the IRQ line and install a handler. In this particular case a threaded handler has been installed, because we didn't want another deferred-work mechanism on top of the hard IRQ, and for this use case a threaded IRQ handler is a good fit. In the remaining part of the probe function, we call v4l2_device_register, we set the driver private data on both the video device and the platform device, and then there's a call to v4l2_m2m_init, to which we give the M2M ops. We'll review those in the next slide.
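Here are the probe essentials just described, as a rough sketch: struct scaler_dev and scaler_irq are my names and error paths are trimmed, but the devm_*, regmap, and V4L2 calls are the real kernel APIs.

```c
#include <linux/interrupt.h>
#include <linux/platform_device.h>
#include <linux/regmap.h>
#include <media/v4l2-device.h>
#include <media/v4l2-mem2mem.h>

struct scaler_dev {
	struct v4l2_device v4l2_dev;
	struct v4l2_m2m_dev *m2m_dev;
	struct regmap *regmap;
};

static irqreturn_t scaler_irq(int irq, void *data);	/* shown further down */

static const struct regmap_config scaler_regmap_cfg = {
	.reg_bits = 32,
	.val_bits = 32,
	.reg_stride = 4,
};

static int scaler_probe(struct platform_device *pdev)
{
	struct scaler_dev *dev;
	void __iomem *base;
	int irq, ret;

	/* device-private structure out of kernel memory */
	dev = devm_kzalloc(&pdev->dev, sizeof(*dev), GFP_KERNEL);
	if (!dev)
		return -ENOMEM;

	/* MMIO region from the platform resources, wrapped in a regmap */
	base = devm_platform_ioremap_resource(pdev, 0);
	if (IS_ERR(base))
		return PTR_ERR(base);
	dev->regmap = devm_regmap_init_mmio(&pdev->dev, base, &scaler_regmap_cfg);

	/* threaded handler: the completion work runs in process context */
	irq = platform_get_irq(pdev, 0);
	ret = devm_request_threaded_irq(&pdev->dev, irq, NULL, scaler_irq,
					IRQF_ONESHOT, dev_name(&pdev->dev), dev);
	if (ret)
		return ret;

	ret = v4l2_device_register(&pdev->dev, &dev->v4l2_dev);
	if (ret)
		return ret;

	/* hand the core our device_run/job_ready/job_abort ops */
	dev->m2m_dev = v4l2_m2m_init(&scaler_m2m_ops);
	platform_set_drvdata(pdev, dev);
	return PTR_ERR_OR_ZERO(dev->m2m_dev);
}
```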
We also have to call video_register_device. And at the end, we enable interrupts, because this driver requires interrupts to function; we cannot do polling here, it's not possible. Continuing with the probe method, we also have a media controller registration here. The media controller is supported by V4L2 M2M, and as far as my knowledge goes it's mostly there for the purpose of implementing the media request API. I don't want to go into the request API; that's out of scope for this presentation. But that's why there's media controller support. There may be more reasons, I'm just not aware of what they might be.

Then let's go through the other ops. There are file ops, because the driver must know when an application opens and releases the device node so that it can create and destroy a context. The rest of the calls, poll, unlocked_ioctl, and mmap, all point to helper functions. We have the scaler video device over here; this is the video_device data structure, and here we set up the caps and also a pointer to all the ioctl ops. And these are the core ops, the most important ops for this driver. There are actually three ops here, and only one has been implemented, which is device_run. There's another op to abort a job, called job_abort, and the third one is called job_ready, which is what I mentioned earlier: some hardware requires more than one output or capture buffer to do a job, and if that's true for your hardware, you must implement the job_ready callback. The media controller ops are just pointers to some helper functions, and they're needed for the request API, like I mentioned earlier.

Let's look at the fops open. This is where the context actually gets created. Every time an application opens the device, you create a context and store it in the file handle; that's what's shown here. The driver is fully in charge of this process, but it makes use of the V4L2 M2M helpers to make it a lot easier. There's an important step here, which is the call to v4l2_m2m_ctx_init, to which you pass the queue init function; it's a function pointer, and a couple of initializations are done there. The rest of it is boilerplate that all drivers would do, so I won't go into the details. The other thing done in open is that this particular driver caches the formats for both the output and the capture side, and those formats are stored inside the context data structure itself. In the event the application doesn't set anything, you should have some defaults to start off with, and that's what's being done here. You can also see the context data structure: it has a pointer to the parent device, the file handle (which is where the context is stored), the formats stored in an array, and also a sequence number, because as the driver progresses frame after frame, a sequence number needs to be incremented. This particular driver supports only one pixel format, the one shown here, RGB24, so that's what gets set up as the default format. In the release, you uninitialize and free the context; that's pretty standard, most drivers do this, again mostly boilerplate code.
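Here is a minimal sketch of that open path, assuming the context layout from before; scaler_set_default_format is a hypothetical stand-in for the driver's default-format caching, while the v4l2_fh_* and v4l2_m2m_ctx_init calls are the real helpers.

```c
#include <linux/slab.h>
#include <media/v4l2-dev.h>
#include <media/v4l2-fh.h>
#include <media/v4l2-mem2mem.h>

static int scaler_open(struct file *file)
{
	struct scaler_dev *dev = video_drvdata(file);
	struct scaler_ctx *ctx;
	int ret;

	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
	if (!ctx)
		return -ENOMEM;

	v4l2_fh_init(&ctx->fh, video_devdata(file));
	file->private_data = &ctx->fh;		/* context lives in the file handle */
	ctx->dev = dev;

	/* sane defaults in case the application never calls S_FMT */
	scaler_set_default_format(ctx);		/* hypothetical helper */

	/* creates the per-context src/dst queues via our queue init callback */
	ctx->fh.m2m_ctx = v4l2_m2m_ctx_init(dev->m2m_dev, ctx, &scaler_queue_init);
	if (IS_ERR(ctx->fh.m2m_ctx)) {
		ret = PTR_ERR(ctx->fh.m2m_ctx);
		v4l2_fh_exit(&ctx->fh);
		kfree(ctx);
		return ret;
	}

	v4l2_fh_add(&ctx->fh);
	return 0;
}
```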
Now, the queue init. We spoke about this in one of the previous slides: it's done as part of creating the context, in fops open, right after you create the context you have to initialize it, and this is where you do most of the VB2-related setup. There's a source video queue and a destination video queue: the source is for the output queue and the destination is for the capture queue. Here you set up and declare what kind of device you have, so the type is output on the source side and capture on the destination side. For the I/O modes we say MMAP and DMABUF; these are the most popular ones. You set the driver private pointer, and since we don't need a special buffer structure, we can just go with the standard M2M buffer size here. Then there are a few queue ops; these are the VB2 queue operations. And this one is important, because it depends on your hardware: this particular emulated device can only do physically contiguous memory, so we use vb2_dma_contig_memops. Again, that comes from the VB2 layer; they have helpers for that. Finally, you make a call into vb2_queue_init once you've populated these data structures.

Right before VB2 starts allocating buffers, it makes a call into the driver known as queue_setup. What happens here is that, as a driver developer, you need to tell it how many planes there are in the buffer and what the size of each plane is. That's what this driver is doing; more could be done, but this driver does only the minimal thing here. The buf_prepare callback is again issued by VB2, before queuing a buffer; we do only a single step here, which is to set the plane payload size for every buffer, so that's an initialization that happens there. And there's buf_queue: this is the VB2 layer calling into the driver asking it to queue the buffer. An M2M driver just calls the M2M helper function, which queues the buffer into the M2M layer so it goes onto the ready queue; that's the only step done here. There's a start_streaming op where we don't really need to do anything, but we're maintaining a sequence number, so that gets reset at stream start so you get incrementing numbers once streaming begins. In stop_streaming, we drain the queues and return all the buffers back to the VB2 layer.
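A sketch of the queue init callback and the buf_queue op just described; scaler_qops is assumed to carry the queue_setup/buf_prepare/buf_queue/streaming callbacks, and the fields and helpers here are the real VB2 and M2M ones.

```c
#include <media/v4l2-mem2mem.h>
#include <media/videobuf2-dma-contig.h>
#include <media/videobuf2-v4l2.h>

static int scaler_queue_init(void *priv, struct vb2_queue *src_vq,
			     struct vb2_queue *dst_vq)
{
	struct scaler_ctx *ctx = priv;
	int ret;

	src_vq->type = V4L2_BUF_TYPE_VIDEO_OUTPUT;	/* data into the hardware */
	src_vq->io_modes = VB2_MMAP | VB2_DMABUF;
	src_vq->drv_priv = ctx;
	src_vq->buf_struct_size = sizeof(struct v4l2_m2m_buffer);
	src_vq->ops = &scaler_qops;			/* queue_setup, buf_queue, ... */
	src_vq->mem_ops = &vb2_dma_contig_memops;	/* hw needs contiguous memory */
	src_vq->timestamp_flags = V4L2_BUF_FLAG_TIMESTAMP_COPY;
	src_vq->dev = ctx->dev->v4l2_dev.dev;

	ret = vb2_queue_init(src_vq);
	if (ret)
		return ret;

	dst_vq->type = V4L2_BUF_TYPE_VIDEO_CAPTURE;	/* data out of the hardware */
	dst_vq->io_modes = VB2_MMAP | VB2_DMABUF;
	dst_vq->drv_priv = ctx;
	dst_vq->buf_struct_size = sizeof(struct v4l2_m2m_buffer);
	dst_vq->ops = &scaler_qops;
	dst_vq->mem_ops = &vb2_dma_contig_memops;
	dst_vq->timestamp_flags = V4L2_BUF_FLAG_TIMESTAMP_COPY;
	dst_vq->dev = ctx->dev->v4l2_dev.dev;

	return vb2_queue_init(dst_vq);
}

static void scaler_buf_queue(struct vb2_buffer *vb)
{
	struct scaler_ctx *ctx = vb2_get_drv_priv(vb->vb2_queue);

	/* hand the buffer to the m2m core: it lands on the ready queue */
	v4l2_m2m_buf_queue(ctx->fh.m2m_ctx, to_vb2_v4l2_buffer(vb));
}
```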
So here are the ioctl ops. Every V4L2 device has a set of v4l2_ioctl_ops, and that's what's shown over here. The most important ones here are the format-related ones; that's all this driver implements, but depending on your driver you might have more. You could even have controls and other things, but this driver is very simple and doesn't do that. So we have enum_fmt, g_fmt, try_fmt, and s_fmt. Because this particular device has the same capability on both the output and the capture side, the same set of ioctl handlers is used for both, and since they operate on their queue-specific data structures that we saw earlier, that works.

Let's take a very quick look at try_fmt. It's supposed to validate any format and make corrections where possible, so it's basically doing a bounds check against the minimum and maximum resolution the driver can support, and if the request exceeds that, it tries to fix it up. Any other device-specific constraint would go here too, but because this device relies on the stb library, it's very flexible, so there's not much beyond the maximum size, and even that is artificially introduced just to avoid a very large buffer. The enum_fmt basically reports that there's only one supported format. The g_fmt returns whatever is currently set, and s_fmt essentially calls try_fmt, fixes up the format, and then sets it.

The most important callback here is device_run. This is where the job actually arrives: a set of buffers comes in and is handed over to the actual driver. What happens here is that you go through the programming sequence we saw in the datasheet. We reset the scaler; we configure the input width, height, and stride and the output width, height, and stride by programming those registers; for the input and output DMA addresses, we get the physical addresses through the VB2 API; and finally we start the processing by hitting that start bit, which is shown over here.

Once the job is complete, like we mentioned earlier, the device issues an interrupt, and we have the threaded IRQ handler that we saw earlier. The first step there is to read the status bits from the status and control register and determine whether the job was successful, and the VB2 buffer state is set accordingly. Once this happens, this is when we actually remove the buffers from the ready queues, because they have now been processed. So we take them out, update the sequence number, and return the buffers back to the VB2 layer. And once that's done, we call v4l2_m2m_job_finish, which removes the context from the job queue so the next eligible context is ready to run.
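Here is the run-and-complete path as a sketch. The register offsets and bit positions are my reconstruction of the virtual scaler's datasheet (assuming 32-bit registers laid out back to back), and the fmt[] and sequence fields on the context are assumed names for the cached formats and frame counter mentioned earlier; the v4l2_m2m and VB2 calls are the real ones.

```c
#include <linux/bits.h>
#include <linux/regmap.h>
#include <media/v4l2-mem2mem.h>
#include <media/videobuf2-dma-contig.h>

/* Reconstructed register map; offsets and bits are assumptions. */
#define SCALER_REG_IN_SIZE	0x00	/* height in high 16 bits, width in low */
#define SCALER_REG_IN_STRIDE	0x04
#define SCALER_REG_OUT_SIZE	0x08
#define SCALER_REG_OUT_STRIDE	0x0c
#define SCALER_REG_IN_DMA	0x10	/* physical address of the input frame */
#define SCALER_REG_OUT_DMA	0x14	/* physical address of the output frame */
#define SCALER_REG_CTRL_STATUS	0x18

#define SCALER_CTRL_RESET	BIT(0)	/* assumed bit positions */
#define SCALER_CTRL_START	BIT(1)
#define SCALER_STATUS_ERROR	BIT(2)

static void scaler_device_run(void *priv)
{
	struct scaler_ctx *ctx = priv;
	struct scaler_dev *dev = ctx->dev;
	struct vb2_v4l2_buffer *src = v4l2_m2m_next_src_buf(ctx->fh.m2m_ctx);
	struct vb2_v4l2_buffer *dst = v4l2_m2m_next_dst_buf(ctx->fh.m2m_ctx);
	struct v4l2_pix_format *in = &ctx->fmt[0], *out = &ctx->fmt[1];

	/* programming sequence from the datasheet: reset, configure, start */
	regmap_write(dev->regmap, SCALER_REG_CTRL_STATUS, SCALER_CTRL_RESET);
	regmap_write(dev->regmap, SCALER_REG_IN_SIZE, in->height << 16 | in->width);
	regmap_write(dev->regmap, SCALER_REG_IN_STRIDE, in->bytesperline);
	regmap_write(dev->regmap, SCALER_REG_OUT_SIZE, out->height << 16 | out->width);
	regmap_write(dev->regmap, SCALER_REG_OUT_STRIDE, out->bytesperline);

	/* DMA addresses come straight from the VB2 contig allocator */
	regmap_write(dev->regmap, SCALER_REG_IN_DMA,
		     lower_32_bits(vb2_dma_contig_plane_dma_addr(&src->vb2_buf, 0)));
	regmap_write(dev->regmap, SCALER_REG_OUT_DMA,
		     lower_32_bits(vb2_dma_contig_plane_dma_addr(&dst->vb2_buf, 0)));

	/* kick the job; completion arrives as the interrupt */
	regmap_write(dev->regmap, SCALER_REG_CTRL_STATUS, SCALER_CTRL_START);
}

static irqreturn_t scaler_irq(int irq, void *data)
{
	struct scaler_dev *dev = data;
	struct scaler_ctx *ctx = v4l2_m2m_get_curr_priv(dev->m2m_dev);
	struct vb2_v4l2_buffer *src, *dst;
	enum vb2_buffer_state state;
	unsigned int status;

	/* did the job finish cleanly or with an error? */
	regmap_read(dev->regmap, SCALER_REG_CTRL_STATUS, &status);
	state = (status & SCALER_STATUS_ERROR) ? VB2_BUF_STATE_ERROR
					       : VB2_BUF_STATE_DONE;

	/* the job is processed, so pull the buffers off the ready queues */
	src = v4l2_m2m_src_buf_remove(ctx->fh.m2m_ctx);
	dst = v4l2_m2m_dst_buf_remove(ctx->fh.m2m_ctx);
	dst->sequence = ctx->sequence++;

	/* return them to VB2, and through it to the application */
	v4l2_m2m_buf_done(src, state);
	v4l2_m2m_buf_done(dst, state);

	/* free the job slot so the next eligible context can run */
	v4l2_m2m_job_finish(dev->m2m_dev, ctx->fh.m2m_ctx);
	return IRQ_HANDLED;
}
```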
So let's take a look at the application. For the test application, I used the libcamera API. One of the reasons is that it has a really good C++ API, even for mem2mem devices. It could have been done using other libraries, but I think the code looks very neat, clean, and concise when done with libcamera, and I was also familiar with it at that point, so I decided to just use libcamera here. In it, you see a very simple class for the M2M scaler, and it has two public APIs, one being the constructor: when you construct the object, you give it an input size and an output size, which is all that's required, since there's only one format, so all the configuration you do here is the sizes. And then there's a run API which does the job; we'll look at the details of that later. Looking at the private data members: there's a device enumerator, and there's also a media device and an M2M device; these are all abstractions provided by libcamera. You have vectors of capture buffers and output buffers. There are some integers here that count frames; those are just to determine when to stop the application, since at some point it has to stop streaming. There are cached size variables, which we take from the constructor and store to use later in the run API. And this application also tries to verify that the buffer contents are correct by comparing them against a test vector, hence the need to memory-map those buffers, so there are some data structures used for the memory mapping.

This is what the application looks like in total. You configure an input size, which is 640 by 480, and an output size of 320 by 240; those are the two sizes. Then you construct this M2M scaler object, give it the input and output sizes, and finally you just run. There's an assert to make sure the run was successful.

Let's take a look at the constructor itself. In the constructor we reset the capture-frame and output-frame counters to zero, as shown here. We take the size arguments for input and output and cache them locally in the class variables input_size_ and output_size_. Then we create a device enumerator object and start enumerating all the devices, and after enumeration, we search using a device match; this is again all libcamera. Because we added media controller support to our driver, and the media controller has a discovery API, that's what's being used here to find the device. There are other ways to do it, maybe through sysfs or the V4L2 nodes, but this is a better way, because libcamera already has good support for finding your device through it. That's probably the main reason this driver registers with the media controller: not the request API or anything else, but mostly so this application can find it. You call into the enumerator's search API, give it the device match object, and if it's found, well and good; otherwise we assert.

Let's look at the run function. Over here we're using a buffer count of one, but you could use as many as you want. The driver also has a way to specify how many buffers it minimally requires; I don't think there's a way for it to report a maximum, but it can specify a minimum count. Now we grab hold of the media entity by its name. You can actually see this name if you run media-ctl -p against your media device; it lists the entity name, and that's the name you put here. That lets you find the media entity, and the entity has a device node, so you use that to get a pointer to the M2M scaler. The M2M scaler has a capture node and an output node, so you get those two video devices out of that data structure. The next thing is to set up the formats. Like I said, this is a very simple device with only a single format, so what we do is a get-format on each side, take whatever it returns, modify the size to the values we cached earlier in the constructor, and then call set-format on the capture side as well as the output side. That's what's happening over here.
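The app does this format negotiation through libcamera's C++ helpers; as a rough equivalent (my sketch, not the talk's code), here is what it boils down to at the raw V4L2 ioctl level, on an M2M device. The device path and sizes are examples and most error handling is omitted.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

static int set_m2m_formats(int fd)
{
	struct v4l2_format fmt = { .type = V4L2_BUF_TYPE_VIDEO_OUTPUT };

	/* output (into the hardware) side: the full-size source image */
	ioctl(fd, VIDIOC_G_FMT, &fmt);		/* start from the driver default */
	fmt.fmt.pix.width = 640;
	fmt.fmt.pix.height = 480;
	if (ioctl(fd, VIDIOC_S_FMT, &fmt) < 0)
		return -1;

	/* capture (out of the hardware) side: the downscaled result */
	fmt.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
	ioctl(fd, VIDIOC_G_FMT, &fmt);
	fmt.fmt.pix.width = 320;
	fmt.fmt.pix.height = 240;
	return ioctl(fd, VIDIOC_S_FMT, &fmt);
}
```

Both queues live behind the one file descriptor; only the buffer type distinguishes them.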
The next thing is to set up the callbacks. libcamera uses something like a Qt-style signal/slot mechanism, so what we install here is basically a slot for each signal: there's a capture-buffer-received callback and an output-buffer-completed callback, and those are the two we hook up. Then we queue all the buffers, starting with the capture side, and there's an additional thing we do here: we set a buffer cookie on each one, so that later on, in the callback, we can use the cookie to look up the memory-mapped address of that particular buffer; I'll come back to that. And then over here we map those buffers, again using a libcamera helper class that does the mapping and gives you a span object; those span objects are then kept in their own vector. We do the same thing with the output buffers and make sure everything is memory-mapped as well. The additional thing on the output side is that we feed it our test vector. For the test vector I used Big Buck Bunny, the open source movie: I took an image from it, converted it into a raw RGB file, and turned that into a header using xxd, and that array is what gets copied here. So that's your known input test vector: after memory-mapping, you memcpy the known input into the buffer, and right after that, you queue the buffer.

After that there are two things to do: you need to stream on the capture side as well as the output side. Both need to be done; only after that will your application start to work, because the driver's job eligibility isn't met until both sides are streaming. For the rest, the application installs what is essentially an event loop: there's a timer that evaluates whether it's time to exit, and otherwise you just call libcamera's event loop, which monitors the fds on both the output and the capture side, and that's how the callback mechanism works. Finally, there's an exit criterion: as soon as we've captured four frames, we break out, and once that happens we stream off both the capture and the output side.

Let's take a look at the callbacks themselves. The output buffer callback is pretty simple: we take whatever buffer comes in and queue it right back. There's nothing more to do here; we already copied in the known input test vector, so it simply needs to be requeued. In the capture buffer callback, we do an additional test: we check whether the downscaled output the hardware generated matches our known reference, with a memcmp, and there's an assert added just to make sure it all works fine. After that, the buffer is queued back into the capture queue.

So yeah, if anybody is interested in reproducing this setup, all the code is on GitHub. I used Yocto for building; the specific Yocto distro is known as the Yoe distro, and I have a fork of it with a branch called ELC 2022. If you run through that, you should be able to reproduce these results. And those are the references: the kernel V4L2 documentation, the scaler datasheet, the scaler driver itself, and the test application. And if anybody's interested in how the device was emulated in QEMU, there are links to all of it over here as well.

Okay, so we have nine minutes. Any questions from the audience? Okay, let me repeat the question: he's asking whether GStreamer could do some of these steps instead. I'm not a big expert on GStreamer, but the way I understand it, it's a plugin architecture, so it might have software-based plugins that can do scaling, but it can also sit on top of a driver framework like this one. So that also could be done.
So yeah, you could use GStreamer, I believe; yes, it's possible. I have not tried it, though.

Okay, let me repeat the question. He's asking: this demo uses a two-to-one scaling ratio, but if you used a fractional ratio like 1.5 or 1.6, would there be performance implications or quality degradations, things like that? Those are valid questions, and they would apply to a real hardware scaler, where those things might have implications. But this is an emulated device: to the driver and the application it looks like true hardware, but under the hood QEMU makes a call into the stb_image_resize library, which is software. I did not write that library, I just pulled it in, but I don't think it has the restrictions you're mentioning; those are typically applicable to a real hardware scaler.

Any other questions? Performance data, I do not have, actually, because this is not real hardware, I emulated it on QEMU. But I have some experience working with this on real hardware, and most of the latency comes from the hardware itself. The V4L2 layer does have a few latencies, but those are very minimal; it's a pretty efficient framework, actually. So if you have a video IP, there are a lot of good features here, like the stability of V4L2, and you also get the multi-context support, which is pretty hard to develop by yourself. When I looked at it, I thought this is really good if you have an IP that meets these criteria, so it's probably a very apt choice.

Any other questions? We have five more minutes if there are any. Or is there anything I should go over once again in the slides, since we have five minutes? Something that's not clear? In the application, or in the driver? Okay, going back. Oh yeah, that's right; you're right, actually. As it stands, it's effectively a memcpy if you don't change the input and output sizes, because they start out the same. You're right about that. But this is just so that a knowledgeable application writer can go in and change them; you have to provide some defaults just in case, so it doesn't crash. So if you were to run it as-is, it would be like a giant memcpy with a lot of framework calls that aren't strictly necessary: it would still call the stb resize function and resize to the same size, which is essentially not required, but yeah.

No, I do not have a demo; that would have been cool, though. Cool, thanks, thanks everyone. If there are no more questions, I guess we can probably end four minutes early. Thanks everyone.