Hi, everyone. Thanks for attending this talk. We're going to talk today about the work we did on the CHIP to support the camera interface, and more generally about how V4L2 works for camera interfaces, as applied to the CHIP. So I'm Maxime, I'm an embedded Linux engineer. I've been at Free Electrons for a bit more than five years now, and part of my work involves working on various embedded components: Linux, obviously, but also Buildroot, U-Boot, Barebox, and so on. I'm also the maintainer of the Allwinner SoCs, and since the CHIP uses one of the Allwinner SoCs, I've been working on it. And I live in France.

Maybe you've not heard about the CHIP. It's basically an SBC, a single-board computer, sold for $9, that was introduced last year through a Kickstarter campaign. It's based on an Allwinner SoC, the R8, which is equivalent to the A13 that came out about four years ago. In terms of horsepower, it's roughly equivalent to a BeagleBone Black: pretty much the same generation of CPU, GPU, and so on. It has a 1 GHz Cortex-A8 CPU, a Mali GPU, a lot of GPIOs to do whatever you want with, and Wi-Fi as well. And it's running some kind of mainline kernel: so far we have a branch based on 4.4, and we plan on doing a 4.9 branch very soon, which is going to be quite easy anyway because we have upstreamed most of the work.

The CHIP work started a year ago, but we started working on mainline support for the Allwinner SoCs about four years ago; it was at ELC in Barcelona, actually, so 2012. So a significant part of the work was already done and found in the mainline kernel. But since we were all hobbyists at the time, basically everything that was hard to do had been left out: the NAND support in particular, and the display and GPU support, which were non-existent. All the other nice things to have on a general-purpose SBC, like audio and camera support, were also completely left out, because it was a significant effort and no one really wants to tackle that on their weekends and evenings. And we obviously also had to tackle board-specific developments. For example, we have a very unusual setup when it comes to the regulators for the Wi-Fi chip, and there are add-on boards you can plug into the CHIP that use an auto-discovery mechanism — the boards themselves are called DIPs — so we had to handle the DIP support in the kernel as well.

Part of that work was to support the camera. There's another talk from my colleague Boris this afternoon about the NAND support, if you're interested; it's actually very interesting. So part of the work was to work on the camera, and camera support in Linux has to go through the V4L framework; that's what we are going to talk about today. Video capture in Linux is handled by a framework called Video4Linux 2, most of the time just referred to as V4L2. It's quite old, or at least it has existed for a very long time: it was first introduced in 2002. And it supports a wide range of devices, from video capture devices like cameras, which is what we're going to look at, to hardware decoders and encoders, software-defined radios, radio receivers, and so on.
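As a rough illustration of what talking to V4L2 from user space looks like, here is a minimal sketch (not from the talk) that opens a capture device and queries its capabilities; the /dev/video0 path is just an assumed example.

```c
/* Minimal sketch (not from the talk): open a V4L2 capture device and
 * query its capabilities. The device path is an assumption. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

int main(void)
{
    int fd = open("/dev/video0", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct v4l2_capability cap;
    memset(&cap, 0, sizeof(cap));
    if (ioctl(fd, VIDIOC_QUERYCAP, &cap) < 0) {
        perror("VIDIOC_QUERYCAP");
        close(fd);
        return 1;
    }

    printf("driver: %s, card: %s\n", cap.driver, cap.card);
    if (cap.capabilities & V4L2_CAP_VIDEO_CAPTURE)
        printf("device supports single-planar video capture\n");
    if (cap.capabilities & V4L2_CAP_STREAMING)
        printf("device supports the streaming I/O API\n");

    close(fd);
    return 0;
}
```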
So it's actually very ubiquitous. It can support scalers as well; basically everything that works on video in Linux goes through V4L. In the case of cameras, a very simple setup — and it's pretty much the one we have on the SoC used on the CHIP — is that you basically have two components. One is the camera sensor itself: the actual component that captures frames and sends them over some kind of bus. On the other side, on the SoC, you have a controller that reads that bus and transfers the data to a memory buffer. And that's it. Obviously, there's a bit more to it. For example, you will need to tune some camera parameters to change contrast, brightness, that kind of stuff, maybe focus as well. So in addition to that bus, you usually have an I2C bus to control whatever the camera is sending on the data bus.

When it comes to cameras, the very first thing you have to do is negotiate some kind of format: are you going to use a plain RGB format, or some YUV variant, or something like that? Both sides, the controller and the camera, have to agree. If there's a mismatch, in the best case the image will look a bit weird; in the worst case you will not see anything, and your sensor apparently doesn't work at all. And there's a very wide range of video formats, with weird variations of them. For someone who had basically never done video before starting this project, this looks like a giant mess. Just speaking of YUV: you have three components, and those components can have different bit widths, which you can expect, but the order of the components can also change, or they can be put in different buffers, or in the same buffer but at different offsets. It was completely weird to me at first, and I'm still confused about it. And obviously, most of the time the sets of formats that the controller and the sensor support don't match exactly, so at some point you have to negotiate which format you can actually use across your sensor-and-controller pipeline. At least for this particular setup, that's usually done in the drivers. So the first thing you need is to negotiate: see which formats the sensor supports, which formats the controller supports, give that list to user space, and let user space decide what it wants to use.

After the format negotiation — and obviously at some point user space will also set the format that will actually be used for the capture stream — you also have to implement some streaming hooks. These basically just start and stop the capture, but they also deal with memory allocation: allocating the buffers, queuing them, dequeuing them once the capture has happened, and so on. Together with the formats, these are basically the only operations really needed to get started, because then you can set a format, queue a buffer into your controller, which will fill it with an image coming from the camera, and dequeue it once the capture has completed.
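As an illustration of that negotiation seen from the user-space side, here is a hedged sketch of enumerating the formats a driver exposes and then requesting one of them; the 640x480 YUYV choice and the already-open file descriptor are assumptions made only for the example.

```c
/* Hypothetical sketch: enumerate the pixel formats a capture device
 * advertises, then ask for one of them. 'fd' is an already-open
 * /dev/videoX descriptor. */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

static void negotiate_format(int fd)
{
    struct v4l2_fmtdesc desc;
    memset(&desc, 0, sizeof(desc));
    desc.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;

    /* List every pixel format the driver exposes to user space. */
    while (ioctl(fd, VIDIOC_ENUM_FMT, &desc) == 0) {
        printf("format %u: %s\n", desc.index, desc.description);
        desc.index++;
    }

    /* Ask for a specific format; the driver may adjust fields it cannot
     * satisfy, so the structure returned by S_FMT must be checked. */
    struct v4l2_format fmt;
    memset(&fmt, 0, sizeof(fmt));
    fmt.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    fmt.fmt.pix.width = 640;
    fmt.fmt.pix.height = 480;
    fmt.fmt.pix.pixelformat = V4L2_PIX_FMT_YUYV;
    fmt.fmt.pix.field = V4L2_FIELD_NONE;

    if (ioctl(fd, VIDIOC_S_FMT, &fmt) < 0)
        perror("VIDIOC_S_FMT");
    else
        printf("using %ux%u, bytesperline %u\n",
               fmt.fmt.pix.width, fmt.fmt.pix.height,
               fmt.fmt.pix.bytesperline);
}
```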
The nice thing about the streaming mode is that you can have different backends for it: different sources of buffers, used in different ways, and it's completely transparent to your driver. You basically have three sources. The first one is a buffer provided by the user: the buffer has been allocated in user space, and user space gives it to your driver to fill. That's not very widely used, to be honest, because it means your controller has to be able to do scatter-gather DMA to deal with the non-physically-contiguous buffers that are usually allocated in user space. The buffer might also come from another device in the system and be shared with the controller through a mechanism called DMA-BUF, which is basically a mechanism to share buffers across different devices. In that case the allocation doesn't happen in your controller driver, nor even in V4L at all; it comes from an external source — DRM, for example, if you just want to display the captured image directly and put it into your display pipeline without any copy. Or the buffer can be allocated by the driver itself directly, which is probably the most intuitive way of operating; it basically just works like any other driver usually found in Linux.

For all of these streaming operations, because they are actually quite hard to get right, there is a generic implementation called videobuf2, which is very nice: it does most of the work and relies on a few very simple callbacks to implement in your driver. In your driver you can choose between different allocation methods, depending on your actual hardware and what it can do. You can choose buffers backed by the vmalloc allocator — I'm not sure that's very useful for devices that do DMA, unless they can do scatter-gather as well — then you have scatter-gather DMA, or you can ask for physically contiguous DMA buffers if your device is not able to do any kind of scatter-gather and can only fill a contiguous region of memory. And the different streaming modes are handled there as well. So it's very flexible, and the only thing you need to do in the callbacks is basically tell it the size and the number of buffers to allocate for each frame. For example, for a given format at a given resolution, what is the size of my buffer? Is there some extra padding? And then how to insert new buffers into your queue, and how to start and stop the capture. So it's very convenient.
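To give an idea of how few callbacks videobuf2 actually asks for, here is a minimal kernel-side sketch for a hypothetical driver called mydrv; the private structure, the sizeimage field and the mostly empty hook bodies are assumptions for illustration, not the driver discussed in the talk, and the exact queue_setup signature varies a little between kernel versions.

```c
/* Minimal sketch of the videobuf2 callbacks (hypothetical "mydrv" driver,
 * queue_setup signature as of roughly 4.8+ kernels). */
#include <linux/list.h>
#include <linux/spinlock.h>
#include <media/videobuf2-v4l2.h>

struct mydrv {                    /* hypothetical per-device state */
	struct vb2_queue queue;
	struct list_head buf_list; /* buffers queued by vb2, waiting for DMA */
	spinlock_t lock;
	unsigned int sizeimage;    /* computed from the negotiated format */
};

/* Tell vb2 how many planes to allocate per buffer and how big each one is. */
static int mydrv_queue_setup(struct vb2_queue *vq, unsigned int *nbuffers,
			     unsigned int *nplanes, unsigned int sizes[],
			     struct device *alloc_devs[])
{
	struct mydrv *priv = vb2_get_drv_priv(vq);

	*nplanes = 1;
	sizes[0] = priv->sizeimage;
	return 0;
}

/* Called for every buffer user space queues; hand it over to the hardware. */
static void mydrv_buf_queue(struct vb2_buffer *vb)
{
	struct mydrv *priv = vb2_get_drv_priv(vb->vb2_queue);
	unsigned long flags;

	spin_lock_irqsave(&priv->lock, flags);
	/* A real driver would add the buffer to a list consumed by the
	 * interrupt handler, which calls vb2_buffer_done() on completion. */
	spin_unlock_irqrestore(&priv->lock, flags);
}

static int mydrv_start_streaming(struct vb2_queue *vq, unsigned int count)
{
	/* Program the capture controller and enable its interrupt. */
	return 0;
}

static void mydrv_stop_streaming(struct vb2_queue *vq)
{
	/* Stop the DMA and return pending buffers with VB2_BUF_STATE_ERROR. */
}

static const struct vb2_ops mydrv_vb2_ops = {
	.queue_setup		= mydrv_queue_setup,
	.buf_queue		= mydrv_buf_queue,
	.start_streaming	= mydrv_start_streaming,
	.stop_streaming		= mydrv_stop_streaming,
};
```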
For actual devices, you might need more than just telling which format you want to use and letting the hardware capture some video. Most of the time there are extra controls, in either your controller or the sensor itself, that you need to expose to user space: things like brightness, white balance, saturation, and so on. By default there are basically no controls implemented, so the driver needs to declare them during probe. A lot of them are standard, but you still need to declare that you actually support those standard controls, and then you have a dedicated callback that will be called whenever someone sets one of those controls to a new value. In that particular setup it was also kind of confusing to me, because you will sometimes have to expose the controls of the camera too, since there's basically just a single video node, a single video device, in /dev. You still need to be able to tune the sensor, so you basically have to forward the controls to it.

If you look at the same pipeline as before from a driver point of view, you have different drivers for the controller and the camera, which are probed at different times and are exposed to user space through a single device, /dev/video followed by a number. The two drivers are completely independent of each other; they're not linked in any way except through the V4L framework. So you need some kind of synchronization point where you know that both the controller driver and the camera driver have been loaded, are ready to operate, and can be linked together. That's done through a framework called v4l2-async, which is very similar to other frameworks in Linux: ASoC has one that does pretty much the same thing, tying what ASoC calls a DAI — basically the thing that streams audio to an external codec — to the codec that also has controls, with basically the same needs; and DRM has a generic implementation, used only by DRM I think, called the component framework, which does much the same thing. Basically it's a two-stage probe: in the usual driver probe you just register with v4l2-async and wait for your camera or controller driver to show up, and then you do the actual setup of your hardware in a second-stage probe, which is called only once the two are linked together. That's also where you do the format negotiation and so on.

If you really look at what is happening in the system when you capture something, it comes down to a few steps. First, you set the format and the controls, to prepare your capture and set it up exactly the way you want it to be. Then the videobuf generic implementation allocates the buffers and gives them back to user space, and user space starts queuing buffers so that the capture can actually start. You start the capture with a bunch of buffers already queued, so that you don't get any gaps or underruns. Then — maybe not on all hardware, but on almost all of it I think — each time a new frame has been captured, you get an interrupt, and in the interrupt handler you just do the buffer flip: you take a new buffer from your queue, set it up as the buffer to be used for the next frame, dequeue the one that was just completed, and give it back to user space. The whole queuing and dequeuing machinery is, once again, something that has been implemented in a generic way, because it's actually difficult to get right. At some point you can stop streaming if you want to; otherwise your capture just goes on, looping through the interrupts, the buffer flips, and so on.
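That sequence maps fairly directly onto the user-space ioctls; below is an illustrative sketch (again not from the talk) using the memory-mapped streaming mode, where the buffer count, the number of frames and the minimal error handling are all arbitrary choices.

```c
/* Illustrative sketch of the capture sequence: request buffers, map and
 * queue them, start streaming, then dequeue/requeue in a loop. */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/videodev2.h>

#define NBUF 4

static int capture_some_frames(int fd)
{
    struct v4l2_requestbuffers req = {
        .count = NBUF,
        .type = V4L2_BUF_TYPE_VIDEO_CAPTURE,
        .memory = V4L2_MEMORY_MMAP,
    };
    void *maps[NBUF]; /* pixel data can be read from here after DQBUF */

    if (ioctl(fd, VIDIOC_REQBUFS, &req) < 0)
        return -1;

    /* Map every driver-allocated buffer and queue it before streaming. */
    for (unsigned int i = 0; i < req.count; i++) {
        struct v4l2_buffer buf = {
            .type = V4L2_BUF_TYPE_VIDEO_CAPTURE,
            .memory = V4L2_MEMORY_MMAP,
            .index = i,
        };
        if (ioctl(fd, VIDIOC_QUERYBUF, &buf) < 0)
            return -1;
        maps[i] = mmap(NULL, buf.length, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, buf.m.offset);
        if (maps[i] == MAP_FAILED)
            return -1;
        if (ioctl(fd, VIDIOC_QBUF, &buf) < 0)
            return -1;
    }

    enum v4l2_buf_type type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    if (ioctl(fd, VIDIOC_STREAMON, &type) < 0)
        return -1;

    /* Dequeue a filled buffer, "use" it, then hand it back to the driver. */
    for (int frame = 0; frame < 100; frame++) {
        struct v4l2_buffer buf = {
            .type = V4L2_BUF_TYPE_VIDEO_CAPTURE,
            .memory = V4L2_MEMORY_MMAP,
        };
        if (ioctl(fd, VIDIOC_DQBUF, &buf) < 0)
            return -1;
        printf("frame %d in buffer %u (%u bytes)\n",
               frame, buf.index, buf.bytesused);
        if (ioctl(fd, VIDIOC_QBUF, &buf) < 0)
            return -1;
    }

    ioctl(fd, VIDIOC_STREAMOFF, &type);
    return 0;
}
```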
That was for very simple drivers, but you might also have more complicated formats. YUV formats, for example, might store the three Y, U and V components in one, two or three separate buffers, so you basically have from one to three buffers, depending on the actual format you want to use, to queue and dequeue. This is supported in V4L through a different capture type: you declare a different capability than the one you expose when you only support single-plane formats. The callbacks to implement in that case are different, but not by much. Converting a driver that only supports single-plane formats into one that supports multi-plane formats is almost trivial: you basically have the same arguments, but instead of taking a single structure you take a list of them most of the time, so you iterate through the arguments instead of just using them. It's not much more difficult. We should probably have started with the name: formats that require storing the various components of the image in different buffers are called multi-planar. And once again you have weird variations of these multi-planar formats. With YUV, for example, you could have Y, U and V packed into a single buffer, or Y in one buffer and U and V in a second one, or Y, U and V in three different buffers. So it's really up to you, and to the hardware you actually support, to know what to implement.

You might also have a more complicated setup. For example, you might have some kind of image processing engine in addition to the controller, which is optional and can be added to your pipeline or not, and which has its own set of controls. In that case you need to use another API called the Media Controller API, which exposes all those devices, sensor included, as different device files in your system. It also lets you enumerate your pipeline, and its topology, through that same API, and there is a dedicated tool to manipulate it called media-ctl. It might even simplify your driver, because in some cases the format negotiation is not done in the kernel anymore: it's up to the user to set a format on all the different nodes that is actually consistent and can be used. But you also gain a few things, for example when you have the same control on different devices in your pipeline: you might be able to change the brightness on both sides, and in that case it's easier to deal with, because you have two different entities with the same set of controls and you can tune each of them, with the same tool. I think I went too fast anyway.
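Going back to the multi-planar formats for a moment, here is a hedged sketch of what requesting one looks like from user space; the NV12M format, the resolution and the already-open file descriptor are assumptions chosen only to show that each plane then carries its own size and stride.

```c
/* Hedged sketch: request a multi-planar format (NV12M: Y plane plus an
 * interleaved CbCr plane) on a device exposing VIDEO_CAPTURE_MPLANE. */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

static int set_mplane_format(int fd)
{
    struct v4l2_format fmt;

    memset(&fmt, 0, sizeof(fmt));
    fmt.type = V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE;
    fmt.fmt.pix_mp.width = 640;
    fmt.fmt.pix_mp.height = 480;
    fmt.fmt.pix_mp.pixelformat = V4L2_PIX_FMT_NV12M; /* 2 planes: Y, CbCr */
    fmt.fmt.pix_mp.num_planes = 2;

    if (ioctl(fd, VIDIOC_S_FMT, &fmt) < 0)
        return -1;

    /* Each plane has its own size and line stride; when queuing buffers
     * you pass an array of struct v4l2_plane instead of a single
     * offset/length pair. */
    for (unsigned int i = 0; i < fmt.fmt.pix_mp.num_planes; i++)
        printf("plane %u: %u bytes, stride %u\n", i,
               fmt.fmt.pix_mp.plane_fmt[i].sizeimage,
               fmt.fmt.pix_mp.plane_fmt[i].bytesperline);

    return 0;
}
```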
And then you have a bunch of tests and tools that are actually quite interesting. The first one is an application called v4l2-compliance which, coming from basically outside of the V4L2 world, is something awesome: it's a tool you run in user space that tells you whether you are actually implementing V4L2 right or not. It queries the formats, tries to do various things, and at the end you know whether you did things right, at least from the user-space point of view. It's the best tool ever when you are actually developing a driver; I wish other frameworks had something like it, because it's awesome. You also have the whole set of other V4L2 utilities; v4l2-ctl, for example, can give you some information that is useful for debugging while you are writing your driver. There's an application called yavta which is also very nice, because it allows you to grab some frames and put them into a file instead of displaying them. So if you want to do some headless development — for example because you have a development board sitting on your desk that is not connected to any display — or if you want to inspect your frames or make sure they are all the same, using a CRC for example, it's very convenient. And then the final set is basically all the V4L-enabled applications out there. There are a bunch of them; I used Cheese, for example, because I'm using GNOME and GTK, but there are probably many, many others.

So this covers only the part about grabbing a frame from a camera and putting it into memory; it doesn't do anything useful beyond that. An application using V4L might display the frame using the X server, for example, or whatever it uses, but you might have something smarter to do, and that's probably what we are going to do at some point. The first thing is integration with DRM. In our case — I don't know if it's the case for most systems, but I would guess so — the camera and the display engine can actually work with the same formats. So you can allocate a plane that will be rendered directly on your video output, directly in the output format of the camera, which is very nice, because then you can basically start capturing frames from the camera and put them straight into DRM without any CPU intervention, without any copy, without any format conversion, and without having to do any kind of composition. It's all done in hardware, through buffer sharing, so it would be very, very nice to support. The display engine is even able to rescale the video, or rotate it, or things like that, directly in hardware, so it's something we want to support at some point. I'm not sure how it should be done in user space: probably something like GStreamer, or maybe there's some kind of setup file somewhere in user space, like ALSA has for sound cards — I don't know, maybe you will tell me, I hope so. I'm not sure X is able to work with planes either, so that might be challenging. But maybe see you next year to talk about that.

The next thing is that we also had a different V4L project going on this summer, which was to support the hardware video decoder in these SoCs. That hardware decoder is actually also able to encode. It's probably going to be an even longer shot, because the whole VPU is entirely closed: there's no documentation for it, so it's all based on reverse engineering. And it doesn't work for encoding yet, I think — at least not entirely; there's some partial reverse engineering happening, but it's not complete yet. But decoding definitely works for now, for some codecs and most of the image formats. And it would be very nice to be able to encode frames into, for example, H.264 directly through the VPU, without doing any kind of copies either: just give the frames to the VPU directly and grab back the compressed frames.
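Whether the consumer is the DRM display engine or the VPU, the V4L2 piece of that zero-copy sharing is the DMA-BUF export; here is a rough sketch — an assumption about how one might wire it up, not the CHIP implementation — of exporting one capture buffer as a DMA-BUF file descriptor that a DRM client could then import (for example with drmPrimeFDToHandle).

```c
/* Rough sketch: export an already set-up V4L2 MMAP capture buffer as a
 * DMA-BUF file descriptor with VIDIOC_EXPBUF. Only the V4L2 side is shown;
 * the returned fd can be imported by another subsystem such as DRM. */
#include <string.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

static int export_capture_buffer(int video_fd, unsigned int index)
{
    struct v4l2_exportbuffer expbuf;

    memset(&expbuf, 0, sizeof(expbuf));
    expbuf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    expbuf.index = index;
    expbuf.flags = O_CLOEXEC;

    if (ioctl(video_fd, VIDIOC_EXPBUF, &expbuf) < 0)
        return -1;

    /* expbuf.fd now refers to the same memory as the capture buffer and
     * can be handed to a DMA-BUF-aware consumer without any copy. */
    return expbuf.fd;
}
```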
So yeah, I went way too fast. If you have any questions, we have a lot of time, and we have microphones as well. No one? I was fast, and you don't have any questions. Wow.

Yeah — so what is the overhead for the CPU for the whole frame grabbing and buffer management? Is there any significant overhead, or is it just below one percent of CPU usage?

For the buffer management?

For video capturing, say. The whole video streaming part, taking it from the hardware and putting it into user space.

We're actually not that far yet; I'm still grabbing frames directly into a file, so I'm not exactly sure what the display overhead would be, and I haven't really benchmarked it yet. Judging from the actual code, the CPU doesn't intervene much: it's basically handling interrupts, flipping the buffers and passing them to user space, but it doesn't do any kind of format conversion, compositing or anything like that, as far as capturing is concerned. It's when you want to display the frames that you need to try to be smart, and that's why we want to go through the display engine for that, because there the CPU doesn't intervene either; it's basically all done in hardware. But yeah, I'm guessing that for the capturing part it's not that bad. We don't have a lot of interrupts, because it's just video, so you'll have something like a few dozen frames per second, and that's it — a few dozen interrupts as well.

Maybe the same question for the latency: between the time a frame is released by the sensor hardware and the time it is available in user space, do you have any feeling or data for that?

I don't, but I don't think it would matter that much. There's no buffering taking place between the sensor and the DMA, so what the sensor outputs goes almost directly into memory. So there is no added latency; you just have to wait until the frame is complete, and that is your latency, basically.

I missed the first few minutes of the presentation, so maybe you already said this: what is the approximate performance you can get out of this hardware, just streaming to memory? Can you do HD?

Not with this generation of SoC. It's actually limited to — I'm not sure which VGA variant it is, but it's 640 by... 320, I think? 480. So it's very, very low, actually. The newer generations of Allwinner SoCs can do much more than that, but yeah, this one is not very good in that regard.

And snapshots from the sensor — for example a 5 megapixel snapshot every few seconds — that should work?

Yeah. It should, and if it doesn't, it's a bug and you should report it to me.

You mentioned, in the future developments, things about scaling and how that could be handled by user space, and you had open questions there. Could you elaborate a bit more on the problems that you see?

Well, I'm not exactly sure which user-space component would make the link between the frames that are grabbed through V4L and putting them into a DRM plane, basically — with the right scaling ratio and so on. But that's a different question, sorry.

Okay. So your goal is displaying the video that's captured by the camera directly on a DRM device?

Yeah, directly to a DRM plane.

Okay. So in that case you need a user-space application that handles that. There are multiple options, but you need a dedicated application.
You can't usually expect your display server to go directly to the camera device. So GStreamer is definitely one option there. It could be a custom application as well. But as long as you share buffers with DRM, for proper performance, then it's just a matter of setting up the capture pipeline, indeed, with the scaler.

Okay. So a camera application should, if it's DRM based, work in that context. Okay. Nice.

You said that the controller and the actual sensor are kind of separate, so you have really independent drivers. Does that mean that if you have a sensor that works on one SoC, it will work with another controller, on another SoC?

From a theoretical point of view, yeah, definitely. But most of the time — the media controller API I was mentioning, for example, is not supported by all sensors, and DT support is also not there in every sensor driver. So V4L allows you to do that, but the state of each individual driver might not. And in that case, the proper fix is just fixing the sensor driver, and that's it.

Just to follow up on that question: I just wanted to know what a use case for the media controller API is — a general use case, I mean — to understand it better.

There are a few of them. The first one: it's not obvious from these slides, but on other camera interfaces — I'm guessing the OMAP3 ISP is a good example of that — you don't have just one controller sitting next to the sensor; you actually have a very complex pipeline, where your video capture stream can take multiple paths, and you have to set that up somewhere. In that case, since you basically have something like 20 or 25 components, you want to be able to select the controls exposed for each of the components in the path, and there the media controller API is very, very interesting. I'm guessing it's also convenient to be able to enumerate devices, because otherwise you don't have any way to enumerate the pipeline and so on.

So basically some kind of IPU processing pipeline?

If by IPU you mean an image processing unit, it might be somewhere in that pipeline.

The basic idea behind the media controller is to expose more of the device internals to user space. We started with the Video4Linux API, which had a model of the device that was quite simple, and then over time we realized that for really complex devices, when you have a complex processing pipeline, it wasn't possible to just use that abstraction and have the kernel driver handle the hardware in a meaningful way. If you have multiple processing blocks that can all do the same kind of operations, it's use-case dependent whether you use one or the other, and we didn't want to have that kind of policy in kernel space. That's what drove the development of the media controller API: exposing the complexity to user space and letting user space control the device with finer-grained control. That also means that you need, in user space, a more complex piece of code that knows more about your device, to be able to use it in a meaningful way.

I guess that's it. Thanks for attending. If you have any questions, I'll be around the conference for the next two days.