Hi everyone, my name is Maxime and I'm going to talk about video cameras in Linux, in this talk called "From the camera sensor to the user: the journey of a video frame". As you can probably guess, I'm going to talk about video cameras and their support in Linux, although I'm not going to focus on the software point of view, but rather on the hardware itself. Let me just introduce myself: I'm Maxime, I've been working at Bootlin for almost three years now, mainly on networking topics, networking PHYs and so on. So when I had the chance a few months back to work on a project involving a video camera, I took kind of the same approach to understand everything that is going on: first understand how the hardware works, and then see how it is supported in software.

The goals of this talk are to discover the various hardware components that are involved in a video camera, understand how everything fits together and the various configurations that you can find, and also see some real-life designs as examples.

So first, let's talk about the acquisition hardware itself. To acquire an image, what you need is what we call a sensor. The sensor is going to acquire the image, transforming an optical signal into an electrical signal. Before this sensor, you have a number of optical elements that are there to focus the incoming light, filter it and alter it in some ways. And after the sensor, you will have some components that are dedicated to transcoding the signal from what the sensor can output into something your system on chip can understand. Sometimes you don't even need this signal transcoding layer, but you have to know that in some cases you will need an intermediary component. So we will discuss each and every one of these components, and also how everything interacts with the Linux kernel, because we are at ELCE.

So let's get back to the basics: what is a sensor and how do we capture an image? The basic image acquisition setup that I present here is what you are going to find in almost all smartphone cameras, for example, and the USB camera I am recording this talk with is basically using the same kind of setup. To acquire an image, you need three things. The first thing that you need is an optical signal, so basically incoming light. Incoming light means that you have to be filming something that is in a bright environment, so in some cases you will need some way to light the scene that you are filming. On smartphones you will find almost all of the time a powerful LED that is very bright and serves this purpose. Next you need some optical elements, represented here by the lens and the voice coil; this is how you deal with focusing the incoming light beams. And then you have the sensor itself, which does the optical-to-electrical conversion.

So first let's talk about the lens itself. The lens controls the focus of the incoming light; it is the first thing the incoming light encounters on its way to your sensor. Adjusting the lens position is important because it determines which parts of the image are blurry and which are sharp, and depending on the distance of the object you are filming, you will want to change the focus and therefore physically move the lens closer to or further away from your sensor.
Most of the time, in compact sensors, this lens movement is done through a voice coil actuator. This kind of actuator is very similar in principle to what you find in an audio speaker, for example. The basic idea is that you have a copper wire coil attached to the lens, and this coil sits inside a static magnetic field generated by permanent magnets. When you pass a current through this coil, you actually move the lens inside the magnetic field. So this is how we actuate the lenses inside our systems. What you have to do to control the position is control the intensity of the current that you pass through the coil, and this is done either through dedicated chips or just with a simple digital-to-analog converter and the proper analog circuitry. When you implement drivers for that, you have the proper support with the media controller API, and all of the existing voice coil drivers that are out there and supported in Linux are driven through I2C, so this makes them pretty simple to integrate into the system.

Next you have the flashlight. It's simply a high-power LED, so you can control it through a LED driver or just through a GPIO, but you can also find dedicated chips to drive these kinds of flashlights. These chips will typically have input signals such as a strobe signal, to get very precise timing of when you turn the light on and off. Sometimes this is also used to control what we call the privacy indicator, which is the tiny LED that sits next to your sensor and that is on when you are filming and off when the acquisition is not ongoing. This is very common on laptops, for example, so that you know the camera is currently filming you. Once again there are drivers for that in Linux, controlled through I2C, and you also have the proper support in the media controller API and Video4Linux.

And next we're going to talk about the biggest part, which is the sensor itself. The goal of the sensor is to convert an incoming optical signal into an electrical signal. To be able to get a full image, you divide the real scene into small elements of the picture called pixels, and the sensor does this subdivision for you. For digital sensors there are two widely used technologies, CCD and CMOS, although CMOS is the most widely used one as of today. The basic principle is that each pixel on your sensor detects how much light hits it during a given amount of time, and this is then converted into a voltage. This voltage is analog and it is read by internal ADCs, but because it is very, very tiny, it first has to go through an amplifying stage. This amplifier has to be controllable by the user, because this allows you to shoot scenes in a very bright environment, where you set your amplifier to a very low gain, or in a very dark environment, where you set your amplifier to a very high gain, but in that case you will also be amplifying the noise from your sensor array. After that, some very advanced sensors have a built-in image signal processor, but not all of them; we will see what this is for just after. Then you have some internal queuing mechanisms, and the data output is done through a dedicated interface. So on most sensors you have a control plane, mostly through I2C, and then a data interface.
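To make this a bit more concrete, here is a minimal sketch of what poking one of these I2C-controlled components from user space can look like, assuming the driver exposes a standard V4L2 control on a sub-device node. The /dev/v4l-subdev0 path, the choice of V4L2_CID_ANALOGUE_GAIN and the value are just assumptions for illustration; a lens or a flash would be driven the same way through its own controls:

```c
/* Minimal sketch: set a sensor control through its V4L2 sub-device node.
 * The device path, control and value below are illustrative assumptions. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/videodev2.h>

int main(void)
{
    int fd = open("/dev/v4l-subdev0", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct v4l2_control ctrl = {
        .id = V4L2_CID_ANALOGUE_GAIN,
        .value = 64,            /* driver-specific gain units */
    };

    if (ioctl(fd, VIDIOC_S_CTRL, &ctrl) < 0)
        perror("VIDIOC_S_CTRL");

    close(fd);
    return 0;
}
```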
For the data interface, we're going to see which technologies we can use; the CSI interface is one example. Modern sensors are used to acquire color images. Our human eyes can see three colors: red, green and blue. So we put filters onto the sensor, so that each pixel measures the amount of incoming light of one particular color. Here is an example of the arrangement of the color grid on the sensor: as you can see, the first line acquires blue and green, blue and green pixels, and the next line acquires green and red, green and red, and so on. This data is not usable as is. As you can see, there are twice as many green pixels as there are of the other colors, so this needs to be compensated for. Sensors usually acquire more green than the other colors because our eyes are actually more sensitive to green. The action of converting this raw data coming from the sensor into a three-component vector called a pixel is called debayering, or sometimes demosaicing. This can be done right in the sensor, when the sensor embeds an image signal processor, or the data can just be transmitted as is to the CPU; in that case, we talk about raw data. Then you also need to know in which order your sensor is going to send the data to the CPU: it can be blue, green, green, red, or red, green, green, blue, and so on; there are lots and lots of possible orders in which the colors can be sent. So transforming this raw image into a usable image is potentially a costly operation involving some heavy algorithms.

So let's talk about this raw interface. The basic way to transfer data from a sensor to a CPU is a raw parallel interface. It's pretty simple: you have a number of data lines matching the precision of the analog-to-digital conversion embedded in your sensor. So if your ADC has a resolution of eight bits, you will probably find eight parallel data lines on your raw interface, and it can go up to 12 bits. This data is synchronized with the pixel clock, which ticks each time new data is put on the data lines. And then you have some synchronization signals. One is the h-sync, the horizontal sync, which indicates when the sensor is done transmitting a full line of pixels. And then you have the v-sync, the vertical synchronization signal, which toggles each time a full frame has been transmitted. With this information, the CPU on the other side is able to reconstruct the image just from the synchronization signals and the data.

Another widely used interface is the Compact Camera Port 2, the CCP2. It is a serialized interface, so contrary to a parallel interface, data is sent one bit after the other. In that case, you can notice that there are no h-sync and v-sync signals: this synchronization information is actually embedded into the data. So what you have is just a clock lane and a data lane, each using a differential pair, and only one data lane. So with just four physical lines you can connect your sensor to your CPU. This is very useful, but it is pretty limited in bandwidth: it can only go up to 650 megabits per second.
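Before moving on to the faster interfaces, here is a quick illustration of the debayering step mentioned earlier: a minimal sketch, nothing like what a real ISP does, that takes the blue/green, green/red arrangement described above and produces one RGB pixel per 2x2 cell, so the output ends up half the size in each dimension:

```c
/* Minimal sketch: naive debayering of a BGGR raw frame into RGB.
 * Each 2x2 cell becomes one output pixel; width and height are even. */
#include <stdint.h>

void debayer_bggr(const uint8_t *raw, unsigned int width, unsigned int height,
                  uint8_t *rgb /* (width / 2) * (height / 2) * 3 bytes */)
{
    for (unsigned int y = 0; y < height; y += 2) {
        for (unsigned int x = 0; x < width; x += 2) {
            uint8_t b  = raw[y * width + x];            /* top-left: blue     */
            uint8_t g1 = raw[y * width + x + 1];        /* top-right: green   */
            uint8_t g2 = raw[(y + 1) * width + x];      /* bottom-left: green */
            uint8_t r  = raw[(y + 1) * width + x + 1];  /* bottom-right: red  */

            uint8_t *out = &rgb[((y / 2) * (width / 2) + x / 2) * 3];
            out[0] = r;
            out[1] = (uint8_t)((g1 + g2) / 2);          /* average the two greens */
            out[2] = b;
        }
    }
}
```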
Next there is a very widely used interface; it's actually a standard, the CSI standard from the MIPI Alliance. It is very familiar to people who are used to working with networks, because it is divided into several layers: a PHY layer dealing with the physical transmission of the data over electrical signals; then a lane management layer, which is mainly about splitting the data over multiple lanes to transmit your information; then a low-level protocol layer, which deals with checksumming and error correction; and then the application layer, which mostly deals with the pixel-to-byte conversion, that is, how you represent a pixel on the wire in your CSI interface. Version two of this standard is the most widely used, although there exists a CSI version three. An interesting thing about this standard is that the PHY layer is actually shared with another widely used standard, the DSI standard, which deals with displays. So the PHY layer is exactly the same for DSI and CSI.

So let's talk about this PHY layer. There are several variants of it. The one that is found in most cases is the D-PHY. As you can see, it is pretty similar to CCP2: it's also a serialized interface, and the synchronization signals are also embedded into the data. The main difference is that you can have multiple data lanes to transmit data in parallel. You can have up to four lanes and go up to six gigabits per second, so this is starting to be a very fast interface.

And then the other PHY, not as widely used, that you can use with CSI is the C-PHY. This one is much more complex, and I'm going to go into a bit more detail about it, because what we saw up to now is pretty familiar stuff: we are used to seeing clock lanes and differential pairs to transmit high-bandwidth data, coming from the networking world. However, the C-PHY works very differently. You also have the synchronization information included inside the data, and the clock is embedded into the data as well; that part is still familiar, the clock is reconstructed on the other side of the link. However, instead of using a differential pair, we use a trio of wires: the data is conveyed on three wires working all together. And instead of using just binary levels to transmit the data, so high and low signals, on each line you have three-level signals: high, medium and low electrical voltages. And you have some constraints: for example, two lines cannot have the same level at a given time, so if one line is high, the next one must be medium and the third one must be low. Therefore there are only six combinations of signals that you can have on these three lines. Moreover, in order to be able to reconstruct the clock on the other side, each time you send data, so a symbol, you must not send the same symbol twice in a row on the lane, so that at each clock tick all of your signal levels change. This limits us to only five possible symbols that can be sent over the wire, and this is why this is an interface working in a quinary system, base five. In order to convert the incoming binary data into these quinary symbols, a 16-to-7 conversion ratio is used: 16 bits of data are transmitted using only seven quinary symbols. This allows us to drastically reduce the clock rate on the lanes. And you can use at most three lanes in parallel, going up to 41 gigabits per second. So this is a very fast physical layer, but it's not as straightforward to understand and implement as the other ones, and this is why it's not as prevalent as the D-PHY.
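A quick sanity check of why this 16-bit to 7-symbol mapping works: with five usable symbols per transition, seven symbols offer more combinations than 16 bits require. A minimal sketch of the arithmetic:

```c
/* Minimal sketch: 5^7 = 78125 >= 2^16 = 65536, so 16 bits fit into 7
 * quinary symbols, i.e. roughly 2.29 bits carried per symbol. */
#include <stdio.h>

int main(void)
{
    unsigned long combos = 1;

    for (int i = 0; i < 7; i++)
        combos *= 5;            /* five possible symbols per slot */

    printf("5^7 = %lu, 2^16 = %u -> %s\n", combos, 1u << 16,
           combos >= (1u << 16) ? "mapping fits" : "mapping does not fit");
    printf("effective bits per symbol: %.2f\n", 16.0 / 7.0);
    return 0;
}
```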
Let's talk a bit about analog now. What we saw up to this point were digital transmission protocols; we were assuming that we were dealing with a digital sensor, but analog video representation is still something that is going on. It has a very, very long history and a lot of legacy, starting almost a hundred years ago with the first video broadcasts. Back then, they had some interesting constraints. For example, one thing they had to deal with is that some people had black-and-white televisions in their houses and other people had color televisions. How do you send an analog video signal that is compatible with both black-and-white televisions and color TVs? You cannot just send an RGB signal, because a black-and-white television would then display only the red component, or the green component, or the blue component, and this would give a very distorted view of the original image. Instead, what they did is shift the color space. There were three main standards at the time: PAL, NTSC and SECAM. PAL was mostly used in Europe, NTSC in the USA and Japan, and SECAM was mostly used in France, Eastern Europe and the former USSR. The main idea of shifting the color space is to have one component representing the black-and-white signal, which is called the luminance or luma signal, and then two other components carrying the information about the colors; these are called the chrominance signals, or chroma. By doing so, you get two main advantages. The first one is that if you transmit your luminance signal over the frequency that the black-and-white televisions were designed for, they are still going to display your black-and-white image correctly. If you then also transmit the chrominance information on another carrier frequency, you can design color TVs that are compatible with the same signals. The other main advantage is that the human eye is much more sensitive to the black-and-white information than to the color information: when the brain interprets what it is seeing, it mostly bases its analysis on the black-and-white luminance information. This means that we can compress the chrominance information with losses and therefore save bandwidth when transmitting the data. This is still used as of today in some cases.

These standards were designed to define how we transmit video signals over analog links. Most of the time, the TVs synchronized their clocks to the power grid of the country. For PAL, so in Europe, we have a 50-hertz electrical grid, and therefore most of the standards using PAL to transmit video run at either 50 frames per second or 25 frames per second. For NTSC, so in the USA, the power grid is at 60 hertz; that's why NTSC mostly deals with 30 or 60 frames per second signals. Now, 25 frames per second is pretty low, and the human eye starts to notice that the image is not very smooth. So in order to increase the perceived frame rate, they invented the process of interlacing. This means that when the video was acquired by the camera, for one frame they only acquired the odd lines, and for the next frame they acquired the even lines. So each frame only represents half of the image, with a gap between each line, and then you transmit this information: you only have to transmit half as much information as if you were transmitting a full frame.
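Coming back for a moment to the luminance/chrominance split, here is a minimal sketch of how a single RGB pixel can be turned into one luma and two chroma components, using the classic BT.601 weights; the exact coefficients and signal ranges differ between the analog standards, so treat this purely as an illustration:

```c
/* Minimal sketch: split an RGB pixel into luma (Y) and chroma (Cb, Cr)
 * using BT.601-style weights. A black-and-white receiver only needs Y. */
#include <stdint.h>

static uint8_t clamp_u8(double v)
{
    if (v < 0.0)
        return 0;
    if (v > 255.0)
        return 255;
    return (uint8_t)(v + 0.5);
}

void rgb_to_ycbcr(uint8_t r, uint8_t g, uint8_t b,
                  uint8_t *y, uint8_t *cb, uint8_t *cr)
{
    double luma = 0.299 * r + 0.587 * g + 0.114 * b; /* weighted sum */

    *y  = clamp_u8(luma);
    *cb = clamp_u8(128.0 + 0.564 * (b - luma));      /* blue-difference chroma */
    *cr = clamp_u8(128.0 + 0.713 * (r - luma));      /* red-difference chroma  */
}
```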
Back to interlacing: this half image is called a field, and you transmit these fields at 50 fields per second. When displaying that on a CRT television, the CRT actually displays the image with exactly the same process: it displays every other line, switching back and forth between odd and even lines at each field. So this was very smooth to watch on a CRT television. When you want to display that on a more modern digital screen, you don't have the same display mechanism in place, and therefore you have to display a full frame at any given time. This means that you have to wait for the two fields to come in, join them and display them. But when you do so, you start to notice some very bad artifacts, as shown in this picture: when the object you are filming is moving very fast horizontally, you get interlacing artifacts showing a double image with some weird shadowing effects. Removing these artifacts is a process called de-interlacing, and de-interlacing is not easy to do at all. So if you want to have a de-interlacing component inside your video pipeline, so that you can display a proper image coming from an analog source, you will probably need either an advanced image signal processor or to use your CPU to do the conversion.

So there exist decoders that are used to convert an analog signal into a digital one. I had to work with one of those; that's why I'm talking about this. They are also supported with the Video4Linux and media controller frameworks. The goal is to convert analog signals into digital signals. Most of the time these decoders support all of the existing standards, and they can embed a small image signal processor to do some very basic processing on the image, such as cropping and scaling, and the advanced ones can also do some de-interlacing. There are standards for conveying this converted data over a digital interface: the BT.656 and BT.1120 standards. They define how to represent the decoded data on, most of the time, a parallel interface. In that case, what you transmit over the interface is usually images in the YUV color space.

And then you have the host interface. This is what you find inside your system on chip. What I show you here is a very basic interface; you have to know that every vendor has its own way of doing things, sometimes including image processors, sometimes not, sometimes with a firmware running inside, and so on. Basically, your host interface is a collection of hardware blocks inside your SoC. You have a block dealing with the PHY, so decoding the incoming signal, and then you have some image processing going on, possibly scaling or cropping, de-interlacing, changing the format of the pixels, and so on. And then the camera interface stores all this into a buffer in memory using DMA. You have full support for all of these blocks with the Video4Linux framework.

Inside the ISP, the image signal processor, you can do a lot of various things. Sometimes it is a very, very small block embedded inside your camera interface that just deals with cropping, for example. Cropping is the action of removing areas from an image, so getting rid of information that you don't really care about, to resize your image.
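Cropping really is as simple as it sounds; a minimal sketch for an 8-bit grayscale image:

```c
/* Minimal sketch: crop a sub-rectangle out of a larger 8-bit image. */
#include <stdint.h>
#include <string.h>

void crop(const uint8_t *src, unsigned int src_width,
          unsigned int x, unsigned int y,   /* top-left corner of the crop */
          unsigned int crop_width, unsigned int crop_height,
          uint8_t *dst)                     /* crop_width * crop_height bytes */
{
    for (unsigned int row = 0; row < crop_height; row++)
        memcpy(&dst[row * crop_width],
               &src[(y + row) * src_width + x],
               crop_width);
}
```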
Scaling, on the other hand, is changing the dimensions of your image without losing information, or at least losing as little information as possible: you still want the full frame to be displayed, just in a larger or smaller format. This requires fairly complex algorithms, so the scaling operation is an advanced feature. Then you have de-interlacing, so recomposing an interlaced stream. Just joining the fields back together is very easy to do; however, removing the artifacts requires heavy image processing. Then you have actions such as changing the pixel format, so debayering, for example, converting raw data from a sensor into pixels, or changing the color space. And then you have a collection of algorithms which are called the 3A algorithms: auto exposure, auto focus and auto white balance.

Auto exposure means adjusting the brightness of an image. In my case here, I think I'm a bit too white, so my sensor's gain is a bit too high. The process of adjusting that requires a feedback loop with the sensor: you basically measure how bright your image is and then lower or raise your sensor's gain. By software, of course, but some image signal processors can help you do that measurement for you. The auto focus process is the action of moving the lens back and forth, making sure the subject stays in focus. This is also pretty complex to do, and you also have to feed the information back to your lens driver. The auto white balance, on the other hand, is purely an action on the software side, or it can be offloaded, but you don't need any feedback to the sensor. It's the action of making sure that the whites in your image are really white. This is pretty common to see: when people are filming inside, the image tends to be a bit orange, and when they are filming outside, the image tends to be a bit blue. You can adjust so that the whites are really white and not orange or blue, and you also have to change all of the other colors so that they match this correction. There are powerful algorithms to do so, and this can also be offloaded to image signal processors. There was a very interesting talk earlier today by Helen, who was talking about upstreaming a driver for the ISP on Rockchip platforms, I think.

So let me give you a few examples of devices that are supported in Linux. One that I find very interesting is the Nokia N900. This is a very old platform, it dates back to 2009 I think, and what is interesting is that it has full Linux support and the camera is completely controlled through Linux: you have the flash LED driver, the lens voice coil driver and the sensor driver, and everything is controlled by the CPU. So if you want to write a sensor driver that is fully supported in Linux, this is a good example to look at; you can find the device tree inside the Linux kernel.

Another example is a project that I've been working on, so I can only give you a few details about it. It's based on a system on chip, and one of its particularities is that we have an analog video source that can be PAL or NTSC, and we don't really know which one when we start streaming, so we have to perform some automatic detection inside the decoder. The decoder converts this analog data into digital data, which is conveyed over a parallel bus using BT.656 to the VIP camera interface from Rockchip. The stream is interlaced, and the output of the decoder is interlaced too.
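Since the stream is interlaced, the two fields have to be joined back into full frames at some point. Here is a minimal sketch of the simplest possible join, often called weaving, which leaves the motion artifacts in place; 8-bit grayscale is assumed for simplicity, and this is not the project's actual code:

```c
/* Minimal sketch: "weave" two fields back into one frame by interleaving
 * their lines. No attempt is made at hiding motion artifacts. */
#include <stdint.h>
#include <string.h>

void weave_fields(const uint8_t *top_field,    /* lines 0, 2, 4, ... */
                  const uint8_t *bottom_field, /* lines 1, 3, 5, ... */
                  unsigned int width, unsigned int height,
                  uint8_t *frame)              /* width * height bytes */
{
    for (unsigned int y = 0; y < height / 2; y++) {
        memcpy(&frame[(2 * y) * width],     &top_field[y * width],    width);
        memcpy(&frame[(2 * y + 1) * width], &bottom_field[y * width], width);
    }
}
```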
In our case, we don't do the image processing needed for real de-interlacing; we simply join the fields together, and therefore we have to accept that we have artifacts in the final video. As for the support in Linux, all the components that I've talked about are supported: you have support for lenses, flashlights, sensors, decoders and camera interfaces through the Video4Linux and media controller APIs, and the community is very welcoming. An interesting project to follow is the libcamera project; there was a very interesting talk about it, which explains how to deal with very complex cameras like that, with complex sensors and complex camera interfaces inside the SoC where the pipeline is very convoluted, and how to handle all of that in user space. This is very interesting and I encourage you to look at it.

So thank you for listening. In conclusion, these technologies can be a bit overwhelming when you are starting to discover the video world, but it's basically the same thing as in the networking world, so I was not very surprised by that. It's also interesting to see that the old analog technologies and terminologies still apply today: the legacy of a hundred years of video acquisition and transmission can still be useful, and the ideas are very interesting. The Linux support is very good; although there is a lot of different hardware, the support is there, the community is very friendly, and using Video4Linux and the media controller API you can have very, very complex use cases implemented in Linux. So thank you for listening, and I will now try to answer any questions you might have. Thank you.