We are from Pengutronix, and we want to show you a bit about what's next with the dma-buf API in the kernel and what we can do with it. As already said in the abstract, we'll go through this with a really simple video pipeline, using a GStreamer pipeline as the example, then add some hardware units to it and see what happens, and then look at the dma-buf part: why we do it, how we do it, what the problems with it are right now, and what we can do about them.

We're using GStreamer as a really flexible framework for all kinds of multimedia work. You can just plug elements together to form a pipeline; they may be software elements, or you may have hardware elements in there, which you can use from GStreamer just as well, and there's a lot of infrastructure already in place to make it really easy to build those pipelines.

So let's look at a simple software pipeline. This is almost the simplest pipeline you could imagine to play some video. On one side we have the UVC camera, which is just a USB camera; on the other side the DRM device to get something onto a screen; and in the middle a software scaler, which can also be a pass-through element that just copies data around if you don't need any scaling in your pipeline. How does it work? On both sides we have the concept of buffers. Buffers are just logical objects that you put your video data into or scan your video data out from, and they are always backed by some memory, be it system memory or some on-chip memory. In this simple pipeline we just mmap that memory into user space. So this part is all controlled by the kernel, this is the user-space GStreamer pipeline element, and we mmap the storage of the buffer; in this simple pipeline we just copy data from one mmap to the other to get it on screen.

So let's see what happens if we add a hardware element to this pipeline. It just looks a bit more complicated. We replace the software scaler with a hardware scaler, which is found on a lot of SoCs these days, so the scaling operation is accelerated and you don't have to spend your precious CPU time on it. In practice, we get a lot more buffers: the output buffer of the UVC camera, the input buffer of the hardware scaler, the output buffer of the hardware scaler, and finally the buffer we scan out from to get it onto the display. What we have to do now in GStreamer to make this work is mmap all those buffers into user space and copy data around. So even though we are no longer doing any real work on the CPU, because the scaling is now done in hardware, we still have to do two copies on the CPU, which might well be about the same amount of work as the scaling itself.

So what can we do about it? The simple solution that came up a couple of years ago is using user pointers in the Video4Linux API. Both of these elements are controlled by the Video4Linux API, and what we can do is take the pointer we got from the mmap and stuff it back in when we queue the buffer to the hardware scaler. The kernel then resolves this pointer to the actual storage, locks the backing pages, and the hardware scaler can read directly from that storage, so we save one copy here. The problem with this is that the exporter of the buffer, the owner of the storage, doesn't even know about it: there is now an importer using its memory. There is logic in place to make sure the memory doesn't disappear while the hardware scaler is using it, but nothing more than that.
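To make the user-pointer trick concrete, here is a minimal sketch of how queuing such a pointer might look from user space, assuming a V4L2 mem2mem scaler whose input queue is already set up; the scaler fd, the pointer, and the length are placeholders, and error handling is omitted:

```c
#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

/* Sketch only: queue a buffer to a V4L2 device using a pointer that was
 * obtained by mmap()ing another device's buffer. 'scaler_fd', 'ptr' and
 * 'length' are assumed to exist; error handling is omitted. */
static int queue_userptr(int scaler_fd, void *ptr, size_t length)
{
	struct v4l2_buffer buf;

	memset(&buf, 0, sizeof(buf));
	buf.type      = V4L2_BUF_TYPE_VIDEO_OUTPUT;
	buf.memory    = V4L2_MEMORY_USERPTR;   /* kernel resolves and pins the pages */
	buf.index     = 0;
	buf.m.userptr = (unsigned long)ptr;
	buf.length    = length;

	return ioctl(scaler_fd, VIDIOC_QBUF, &buf);
}
```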
On top of that, the user-pointer API is a single-subsystem solution, so it's only usable if both elements are Video4Linux elements. As soon as we get to the DRM side, we still need the mmap-and-copy to get things onto the screen.

This is where dma-buf comes into play. dma-buf is generic infrastructure in the Linux kernel that works across subsystems. What we do now is explicitly ask the exporter to export the buffer, and user space gets handed a file descriptor. The advantage is that you can actually hang information and operations onto this file descriptor: when it gets imported into the hardware element, the scaler here, the kernel can extract the DMA buffer with its operations from the fd again and call operations on it. We'll get to those operations in a moment. And as you can see, it's cross-subsystem, so you can export the handle from the Video4Linux side as the exporter, pass it through user space, then hand it to the DRM subsystem and make a framebuffer from it. So you save both copies. Great.

What happens when you pass the fd into some importing element is that it attaches to the dma-buf; this is just a call to dma_buf_attach(). That operation goes back to the exporter, and the exporter gives you an attachment, so the exporter now knows there is an importing element interested in the buffer. If you're no longer interested in the buffer at all, you detach from it to tell the exporter: okay, I don't want to do anything with this buffer again. Those are the one-time operations: you get handed the buffer, you attach to it, and once you're not interested anymore, you detach.

Now, every time you actually want to access the memory behind the dma-buf, you have to map the attachment. You pass in the attachment you got, and you get back a scatter-gather table, which is just a collection of all the pages backing the buffer. So every time you access the buffer, you map it, do your DMA with your hardware engine, and then unmap it to tell the exporter: okay, I'm not doing DMA with this buffer right now.

All of this sounds like a really good idea in principle. It's really easy to do, you save all those copies, and you're just passing around file descriptors, which is a well-known concept in user space.
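For illustration, a minimal sketch of this attach/map/unmap/detach flow as an importing kernel driver might do it, using the existing dma-buf API; the device pointer is assumed to exist and the actual hardware programming is left out:

```c
#include <linux/dma-buf.h>
#include <linux/err.h>
#include <linux/scatterlist.h>

/* Sketch only: import a dma-buf fd, attach, map it for one DMA transaction,
 * and clean up again. 'dev' is the importing device; the scaler would read
 * from this buffer, hence DMA_TO_DEVICE. */
static int import_and_use(struct device *dev, int fd)
{
	struct dma_buf *dmabuf;
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;
	int ret = 0;

	dmabuf = dma_buf_get(fd);              /* fd -> struct dma_buf */
	if (IS_ERR(dmabuf))
		return PTR_ERR(dmabuf);

	attach = dma_buf_attach(dmabuf, dev);  /* exporter learns about this importer */
	if (IS_ERR(attach)) {
		ret = PTR_ERR(attach);
		goto out_put;
	}

	/* Map before each DMA transaction ... */
	sgt = dma_buf_map_attachment(attach, DMA_TO_DEVICE);
	if (IS_ERR(sgt)) {
		ret = PTR_ERR(sgt);
		goto out_detach;
	}

	/* ... program the hardware with the scatter-gather list here ... */

	/* ... and unmap once the transaction is done. */
	dma_buf_unmap_attachment(attach, sgt, DMA_TO_DEVICE);

out_detach:
	dma_buf_detach(dmabuf, attach);        /* no longer interested in the buffer */
out_put:
	dma_buf_put(dmabuf);
	return ret;
}
```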
But what happens if the devices using the DMA buffers have different constraints on their memory access patterns? There could be different DMA windows. That's not really an issue on embedded systems right now, but think back to the PC, where you had ISA cards that could only address the lowest 16 MB of memory, then PCI cards that maybe could only access the first gigabyte, and so on. So you can have devices that can only reach certain memory regions, and they cannot use a buffer handed to them by a device that allocated it somewhere else.

Another constraint is contiguous versus paged memory. Many devices can only access a DMA buffer as one big block: it really has to be physically contiguous in memory, because there is no MMU between the device and the memory. So you cannot use paged memory, or even different MMU page sizes. If you think about IOMMUs, there might be some that use large pages for a small performance benefit, so you need chunks of memory that are physically contiguous but larger than your system memory page size.

The really common restriction we see today on embedded systems is devices that cannot do scatter-gather DMA on their own. They can't fetch from different portions of memory; they just get a start address and a size and fetch one linear chunk. And often there is no IOMMU available that could overcome this limitation. So DMA buffers have to be physically contiguous in a lot of places on embedded systems. By itself that's not really a problem, because we have CMA, the contiguous memory allocator, so you can make sure you get contiguous memory for the devices that need it.

But what happens if you have a mixed system like this one? Again a UVC camera, which can do scatter-gather DMA and therefore allocates its buffers as paged memory that is not physically contiguous. We export the buffer and import it into the hardware scaler, but the hardware scaler can only use contiguous memory. It tries to attach to the buffer, gets a scatter-gather table with more than one entry, and can't work with that. So the pipeline breaks here; there's no way this can work right now.

And now to our solution, which is transparent backing storage migration. I don't want to take the credit for this, because Philipp implemented all of it.

Okay. We have some prerequisites, because before we can do any useful kind of transparent migration, we have to know something about the constraints of the devices involved. One constraint that is already commonly known is the DMA mask inside every struct device, which describes the address bits that this device's DMA can drive. On the SoC systems we deal with, this is usually just 32 bits, all set, so not that interesting. But there is already more: inside struct device we have struct device_dma_parameters, which already has the properties not printed in bold on the slide, the maximum segment size and the segment boundary mask, which are properties related to IOMMUs. For our prototype we added two more. One is the minimum segment size, which can basically be used to get contiguous chunks of a minimum size: for example, if you have an IOMMU with 64 KB pages, you can put 64 KB in there and you will get 64 KB chunks that are physically contiguous in memory. The other is the maximum number of segments, which right now we only set to either one or the maximum value, meaning either the device can handle paged memory or it needs physically contiguous memory. If you put one in there, there will be only one entry in the scatter-gather table, and all devices that need physically contiguous memory can work with that.
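As a rough sketch of how a driver might describe these constraints: max_segment_size and segment_boundary_mask already exist in struct device_dma_parameters, while min_segment_size and max_segments are only the prototype additions described here and are not mainline API.

```c
#include <linux/device.h>
#include <linux/kernel.h>

/*
 * Sketch only: how a driver for a scaler that cannot do scatter-gather DMA
 * might describe its constraints at probe time. The min_segment_size and
 * max_segments fields are assumptions based on the prototype described in
 * this talk; they do not exist in the mainline structure.
 */
static void scaler_set_dma_constraints(struct device *dev,
				       struct device_dma_parameters *parms)
{
	parms->max_segment_size      = UINT_MAX;   /* no per-segment size limit */
	parms->segment_boundary_mask = ULONG_MAX;  /* no boundary restriction */
	parms->min_segment_size      = 0;          /* prototype field (assumed) */
	parms->max_segments          = 1;          /* prototype field: one entry,
						      i.e. physically contiguous */

	dev->dma_parms = parms;
}
```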
Okay, now that we can describe those constraints on devices, we need a way to allocate memory that fulfills them. Traditionally we use the DMA API to get contiguous chunks of memory: we request a size, we pass in some parameters, the DMA attributes, which also include the protection flags, and we get back a void pointer, which is a contiguous virtual mapping in kernel space or some kind of cookie, depending on which attributes were passed in, and a dma_addr_t, the start address of the memory in device address space.

So if you have no IOMMU, that is just the physical address of the memory given back by this API. It's not possible to get generic scatter-gather memory, paged memory that is spread around, out of this API, and we want to change that: we want a single point where you can put in your constraints and get back memory that is suitable for all the devices involved in the exchange. We implemented this for now as a test case only on ARM, and only with the contiguous memory allocator enabled; it is really just a proof of concept. What changed is that we pass in a scatter-gather table to be filled, and we pass in the constraints separately, which a device gave us for example. So we could pull the constraints out of a device, but we can also have one device request memory with arbitrary constraints. The other big change is that we don't get back a DMA address right away, because this memory is not allocated for just one device: maybe we need to map it for two different devices, and depending on whether there is an IOMMU in between, they will get different DMA addresses. And it doesn't get a kernel virtual mapping right away either, because if we just want to take a buffer out of one hardware module and put it into another, we don't necessarily need a virtual mapping at all.

The scatter-gather table that comes out of this is just a useful structure for describing memory that is spread around. You have a list, and the page link entry contains some magic bits in its two lowest bits, which say either that the table is at its end or that the entry is a chain link; otherwise it points to the struct page. Optionally the DMA address can be filled in, and each entry carries the length of its part, so consecutive pages can be described with just one element. So if you have contiguous memory, you get a scatter-gather table with a single element pointing to the first page, with the length of the whole contiguous region.

Okay, and now that we have this, we need a way to get back a virtual address for the CPU if needed, and a way to get the physical address, or the DMA address cookie, for the device to use. The device-side API already exists: if we take the scatterlist out of the scatter-gather table we got back from the new allocation, we can just call dma_map_sg() and get back a scatterlist with the DMA address entries filled in, which the device can use right away. For the CPU side we had to add a new function which just takes the scatter-gather table; the protection flags can now be set at this point, when we are mapping, depending on whether we want the mapping cached, write-combined, or something like that, and we get back a virtual mapping into kernel space which is virtually contiguous.
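A rough sketch of how this might look from a driver's point of view, under the assumptions above; dma_alloc_constrained() and dma_vmap_sg() are made-up names standing in for the prototype's new entry points, while dma_map_sg() is the existing device-side API:

```c
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/*
 * Sketch only. dma_alloc_constrained() and dma_vmap_sg() are hypothetical
 * names for the prototype's new entry points; they are not existing kernel
 * functions. dma_map_sg() is the existing API that fills in the per-device
 * DMA addresses. Error handling is minimal.
 */
static int alloc_and_map(struct device *dev,
			 struct device_dma_parameters *constraints,
			 size_t size, struct sg_table *sgt)
{
	void *vaddr;
	int nents;

	/* Allocate backing pages that satisfy the constraints; no DMA address
	 * and no kernel mapping yet, just a filled scatter-gather table. */
	if (dma_alloc_constrained(sgt, size, constraints))	/* hypothetical */
		return -ENOMEM;

	/* Per-device mapping: existing API, fills in the DMA addresses. */
	nents = dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
	if (!nents)
		return -EIO;

	/* Optional CPU mapping, only if the kernel needs to touch the data;
	 * cacheability or write-combining would be chosen at this point. */
	vaddr = dma_vmap_sg(sgt, PAGE_KERNEL);			/* hypothetical */
	(void)vaddr;

	return 0;
}
```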
Now that we have this method of getting memory that fulfills constraints given by the devices involved, we need a way for all devices to pool their constraints and find out which constraints have to be fulfilled for a buffer that is going to be used by several of them. The way we do this is to check at attachment time whether the current storage, if the exporting driver has already allocated memory, is compatible with the new attachment. If it is, we can just return the scatter-gather table that already exists. If it is not compatible, we have to wait until all other importers that may have already mapped this memory have unmapped it, so they are not using it anymore, and then we can reallocate the storage while the new importer is still waiting for its attachment. And after we reallocate the storage we can... sorry, I'm getting ahead of myself; let me first explain how the reallocation works.

First we try to find DMA parameters that are compatible with all of the attached devices. We added a small function for this which takes an array of the DMA parameters of all devices, plus the number of elements in that array, and pushes them together: it finds, for example, the minimum of the maximum segment sizes, the maximum of the minimum segment sizes, and the minimum of the number of segments that the devices support, and gives back a new set of constraints that all of the involved devices should support. If that is not possible, it returns an error, and we are back to where we were before we implemented all this.

So if we have a buffer exporter and multiple importers, we start by trying to fulfill the constraints of all involved devices. If that is not possible, there might be a second way: we just wait for all involved parties to unmap the buffer, and then, when one device tries to map it, we use only the constraints of that one device and of the exporter. Another device that is also attached will then no longer have access to the buffer, but at that moment it hasn't requested access, so that's not a problem; we can just migrate again later and let the other device use the memory then. And if even that is not possible, the last resort is to throw away the constraints of the exporter and use only the constraints of the importer: it is always possible, by copying with the CPU for example, to somehow get the buffer over to the importer. And all of this is transparent to user space.
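A minimal sketch of what such a merging helper could look like; the function name and the min_segment_size/max_segments fields are prototype assumptions from this talk, not existing kernel API:

```c
#include <linux/device.h>
#include <linux/kernel.h>

/*
 * Sketch only: merge the DMA parameters of all attached devices into one set
 * of constraints that every device can live with. The function name and the
 * min_segment_size/max_segments fields are prototype assumptions, not
 * mainline API.
 */
static int dma_parms_merge(struct device_dma_parameters *out,
			   const struct device_dma_parameters *parms,
			   unsigned int count)
{
	unsigned int i;

	out->max_segment_size      = UINT_MAX;
	out->segment_boundary_mask = ULONG_MAX;
	out->min_segment_size      = 0;		/* prototype field */
	out->max_segments          = UINT_MAX;	/* prototype field */

	for (i = 0; i < count; i++) {
		/* The tightest limit in each direction wins. */
		out->max_segment_size = min(out->max_segment_size,
					    parms[i].max_segment_size);
		out->min_segment_size = max(out->min_segment_size,
					    parms[i].min_segment_size);
		out->max_segments = min(out->max_segments,
					parms[i].max_segments);
		out->segment_boundary_mask &= parms[i].segment_boundary_mask;
	}

	/* No overlap, e.g. the largest allowed segment is smaller than the
	 * smallest required one: no single allocation can satisfy them all. */
	if (out->min_segment_size > out->max_segment_size)
		return -EINVAL;

	return 0;
}
```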
The parameters returned by this merging function are then used to allocate new storage with the DMA API, which gives us the new scatter-gather table. Then we have to move the content: we have the new buffer, we have the old buffer, and we copy the content over. The simplest way to do this is to map both buffers into the CPU's address space and do a memmove, but if the exporter itself has facilities to move the data, for example a GPU behind an MMU that can blit, the exporter always has the choice to implement it in a more efficient way, and in theory it would also be possible to use dedicated on-chip DMA engines, some system DMA controller, to do the migration more efficiently.

After we have migrated into the new storage, we can return the scatter-gather table we got from the allocation function, free the old buffer once the copy is done, and whatever user space wanted the involved elements to do just continues with the new buffer, which is now, hopefully, compatible with both exporter and importer.

Now, I talked about copying a lot, and the question is: why isn't this dead slow? The reason is that any sane user space, like GStreamer for example, reuses allocated buffers; this is actually a central concept of Video4Linux as well. At the beginning you allocate the buffers needed on the devices, you queue them into a device, let it fill them, dequeue them, queue them into the next device, let it use the memory, and then the buffers go back. So in our case, if the buffers come from the first device, the UVC camera, which requests paged memory for itself that is not compatible with our simple hardware scaler, we have buffers that the hardware scaler cannot read, so we have to copy them. But we only have, say, three buffers that we keep pushing around, so we have to copy three times at the beginning, and after that we have buffers which are compatible, which are contiguous for the hardware scaler, they get reused, and we never have to copy again. And again, all of this without any user-space involvement, except that user space has to use the dma-buf facilities that are available.

There are some strange corner cases, for example devices whose DMA parameters don't overlap at all, be it because the DMA windows don't match, or because one of them has a fixed page size of 64 KB and the other for some reason cannot deal with that. It will still work; we will just copy every single time. Then there is one more extreme corner case, which would be devices with memory that is not accessible to the CPU and no way to migrate to CPU memory on their own. But then I would ask you: do you know any such hardware? I don't. And even if you do know such hardware, why would you want to share its buffers? So for any hardware that can sensibly share buffers, this method gives some improvement over the situation before, if only that it works at all.

Okay, and one possible optimization, one that you really want to do, is to delay the allocation of the buffers. As Lucas said, the attachment and the mapping are separate, so there is no need to allocate the buffer at attach time, or even before that; you can allocate the buffer only when the first user tries to map it and wants to write data into it. So if user space has a way to present the buffer to all involved drivers before the first map, you can already gather all the constraints you need and allocate the buffer with the right constraints in the first place, and there will be no copying at all, provided user space is enhanced to first show the buffers around before starting the pipeline.
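A rough sketch of what such a lazily allocating exporter could look like, reusing the hypothetical helpers from the earlier sketches; the buffer structure, the helper names, and the omitted locking are all assumptions, while the map_dma_buf callback and the attachment list are existing dma-buf concepts:

```c
#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/scatterlist.h>

/* Hypothetical exporter-private buffer object, for illustration only. */
struct my_buffer {
	struct sg_table sgt;
	size_t size;
	bool allocated;
	struct device_dma_parameters parms[8];
};

/*
 * Sketch only: an exporter's map_dma_buf callback that defers the real
 * allocation until the first importer maps the buffer, so the constraints of
 * every device attached so far can be taken into account. dma_parms_merge()
 * and dma_alloc_constrained() are the hypothetical helpers sketched earlier;
 * error handling and locking are omitted.
 */
static struct sg_table *my_map_dma_buf(struct dma_buf_attachment *attach,
				       enum dma_data_direction dir)
{
	struct dma_buf *dmabuf = attach->dmabuf;
	struct my_buffer *buf = dmabuf->priv;
	struct dma_buf_attachment *a;
	struct device_dma_parameters merged;
	unsigned int n = 0;

	if (!buf->allocated) {
		/* First map: pool the constraints of everyone attached so far. */
		list_for_each_entry(a, &dmabuf->attachments, node) {
			if (a->dev->dma_parms && n < ARRAY_SIZE(buf->parms))
				buf->parms[n++] = *a->dev->dma_parms;
		}

		/* Merge them and allocate matching backing storage. */
		dma_parms_merge(&merged, buf->parms, n);		/* hypothetical */
		dma_alloc_constrained(&buf->sgt, buf->size, &merged);	/* hypothetical */
		buf->allocated = true;
	}

	/* Map the now-allocated pages for this particular importer. */
	dma_map_sg(attach->dev, buf->sgt.sgl, buf->sgt.nents, dir);

	return &buf->sgt;
}
```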
Okay, and I think that's it for the talk. If you have any questions so far, feel free, and I will try a quick demonstration; I hope this works, I'm not sure. Any questions?

Hopefully soon. Yeah, it's pretty invasive in the DMA API to get this new way to allocate things and to get the CPU mapping, and we have to implement it for more than just ARM, for x86 and whatever else, to reasonably get this into the mainline kernel. But we hope to get some patches out soon. Any other questions?

No; the in-kernel API is the one thing where we need to do something about all this. It's completely transparent to user space: user space only has to use the concept of dma-bufs, and a lot of user space is already using dma-bufs to share buffers between DMA devices. It's a file handle, an fd, you just pass it around and pass it into the appropriate ioctls to the kernel. So you just need to use that, and everything else is completely transparent to user space; all the nasty details of how the memory is laid out and so on are completely hidden from user space.
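To make that fd hand-off concrete, here is a minimal user-space sketch, assuming a V4L2 capture device as the exporter and a DRM device as the importer; the open device fds, the buffer index, and the width/height/pitch/format values are placeholders, and error handling is omitted:

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>
#include <xf86drm.h>
#include <xf86drmMode.h>
#include <drm_fourcc.h>

/*
 * Sketch only: export a V4L2 capture buffer as a dma-buf fd and turn it into
 * a DRM framebuffer. 'v4l2_fd' and 'drm_fd' are assumed to be open device
 * nodes; geometry and pixel format are placeholders.
 */
static int v4l2_buffer_to_drm_fb(int v4l2_fd, int drm_fd,
				 uint32_t width, uint32_t height,
				 uint32_t pitch, uint32_t *fb_id)
{
	struct v4l2_exportbuffer expbuf;
	uint32_t handle;
	uint32_t handles[4] = { 0 }, pitches[4] = { 0 }, offsets[4] = { 0 };

	/* Ask the V4L2 exporter for a dma-buf fd for buffer index 0. */
	memset(&expbuf, 0, sizeof(expbuf));
	expbuf.type  = V4L2_BUF_TYPE_VIDEO_CAPTURE;
	expbuf.index = 0;
	if (ioctl(v4l2_fd, VIDIOC_EXPBUF, &expbuf))
		return -1;

	/* Import the fd into DRM and create a framebuffer from it. */
	if (drmPrimeFDToHandle(drm_fd, expbuf.fd, &handle))
		return -1;

	handles[0] = handle;
	pitches[0] = pitch;
	return drmModeAddFB2(drm_fd, width, height, DRM_FORMAT_XRGB8888,
			     handles, pitches, offsets, fb_id, 0);
}
```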
Sorry, can you get me a console on there? No, it doesn't like it. I'm sorry, my laptop battery died and I don't know what this thing is doing. It's particularly unspectacular anyway: what you would have seen is the system booting with a black screen, I would have started the pipeline, and you would have seen live video from the USB camera played back on the screen, nothing more. The camera allocates buffers which are paged, the first three frames are copied, and because the system is otherwise idle the copying is very fast, so after those copies you would see nothing more than the camera pushing pixels to the screen.

Yeah, this all lives inside the drivers: the kernel driver knows the capabilities of the device, like it always has, and user space doesn't need to know about it, it just passes buffers around. The exporter creates the dma-buf and exports the file descriptor, the importer takes the file descriptor, converts it back into the dma-buf, and then calls operations on it, and from the dma-buf, which has a list of all attachments of all its users, we can check all users for their constraints and do something useful with that.

The things that need to change in DMA-capable drivers are mostly around buffer allocation, yes. You need to set up the struct device_dma_parameters structure we've shown; that is not filled in right now for all devices, so you need to fill it, and then just use the new way of allocating buffers. For the migration stuff you need to do a bit more, but basically, if you're only importing buffers, it's nothing more than filling in the structure. The exporter does all the heavy lifting, so as an exporter you have to implement the rest, and we hope to provide some useful helper functions for that, so the common path of migrating with the CPU will be easy for you as the exporter.

On what platform did we start? This here is an old i.MX53, Cortex-A8 class. It's interesting for platforms that are mixed systems, with some IPs that can only use contiguous memory and others that can do scatter-gather somehow. That is interesting, for example, for everything that has simple hardware decoders plus a GPU, where you want to push buffers back and forth between the hardware decoder and GPU memory.

The question was what kind of improvement we see. The improvement can be anything from nothing at all, if you have enough memory bandwidth available, the system is otherwise idle, and you don't need the CPU for anything else, because then the copying speed is enough to do 30 fps VGA or something, up to the difference between it working and not working at all, as soon as you increase your buffer size or want to use the CPU while playing video. As soon as you stall because your memory bandwidth is saturated, you will get, I don't know, something like 50 percent. Yeah, for the simple example.

Well, it depends. If you have a fast USB camera, a use case where you want to use all of the memory bandwidth of your CPU, and for some reason you want to get the frames out of the camera quickly, so you want low latency, then this is something that is really interesting. If you just want to play back high-resolution USB video, maybe you are better served with a USB camera that encodes by itself and just puts MJPEG or H.264 over USB, but then you have latency. So there are use cases for this, maybe not everywhere.

Actually, the other method of handling all of this is to do what the Android people are trying, which is a central memory allocator like ION, for example, where you have one general place that knows beforehand where your buffers have to come from. But we wanted to avoid this: for development it's just easier to plug everything together, and maybe it doesn't work as fast as it possibly could for the first few frames if you don't do everything right, but at least it works at all. So for us it's mostly interesting for bring-up, and if you then have a proper user space, and we hope to get patches into GStreamer eventually which can do this, it will be fast all the time. The thing is, the ION allocator on Android assumes that all this knowledge about DMA constraints and so on lives in user space, and we really want to keep this knowledge, with all its fuzzy bits, inside the kernel; it's just easier for the user-space developer to deal with a generic interface to the kernel and not have to worry. As we have shown, you have some possibilities to optimize things: if you pass the handles around before actually using them, you can delay the allocation and make it even faster, but basically it will just work if you plug things together like you do in GStreamer or other applications.

The same problem also exists for generic user space like the Weston compositor: we have a GPU doing the compositing and spitting out paged memory, because it has an IOMMU, so that is the memory that gets allocated and rendered into, and then we want to put it on the screen, and especially on our Freescale hardware the scanout engine can only work with contiguous memory. So you have to have a way to just pass the handles around and make it all work in the kernel somehow, and that is much easier than going into Weston and telling it: okay, we may want to put this buffer on the screen later, please tell the GPU driver to allocate it in contiguous memory, and then get the right memory out. So you basically don't need any knowledge in user space for this to work, and we think that is a real advantage over what is there currently.

Any more questions? Okay, then, well, thank you for your time.