Good afternoon, everyone. I'm Wendy Liang. I work as a Linux kernel developer at Cruise, a company that builds driverless cars. I have working experience with heterogeneous system communication and management, and I have worked with the Linux kernel remoteproc, RPMsg, and DMA-BUF frameworks. In today's presentation I'm going to talk about DMA-BUF usage in an automotive sensor data pipeline. Thanks, everyone, for joining today's presentation.

Here's today's agenda. I will briefly talk about an automotive sensor and camera data pipeline. Next, I will briefly introduce DMA-BUF. Then I'm going to talk about how to use DMA-BUF and DMA fences to set up a zero-copy data pipeline and to synchronize each stage of that pipeline.

Automotive sensor and camera data pipeline example. This image shows a driverless car with sensors installed on it. The raw data captured by the sensors and cameras is passed to an ISP for pre-processing, then passed on to accelerator pipelines for further processing, and the resulting data is then passed on to a host for the next stage of computation. There is a lot of data coming from the sensors, so processing that data efficiently, and transferring it through the pipeline efficiently, is a challenge. DMA-BUF is a good solution for efficient data transfer across the data pipeline.

DMA-BUF. DMA-BUF is a framework for sharing buffers for hardware DMA access across multiple device drivers and subsystems. It is also used for synchronizing asynchronous hardware access. It has three subcomponents. The core component is DMA-BUF itself, which defines the DMA-BUF operations for the underlying memory blocks. DMA heaps are an implementation of a DMA-BUF exporter and provide a user-space interface for allocating DMA-BUFs from a particular memory region. DMA fences are used for synchronization between hardware and user applications, and between hardware drivers.

A DMA-BUF is a shared DMA buffer. It has a buffer exporter, which users ask to allocate a DMA-BUF from some memory region, and each buffer has users; those users can be in kernel space or in user space. A DMA-BUF is a scatter-gather list of memory blocks, and the underlying memory can be contiguous or non-contiguous. Each DMA-BUF is represented by a file descriptor, so a user-space application can access the DMA-BUF with normal file operations. The DMA-BUF operations enable users to attach to the buffer and access the buffer memory.

Here is an example of how to use DMA-BUF. The user application uses the DMA heap allocation ioctl to request a DMA-BUF. It fills in the DMA heap allocation ioctl argument structure with the length, the file permission flags, and the heap memory attribute flags, and the underlying DMA heap allocates the corresponding memory and returns to the user a file descriptor for the allocated buffer. The user application can then use mmap to get a virtual address for the DMA-BUF so that it can later read or write the underlying DMA memory.
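To make that concrete, here is a minimal user-space sketch of the allocation path just described, assuming the standard system heap is exposed at /dev/dma_heap/system; a platform-specific heap would simply use a different device node, and error handling is reduced to exits for brevity.

/*
 * Allocate a DMA-BUF from a DMA heap and map it for CPU access.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/dma-heap.h>

int main(void)
{
        struct dma_heap_allocation_data alloc = {
                .len = 4096,                    /* buffer size in bytes */
                .fd_flags = O_RDWR | O_CLOEXEC, /* permissions of the returned fd */
                .heap_flags = 0,                /* heap memory attribute flags */
        };
        int heap_fd;
        void *vaddr;

        heap_fd = open("/dev/dma_heap/system", O_RDWR | O_CLOEXEC);
        if (heap_fd < 0) {
                perror("open dma heap");
                return EXIT_FAILURE;
        }

        /* Ask the heap to allocate memory and wrap it in a DMA-BUF. */
        if (ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc) < 0) {
                perror("DMA_HEAP_IOCTL_ALLOC");
                return EXIT_FAILURE;
        }

        /* alloc.fd is now the DMA-BUF file descriptor; mmap() it for CPU access. */
        vaddr = mmap(NULL, alloc.len, PROT_READ | PROT_WRITE, MAP_SHARED,
                     alloc.fd, 0);
        if (vaddr == MAP_FAILED) {
                perror("mmap");
                return EXIT_FAILURE;
        }

        /*
         * Produce some data. If devices also access the buffer, CPU access
         * should normally be bracketed with the DMA_BUF_IOCTL_SYNC ioctl.
         */
        memset(vaddr, 0, alloc.len);

        munmap(vaddr, alloc.len);
        close(alloc.fd);
        close(heap_fd);
        return 0;
}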
If the user needs to pass the DMA-BUF on to other devices, each of those devices' drivers needs to implement an ioctl that lets the user-space application attach the DMA-BUF to that device. If the driver implements such an ioctl, the application can use it to attach the DMA-BUF to the device. When the device driver receives such a request, it can call dma_buf_get() to obtain a dma_buf object from the DMA-BUF file descriptor, attach the device to the DMA-BUF with the dma_buf_attach() API, and get the scatter-gather list from the DMA-BUF with the dma_buf_map_attachment() API. With the scatter-gather list returned by this API, the driver knows the DMA addresses of the buffer segments, so it can program the hardware registers for the hardware to access the DMA memory. For synchronization, the device driver can create a DMA fence and return it to the user, so that the user can poll the fence, or the next device driver can wait on it, to learn the completion status of this device's work on the DMA-BUF.

Buffer exporter. A buffer exporter creates a DMA-BUF that wraps the DMA memory and returns the DMA-BUF file descriptor to the caller. It is responsible for the DMA memory allocation and for managing the coherency of the allocated memory for both device access and CPU access. The device-access side runs after a CPU application has produced data into the memory: before the device can access that memory, this operation does what is needed, for example flushing the cache, to make sure the correct data reaches the underlying memory so the device reads the right contents. CPU access is the other direction: after the device has produced data into the memory and before the CPU can access it, this operation performs the necessary steps, such as invalidating the cache if required, so the application or the driver gets the correct data from the underlying memory. The exporter also provides DMA-BUF memory-mapping functions for kernel and user-space users so they can access the memory, and it manages the backing storage for the DMA-BUF's attached devices, for example pinning and unpinning the DMA-BUF memory and notifying attached devices if the underlying physical memory needs to move.

Here is the list of DMA-BUF operations a buffer exporter needs to implement: attach, detach, map_dma_buf, unmap_dma_buf, release, mmap, vmap, and vunmap. Others, such as pin and unpin or begin_cpu_access and end_cpu_access, can be optional; whether they are needed depends on the memory attributes of the memory region, but the ones listed above are usually required.

In the kernel there is one example of a buffer exporter, udmabuf, which exports DMA-BUFs for shared-memory regions. The user application uses memfd_create() to get a memory file descriptor for the shared memory, and with the udmabuf ioctl it passes that memory file descriptor to the udmabuf driver, which creates a DMA-BUF for the specified region and returns it to the user application. One user of udmabuf is QEMU, which uses it to create DMA-BUFs, for example for virtio-GPU. There are other buffer exporters in the Linux kernel; you can find examples in the video pipeline driver frameworks.

Buffer users. A buffer user can be in kernel space or in user space. For user space, once the application has allocated a DMA-BUF from the Linux kernel, for example by requesting it with the DMA heap allocation ioctl, the application is already attached to the DMA-BUF; when it needs to access the memory for reads and writes, it calls mmap to get a virtual address. For Linux kernel drivers, in order to attach a device to the DMA-BUF, the driver needs to call the DMA-BUF attach operation, as shown on one of the previous slides.
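Pulling those kernel-side steps together, a minimal sketch of an importing driver's attach path might look like the following. The my_dev structure and my_attach_ioctl handler are hypothetical, and the exact map/unmap signatures and locking rules vary somewhat across kernel versions.

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>
#include <linux/scatterlist.h>

struct my_dev {
        struct device *dev;
        struct dma_buf *dmabuf;
        struct dma_buf_attachment *attach;
        struct sg_table *sgt;
};

static int my_attach_ioctl(struct my_dev *mdev, int dmabuf_fd)
{
        int ret;

        /* Turn the fd passed by user space into a dma_buf object. */
        mdev->dmabuf = dma_buf_get(dmabuf_fd);
        if (IS_ERR(mdev->dmabuf))
                return PTR_ERR(mdev->dmabuf);

        /* Attach this device to the buffer; this keeps the buffer alive. */
        mdev->attach = dma_buf_attach(mdev->dmabuf, mdev->dev);
        if (IS_ERR(mdev->attach)) {
                ret = PTR_ERR(mdev->attach);
                goto err_put;
        }

        /* Map the attachment to obtain the scatter-gather list of DMA addresses. */
        mdev->sgt = dma_buf_map_attachment(mdev->attach, DMA_BIDIRECTIONAL);
        if (IS_ERR(mdev->sgt)) {
                ret = PTR_ERR(mdev->sgt);
                goto err_detach;
        }

        /*
         * The driver can now program the accelerator registers with the DMA
         * address of each segment, e.g. sg_dma_address(mdev->sgt->sgl).
         * Teardown later calls dma_buf_unmap_attachment(), dma_buf_detach()
         * and dma_buf_put() in the reverse order.
         */
        return 0;

err_detach:
        dma_buf_detach(mdev->dmabuf, mdev->attach);
err_put:
        dma_buf_put(mdev->dmabuf);
        return ret;
}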
If the driver itself wants to access the underlying memory, it can use the vmap operation to access the DMA memory. The attach operation is mandatory for the driver to call, because the DMA-BUF maintains a reference counter: if the counter drops to zero, meaning there are no users, the DMA-BUF is released. So until the device has finished consuming the buffer, the driver is responsible for making sure the device stays attached to the DMA-BUF, and when the device is done with the buffer, the driver needs to call detach to detach the device from it.

So, as shown here, there is an exporter implemented in the Linux kernel that provides a user-facing ioctl API for allocating DMA-BUFs, and each device in the pipeline needs its driver to implement an ioctl API for the user to attach the DMA-BUF to that device.

DMA heaps. A DMA heap is a framework for memory allocators to export allocated memory as DMA-BUFs. It is a DMA-BUF exporter, and it also has a device-file interface so the user can use an ioctl to allocate DMA-BUFs. The Linux kernel implements two heaps: the system memory heap and the CMA heap. When the user requests a DMA-BUF from the system heap, the system heap driver allocates DMA memory from Linux system memory. When the user wants a DMA-BUF from the CMA heap, the underlying heap driver allocates the memory from the CMA pool. As shown here, you can specify a particular memory region with a reserved-memory node for the CMA heap; the region described by that reserved-memory node is used as a Linux CMA pool, and you then use the CMA heap allocation API to get a DMA-BUF whose memory comes from that specified CMA pool region.

Here is the user-space API to request a DMA-BUF: DMA_HEAP_IOCTL_ALLOC. When it returns, the DMA-BUF file descriptor is filled into the ioctl's argument structure. The length field is for the user to specify the size, the fd_flags field holds the permission flags of the DMA-BUF file, and the heap_flags field is for memory attributes.

Besides the DMA-BUF operations required of any DMA-BUF exporter, a DMA heap driver also needs to implement an allocate operation so the user can allocate a DMA-BUF from the heap. In this example, a reserved-memory node is used to specify a DMA heap region. To register the heap with the DMA heap framework, so that the user can use the DMA heap allocation ioctl on it, the driver needs to call the dma_heap_add() API.
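Here is a minimal sketch of that registration, assuming a recent kernel where the allocate op returns a struct dma_buf pointer; the exact prototype has changed across kernel versions. The heap name and my_heap_allocate are hypothetical, and a real allocator would carve memory out of the reserved region and wrap it with dma_buf_export().

#include <linux/dma-buf.h>
#include <linux/dma-heap.h>
#include <linux/err.h>
#include <linux/module.h>

static struct dma_buf *my_heap_allocate(struct dma_heap *heap,
                                        unsigned long len,
                                        u32 fd_flags, u64 heap_flags)
{
        /*
         * A real implementation allocates 'len' bytes from the platform's
         * reserved region, fills in a struct dma_buf_export_info and calls
         * dma_buf_export(). Returning an error keeps this sketch compilable.
         */
        return ERR_PTR(-ENOMEM);
}

static const struct dma_heap_ops my_heap_ops = {
        .allocate = my_heap_allocate,
};

static int __init my_heap_init(void)
{
        struct dma_heap_export_info exp_info = {
                .name = "my-region-heap", /* appears as /dev/dma_heap/my-region-heap */
                .ops = &my_heap_ops,
        };
        struct dma_heap *heap;

        /* Register the heap so user space can use DMA_HEAP_IOCTL_ALLOC on it. */
        heap = dma_heap_add(&exp_info);
        return IS_ERR(heap) ? PTR_ERR(heap) : 0;
}
module_init(my_heap_init);
MODULE_LICENSE("GPL");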
And use this DMAfans to initialize the fans. And if we want to create a sync file for the fans and we will call this sync file create and you'll get a file descriptor for the sync file. And for user application, user can hold on the DMAfans file descriptor to know if the fans is signaled. And for the Nina space kernel driver, can use a DMAfans to wait for the fans to be signaled. DMAfans arrays. So DMAfans array is an array of DMAfanses. The array itself is represented as a DMAfans and this array can be configured to be signaled if any fans in the array is signaled or if all the fans in the array are signaled. So it has two configuration. And DMAfanses chain. A DMAfanses chain is composed of a previous fans and a associated fans to the current chain node. It is signaled if all previous fans is in the chain and the associated fans are signaled. So here shows how to create a DMAfans in the pipeline. After the application allocates a buffer from the exporter, it will attach the DMAfans to the device with the device driver provided attachment API and then the device will create a fans and it will return back and create a fans and create a sync file for the fans and return the sync file back to the application. And then application will pass the DMAfans to the next device and also the previous sync file of this created by the previous device driver on this DMA buff. And the next device itself will create a DMAfans for its own DMA operation on this buffer and return back the sync file to the user application. And so the next device can call the DMAfans rate API on the DMAfans created by the previous device to know if the previous device is done with this buffer before it request its own device to work on this DMA buff. And the user application by pulling all the sync files it knows the completion stage or whether any of the accelerator in the pipeline has errors on this DMA buff. So we have talked about the DMA buff, DMA heap and DMA fences and now we can use them to create a data pipeline and then we can use them to create a data pipeline for zero data copying. User will allocate a buffer from a DMA heap. This can be a system heap, CMA heap or user defined platform specific DMA heap and it assign the DMA buff to each devices on the pipeline through each devices drivers iOcto API and each device driver will in queue the buffer or a list of buffers from user application to its hardware and for each buffer of DMA operation it creates DMA fences and their sync files for the buffer operation completion or failure notification. Here shows we used reserved memory to specify the DMA memory region. Usually in a system there are multiple memories and each memory can has different memory attributes and also each devices can have different latencies to access different memories and here we use reserved memory to specify a preferred memory region for this group of devices and we have implement a DMA heap for this memory region and the user will use the DMA heap allocation API to request a DMA buff and after it get a DMA buff from the DMA heap it call the device drivers defined attachment API to attach the DMA buff to the device. 
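A minimal sketch of the per-job fence creation and sync-file export each driver performs might look like this. The my_fence_ops and my_create_job_fence names are hypothetical; a real driver would typically embed the fence in a per-job structure, allocate its timeline context once, and increment the sequence number per job.

#include <linux/dma-fence.h>
#include <linux/fcntl.h>
#include <linux/file.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/sync_file.h>

static DEFINE_SPINLOCK(my_fence_lock);

static const char *my_fence_get_driver_name(struct dma_fence *fence)
{
        return "my-accel";
}

static const char *my_fence_get_timeline_name(struct dma_fence *fence)
{
        return "my-accel-dma";
}

static const struct dma_fence_ops my_fence_ops = {
        .get_driver_name = my_fence_get_driver_name,
        .get_timeline_name = my_fence_get_timeline_name,
};

/* Create a fence for one DMA job and return a sync-file fd for user space. */
static int my_create_job_fence(struct dma_fence **out_fence)
{
        struct dma_fence *fence;
        struct sync_file *sync;
        int fd;

        fence = kzalloc(sizeof(*fence), GFP_KERNEL);
        if (!fence)
                return -ENOMEM;

        /* One context per timeline; a real driver allocates the context once. */
        dma_fence_init(fence, &my_fence_ops, &my_fence_lock,
                       dma_fence_context_alloc(1), 1);

        /* Wrap the fence in a sync file so user space can poll it. */
        sync = sync_file_create(fence);
        if (!sync) {
                dma_fence_put(fence);
                return -ENOMEM;
        }

        fd = get_unused_fd_flags(O_CLOEXEC);
        if (fd < 0) {
                fput(sync->file);
                dma_fence_put(fence);
                return fd;
        }
        fd_install(fd, sync->file);

        /* The driver keeps its reference to signal (and then put) the fence later. */
        *out_fence = fence;
        return fd;      /* handed back to user space through the driver's ioctl */
}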
Here, reserved memory is used to specify the DMA memory region. Usually a system has multiple memories, each memory can have different attributes, and each device can have different latencies when accessing different memories, so we use a reserved-memory node to specify a preferred memory region for this group of devices, and we have implemented a DMA heap for this memory region. The user uses the DMA heap allocation API to request a DMA-BUF, and after it gets the DMA-BUF from the heap, it calls the attachment ioctl defined by each device's driver to attach the DMA-BUF to that device.

For this device, the underlying DMA hardware requires a DMA buffer descriptor for each DMA-BUF, and those buffer descriptors also need to be stored in memory, so the device driver allocates them; this allocation just uses the normal dma_alloc_coherent() API to allocate the memory for the buffer descriptor of this buffer. The device driver also creates fences for DMA operation completion and errors, and returns the sync file for each DMA fence back to the user application through the ioctl's return arguments. The user application then attaches the buffer, along with the sync file created by the previous driver, to the next device driver. The next device driver creates a DMA fence, and then a sync file, for its own operation on this DMA-BUF, so that the user application can poll all the sync files returned by the device drivers, and the next device driver can wait on the DMA fence created by the previous driver to know the previous driver is done with the buffer before it starts its own DMA operation on it.

This diagram shows how we use DMA fences to notify the user of buffer completion or errors. After the device driver gets the DMA-BUF request from the user, when the user calls the ioctl syscall, it gets the DMA address from the DMA-BUF and configures the accelerator's registers so the accelerator can start the DMA operation. When the accelerator finishes the DMA operation successfully, it raises an interrupt back to the driver; the driver checks the interrupt status, learns that the buffer's DMA operation is complete, and signals the DMA fence. Because the user application is using the Linux poll() API to wait on the DMA fences, the poll() call wakes up, the application checks each file descriptor in the polling array, and it knows this buffer is complete. The next device driver also waits for the signal on the buffer-completion DMA fence, so it knows when the previous accelerator has finished with the buffer and it can start its own DMA operation. When an error happens, the accelerator also raises an interrupt; the device driver checks the error status or the interrupt status register, learns there is an error, and signals the error fence. The application, because it also polls on the error sync file, knows an error has happened, and the next device driver, which also waits on the error signal of the previous DMA operation, knows there was an error in the previous stage, so it knows it should not go ahead with this buffer and can detach from it.
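As a rough sketch of that interrupt path: the register offsets, status bits, and the my_accel structure below are assumptions for illustration, but the fence calls are the standard kernel APIs described above.

#include <linux/bits.h>
#include <linux/dma-fence.h>
#include <linux/interrupt.h>
#include <linux/io.h>

#define MY_STATUS_REG   0x10            /* hypothetical status register offset */
#define MY_STATUS_DONE  BIT(0)
#define MY_STATUS_ERROR BIT(1)

struct my_accel {
        void __iomem *regs;
        struct dma_fence *done_fence;   /* signaled on successful completion */
        struct dma_fence *error_fence;  /* signaled when the DMA operation fails */
};

static irqreturn_t my_dma_irq(int irq, void *data)
{
        struct my_accel *acc = data;
        u32 status = readl(acc->regs + MY_STATUS_REG);

        if (status & MY_STATUS_DONE) {
                /* Wakes every sync-file poller and every dma_fence_wait() caller. */
                dma_fence_signal(acc->done_fence);
        } else if (status & MY_STATUS_ERROR) {
                /* Record the error on the fence before signaling it. */
                dma_fence_set_error(acc->error_fence, -EIO);
                dma_fence_signal(acc->error_fence);
        }

        return IRQ_HANDLED;
}

The next driver in the pipeline would call dma_fence_wait() on the previous stage's completion fence before programming its own DMA, while user space simply sees POLLIN on the corresponding sync files.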
Here is the full data pipeline with DMA-BUF. We can see the data coming from the sensor over the MIPI bus; it goes to the ISP and then through the accelerators before it reaches the host compute. On the embedded Linux system, the user application allocates DMA buffers from the DMA heap of the specified DMA region, where the region is described by a reserved-memory node, and then it passes the DMA buffers to each of the accelerators. Usually there is an input buffer, and each accelerator also requires an output buffer, so it takes an input and an output, and the application assigns the buffers to each of the accelerators on the pipeline. Each driver, after it gets the user's request, creates DMA fences for those buffers: the fences cover the DMA-BUF operation's completion, and there should also be a fence to capture errors, and these are returned back to the user. Each of the device drivers waits on the DMA fence created by the previous stage, so it will not start its DMA operation until the previous one is done, and it also creates the fences for its own DMA operation and gives them back to the user.

So once the pipeline kicks off and data starts coming from the sensor, there is no user-space software intervention: the data comes to the first IP, the driver monitors the first IP's completion status, and when it is done it signals the fence; the next driver gets the notification and triggers the next DMA operation, and so on. No user application intervention is required. Of course, the reason we need DMA-BUF between the IPs here is that there is no streaming hardware connection between them. For some customized IPs it is possible to use a streaming interface, for example an AXI stream, and then you don't need buffers between the accelerators. However, there are limitations to that approach: not every hardware IP can do it, and if we use some third-party IPs, they may simply not have such a stream interface. In that case, DMA-BUF provides a solution where no user-application intervention is needed to wait for the previous data processing to finish before triggering the next stage; it is all done automatically from kernel space, which is quite efficient. The pipeline can also be set up ahead of time, so at run time the user application does not need to be involved, except to monitor the completion status and error status. You can find more details in the DMA-BUF API description in the Linux kernel documentation.

That's all for today. Thanks for joining the presentation, thank you very much, and have a good day.