Hello everybody. Thanks for joining this session. I hope you and your loved ones are doing well. My name is Ajay Kumar. Today, my colleague Smitha and I will be taking you through ARM64 cache coherency from a hardware and a software perspective. In this session, we will be covering the following topics. We will start with the cache basics: what it means to have a cache hit, a cache miss and a cache line fill. Then we will discuss the cache operations, which are cache cleaning and cache invalidation. Then we will discuss the problem of cache coherency in a multi-master system. We will discuss a little bit about paging, memory translations and memory attributes. We will move on to memory shareability and the shareability domains. And then we will discuss how to achieve coherency in a Linux system with three methods: non-cached memory, hardware snooping and software cache maintenance. Then we will discuss in detail the DMA APIs provided by Linux, which come in two parts, the consistent API and the streaming API. We will also discuss the address mappings: the user side, the kernel side and the device side mappings. We will end the session by combining all these concepts in an Exynos MFC use case. Let's start with the cache basics. The processor speed is many times faster than the memory speed. As the saying goes, the herd is only as fast as the slowest sheep: if the memory is slow, the processor execution becomes slow. So fetching data every time from memory hinders the processor's performance. That is why a cache was introduced, which is a small and fast block of memory compared to the main memory and sits between the CPU and the main memory. This cache holds a subset of memory items from the main memory. Accesses to the cache occur significantly faster compared to accesses to the main memory. This slide shows the data inside the main memory and the cache at some random point in time. As you can see, not all the addresses are cached; only a few of the addresses are, and each cached entry is a cache line size of data. Against each cache line of data we do not store the complete address, but just the tag part of it. Cache line fill: assume the processor was trying to read from the address 0xC0000000 and this particular address was not in the cache yet. So it picks up a cache line size of data from the main memory and stores it in the cache alongside a tag address entry 0xC000. Any subsequent fetch from this address, or from any address which falls within this cache line of data, is served from the cache itself. Cache hit and cache miss: subsequent fetches of data from addresses falling in the same cache line can be served from the cache; this scenario is called a cache hit. If the requested address is not found in the cache, that is called a cache miss, in which case a cache line size of data will be fetched from the main memory and stored in the cache with a new tag address. The hardware block of the CPU which automatically does this for us is called the cache controller. Cache organization: till now we have seen examples where the tag address was the upper 16 bits, but that was just for explanation. You can see from this slide that in this processor's design the tag address was chosen to be 30 bits and the set index 8 bits, so there will be 256 lines. Along with the tag address and the cache line of data there is also a valid bit and a dirty bit. We will be discussing the valid bit and the dirty bit in the next few slides on this processor's cache design.
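To make the tag, set index and offset split concrete, here is a small user-space sketch. It assumes a purely hypothetical cache with 64-byte lines and 256 sets; the sizes and the decode() helper are illustrative and not taken from the slides or from any particular core.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical cache geometry: 64-byte lines (6 offset bits) and
 * 256 sets (8 index bits). Real cores publish their own geometry. */
#define LINE_SIZE 64u
#define NUM_SETS  256u

static void decode(uint64_t addr)
{
    uint64_t offset = addr % LINE_SIZE;              /* byte within the cache line */
    uint64_t index  = (addr / LINE_SIZE) % NUM_SETS; /* which set (cache line) */
    uint64_t tag    = addr / (LINE_SIZE * NUM_SETS); /* stored next to the data */

    printf("addr 0x%09llx -> tag 0x%llx, set %3llu, offset %2llu\n",
           (unsigned long long)addr, (unsigned long long)tag,
           (unsigned long long)index, (unsigned long long)offset);
}

int main(void)
{
    decode(0xC0000000); /* first access: a miss, so a whole line is filled */
    decode(0xC0000004); /* same line: tag and set match, so this is a hit */
    return 0;
}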
The main thing which matters to the programmer is the cache line size, because any data structures sized and aligned to the cache line size will get the benefit of the cache (see the short alignment sketch after this paragraph). The dirty bit indicates whether the data in the cache line has been modified by the CPU since the original fetch or the last clean operation. In such a case, the data is not coherent with the data in memory, and the corresponding cache line has to be written back to main memory. This process of writing the cache line back to main memory is called a cache clean or cache flush operation. After the cache clean operation, the dirty bit becomes 0, and the cache clean operation can be done only for an entire cache line. In fact, any transaction between the cache and the main memory happens at the granularity of the cache line size. Similar to the dirty bit, we have a valid bit which indicates whether the cache line inside the cache is valid. If V is equal to 1, then that is called a valid cache line. For a cache hit to happen, the tag should be present in the cache and the valid bit should also be 1. There can be cases where the cache line becomes invalid because some other master wrote that cache line of data in main memory. Any access from the CPU to that particular cache line would then fetch stale data from the cache. So that particular cache line must be invalidated by making this valid bit 0, so that the CPU does not fetch from this cache line but instead goes to memory and fetches the latest data. This operation is called cache invalidation. Like the cache clean operation, the cache invalidation operation also happens for an entire cache line. So till now we have seen the philosophy behind caching, the cache organization inside a CPU, and the basic cache operations like cache cleaning and cache invalidation. In general, the cache clean operation propagates changes from the CPU cache to the memory domain, and the cache invalidation operation propagates changes from the memory domain to the CPU cache. Till now we have discussed the aspects of caching pertaining to a single CPU and the main memory, but in a real modern SoC there will be multiple clusters having multiple CPUs and multiple caches, and also multiple bus masters which are transferring data to the same shared memory. So the coherency problem arises in a multi-master system. Coherency means ensuring that all the processors or device bus masters have the same view of shared memory: any changes to data held in one of the masters should be made visible to the other masters. There are three ways to achieve coherency. One is by simply not caching the shared memory locations. Another is software-managed coherency, where the data is cached but the device drivers must take care of cleaning the dirty data or invalidating the stale data from the respective caches. The last and most efficient method is hardware-managed coherency, for which snooping has to be supported in hardware and has to be enabled in software. Before going into how each of the coherency methods is implemented, we should have an understanding of the memory attributes. Different memory regions can be handled differently in an ARM64 Linux system. For example, you can have a few memory regions which are read only, like the text segment of a program. There might be a few memory regions which can be cached; these are the most frequently used memory regions. There might be a few memory regions which should not be cached; for example, device registers are something you are not supposed to cache, to prevent any side effects while accessing them.
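As a quick illustration of the cache line size point made at the start of this passage (and of why dirty lines bouncing between CPUs are expensive), here is a hedged user-space sketch; the 64-byte line size, the struct and its field names are assumptions for illustration. In kernel code the equivalent helpers are L1_CACHE_BYTES and ____cacheline_aligned.

#include <stdint.h>

/* Assumed cache line size, for illustration only. */
#define CACHE_LINE_SIZE 64

/* Two counters updated by two different CPUs. Padding and aligning each
 * one to a full cache line keeps them in separate lines, so an update by
 * one CPU does not keep dirtying the line that holds the other CPU's
 * counter (which would force extra clean/invalidate traffic). */
struct per_cpu_stats {
    uint64_t packets;
    uint8_t  pad[CACHE_LINE_SIZE - sizeof(uint64_t)];
} __attribute__((aligned(CACHE_LINE_SIZE)));

struct per_cpu_stats stats[2]; /* one entry per CPU, each on its own cache line */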
And there might be a few memory regions which are to be accessed by the kernel only. Defining such memory regions is possible because of memory attributes. The smallest granularity of memory region for which you can define memory attributes like the above is called a page. The memory attributes for a page are present as part of the page table entry in the page tables. There are two memory attributes which determine the various properties that define the caching of a block of memory: the memory type and the shareability. Since we talked about paging, we will see how paging and virtual memory are handled in an MMU. We have taken the CPU MMU as an example; the concept remains the same for any IOMMU as well. The processor tries to fetch some instruction or data, so the virtual address is sent to the MMU. The MMU has a TLB, which is nothing but a cache for frequently used memory translations. This TLB might already have the virtual-address-to-physical-address translation inside it; if that is the case, the translation is picked up from the TLB. If not, the table walk unit performs a table walk to fetch the required virtual-to-physical address translation from either the cache or the main memory, because page tables, like any other data, can also be cached; that is why this path is available. Assuming the MMU picked the required physical address from either the TLB, or from the memory or cache using the table walk unit, the physical address is sent on to the next step along with a set of memory attributes which were also picked up from the page table entry. Now, these memory attributes determine whether the transaction which came via the MMU has to be served from the cache or from the main memory. Let's talk about the page table entry structure in ARM64 Linux. The PTE is a 64-bit structure with a few of the lower bits defining the memory type. Bits 2, 3 and 4, which form the attribute index, define the memory type for the requested address (a small decode sketch follows this paragraph). There are various memory types, but we will discuss the three main types which we mostly use in Linux systems. The first one is memory type normal, MT_NORMAL. MT_NORMAL defines weakly ordered and cacheable memory. When we say weakly ordered, that means processor optimizations can be applied for this particular block of memory, and this memory can also be cached. This is the most widely used memory type, used for all user applications and also for most of the kernel data structures. The next memory type is memory type normal non-cacheable, MT_NORMAL_NC. Since it is normal memory, it is weakly ordered, so processor optimizations for the fetch can be applied to it, but this particular block of memory will not be cached. This memory type is used in cases where we do not cache the data, to prevent coherency problems. Another type of memory is device memory, which is strongly ordered, that is, no processor optimization of memory fetches is allowed for this particular block of memory, and this memory is also not cached. This memory type is usually applied to device register accesses. This slide shows an example of a file where these memory types are defined in Linux. So the memory type defines whether the current data access for the address should happen from the cache, or bypass it and happen from memory. Now let's talk a bit about the cache hierarchy in a complex modern SoC. Internally there will be many caches, like L0, L1 and L2. The snoop control unit takes care of managing coherency at the L2 level between the different cores.
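To make the attribute index bits concrete, here is a small user-space sketch that extracts bits 2 to 4 from a raw PTE value. The MT_* index numbers below follow the assignment used around the Linux 5.4 arm64 headers (later kernels renumber them), so treat them as an assumption; the PTE value used is just an example.

#include <stdint.h>
#include <stdio.h>

/* The attribute index is bits [4:2] of an ARM64 page table entry.
 * Index-to-type mapping below assumes the Linux 5.4 arm64 values
 * (MT_NORMAL_NC = 3, MT_NORMAL = 4). */
#define PTE_ATTRINDX(pte) (((pte) >> 2) & 0x7)

int main(void)
{
    uint64_t pte = 0x713; /* example low bits of a PTE */

    switch (PTE_ATTRINDX(pte)) {
    case 4:  printf("MT_NORMAL: normal cacheable\n");        break;
    case 3:  printf("MT_NORMAL_NC: normal non-cacheable\n"); break;
    default: printf("device or other memory type\n");        break;
    }
    return 0;
}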
So, when we come out of a cluster, there comes the cache coherent interconnect, which takes care of managing coherency between different clusters or bus masters. For ease of reference, in the further slides we will assume we are talking about a cluster like this, that is, the blue box, for any bus master reference. Snooping is a technique where the caches of one master are snooped by the other clusters to pick up the data copy for the requested address. So, if an address requested from this cluster is cached in this other cluster, and that cluster has the most recent copy of the data, then that cluster is supposed to reply with a snoop response saying, I have the particular data you requested here. There is also a term called the shareability domain, and this domain refers to the set of bus masters which are snooped for coherent transactions. As you can see, the whole box is a domain, and an address fetch from any of the masters can snoop into any other master to fetch the data, since they belong to the same domain. Typical Linux system usage is that all the masters under the same operating system, that is Linux, should be in the inner shareable domain. In the previous slide we discussed the shareability domain, which defines the set of bus masters in the system that can share a particular block of memory and can be snooped over the cache coherent interconnect. The shareability bits are defined in the page table attributes: bits 8 and 9 define the shareability type (a small decode sketch follows this paragraph). The main shareability types are non-shareable, inner shareable and outer shareable. Non-shareable means that even though the page has the cacheable attribute, it does not have a shareable attribute; that means this particular page can't be snooped from other clusters. Inner shareable means this particular page can be shared by the clusters or DMA bus masters which belong to the inner shareable domain. Outer shareable means the requested page can be shared by the inner shareable bus masters or also by the bus masters in the outer shareable domain. There is another shareability type called system shareable, which we have not covered here. Also, Linux assumes that the CPU clusters and the DMA bus masters are in the same inner shareable domain: you can see that the PTE_SHARED attribute, which applies to almost all the memory types, has the value 3, which means inner shareable. So, let's see how to implement coherency. The first method is coherency by not caching. When a region of memory is made non-cacheable, its addresses cannot be cached in any of the caches, and all the transactions from either the clusters or the DMA bus masters go to main memory. Since there is no intermediate storage of the data, the data is always coherent across the clusters and the DMA bus masters. This way of achieving coherency keeps software development very simple, but at the cost of performance: since the data is not cached, every transaction has to go to memory, and as we discussed in the earlier slide, if the access happens from main memory it makes the transaction slow and affects the CPU performance. Another advantage of this approach is that no special hardware interface is needed to manage coherency this way. The next coherency method is hardware coherency, which is possible via snooping. ARM provides ACE-compliant interfaces like the Cache Coherent Interconnect, which handles hardware coherency between bus masters and hence enables hardware coherency in an SMP operating system.
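Similarly, the shareability field can be read out of the same PTE, bits 8 and 9. The sketch below uses the architectural SH encodings (0b00 non-shareable, 0b10 outer shareable, 0b11 inner shareable), and the sample PTE value is only illustrative.

#include <stdint.h>
#include <stdio.h>

/* SH is bits [9:8] of an ARM64 page table entry. */
#define PTE_SH(pte) (((pte) >> 8) & 0x3)

int main(void)
{
    uint64_t pte = 0x713; /* example: SH = 0b11 here */

    switch (PTE_SH(pte)) {
    case 0:  printf("non-shareable\n");                         break;
    case 2:  printf("outer shareable\n");                       break;
    case 3:  printf("inner shareable (PTE_SHARED in Linux)\n"); break;
    default: printf("reserved encoding\n");                     break;
    }
    return 0;
}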
Any shared access to memory from one master can snoop into the caches of the other masters to see if the data is there, or whether it must be loaded from external memory. The masters to be snooped are decided based on the shareability domain attribute of the transaction. The ACE protocol uses a MOESI-style state machine internally to maintain hardware coherency. Achieving coherency this way keeps software development simple, but it comes at the price of hardware cost: hardware interfaces like ACE or ACE-Lite compliant masters and a CCI are needed to achieve this. Next, we move on to the most legacy way of managing cache coherency, that is, coherency by cache maintenance in software. With software-managed cache coherency, calls to cache maintenance operations need to be placed at the right places inside the device drivers to prevent any master from accessing stale or non-coherent data. So what happens is, after the CPU first fills the buffer, which is the DMA buffer, and before the DMA master accesses the same, the CPU does a cache flush operation; after the cache flush operation the data is fresh in memory and can be accessed by the DMA master. After the DMA master has accessed the data in main memory, the CPU cache, if it cached this particular address prior to the DMA transaction, has to be invalidated for that address, so that any subsequent fetch from the CPU sees the fresh data which was filled into memory by the device DMA. Software-managed coherency is relatively complex to implement, because it needs a lot of analysis inside the device drivers, but the advantage is that there is no special hardware requirement. Till now we have discussed caching and coherency from a lower-level, or hardware, perspective. From now on we will be discussing the same at a much higher level, from a software perspective. Linux provides a set of DMA APIs for allocating, mapping and unmapping memory used for DMA buffers. There are two types of DMA API: consistent and streaming. With the consistent API we allocate buffers which are readily coherent; dma_alloc_coherent and dma_free_coherent are examples of consistent APIs (a minimal usage sketch follows this paragraph). If a driver uses consistent mappings, it need not worry about caching side effects or cache maintenance in the driver. The other type of DMA API is the streaming DMA API. It comes in two parts: one is the mapping APIs, and the other is the DMA sync calls. The mapping APIs are used to map buffers into DMA space. Assume we have user space buffers which were created using malloc calls and they need to be mapped onto the device space for device usage. Such pages are mapped onto the device space using the streaming mapping APIs like dma_map_single, dma_unmap_single, dma_map_sg, dma_unmap_sg, dma_map_page, dma_unmap_page, etc. These buffers also need to be managed inside the driver to prevent any caching side effects, so we have the second part of the streaming API, which are the DMA sync calls: dma_sync_single_for_cpu, dma_sync_single_for_device, dma_sync_sg_for_cpu, dma_sync_sg_for_device. These are the calls used for cache flush or cache invalidation at the respective points inside the driver. Let's have a look at the different address spaces we will be dealing with while handling the DMA APIs. An actual DMA buffer in main memory can have three different virtual addresses, based on which master is accessing the DMA buffer. A CPU master accessing the buffer at EL1, that is, in kernel mode, will use a kernel virtual address to access the DMA buffer.
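Here is a minimal sketch of the consistent API just mentioned, assuming a hypothetical driver with a struct device pointer and a 4 KiB buffer; the example_alloc/example_free functions and the globals are illustrative, while dma_alloc_coherent and dma_free_coherent are the real kernel APIs.

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/sizes.h>

static void *buf_cpu;      /* kernel virtual address used by the CPU */
static dma_addr_t buf_dma; /* device address programmed into the hardware */

static int example_alloc(struct device *dev)
{
    /* Coherent allocation: no explicit cache maintenance is needed
     * for this buffer in the driver. */
    buf_cpu = dma_alloc_coherent(dev, SZ_4K, &buf_dma, GFP_KERNEL);
    if (!buf_cpu)
        return -ENOMEM;
    return 0;
}

static void example_free(struct device *dev)
{
    dma_free_coherent(dev, SZ_4K, buf_cpu, buf_dma);
}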
The same CPU master accessing the buffer from a user application at EL0 will use a user virtual address (for example, one obtained via mmap) to access the buffer. Also, when the DMA buffer is being accessed by the DMA master, it has a device virtual address, which the IOMMU allocates from the IOVA space. These three virtual addresses can also have different memory attributes for the same actual DMA buffer: for example, when the CPU is accessing the buffer it can have the cached attribute for this buffer, and the device, when accessing the buffer, can have the uncached attribute for the same buffer. In the next few slides I will be talking about my encounters with a few of the DMA APIs. To explain the same, I will take the example of the V4L2 and VB2 layers, with the queue memory type as DMA-contig. When we create a buffer with the VB2 memory type VB2_MEMORY_MMAP, it calls dma_alloc_coherent. dma_alloc_coherent takes the parameters dev, the device pointer; size, the size of the buffer which needs to be allocated; a handle, the dma_handle into which the device virtual address is filled; and a few kernel flags. After dma_alloc_coherent successfully allocates the pages for the buffer, it prepares a kernel virtual address and returns it as the CPU virtual address; the device virtual address is returned in the dma_handle. Since the buffer was created by the driver, and the information about the pages and also the device for which it was created is available in the driver, the subsequent mmap call will use that information to create a user virtual address for the same buffer. If the driver is not creating the buffer by itself, but is instead importing the buffer, as in the VB2_MEMORY_USERPTR or VB2_MEMORY_DMABUF cases, then we will not use the consistent API; we will use the streaming API to map the buffer pages onto the device address space. In that case we use dma_map_sg_attrs. The DMA sync for device calls, as the name suggests, are for syncing the DMA buffer before passing it on to the device DMA master: the CPU cache has accessed the buffer in memory and now it is the turn of the DMA master, so before handing over control to the DMA master we call dma_sync_sg_for_device or dma_sync_single_for_device, based on whether the actual DMA buffer is scattered in memory or is a single contiguous chunk of memory (a minimal sketch of this map-and-sync sequence follows this paragraph). Based on whether your DMA master only reads from the DMA buffer or also writes to it, different operations happen in the implementation of the DMA sync calls: if the direction is DMA_TO_DEVICE, meaning the buffer is only being read by the device, then only a CPU cache flush operation happens; if the direction is DMA_FROM_DEVICE, then a CPU cache flush and also an invalidate operation happen. The DMA sync for CPU calls, as the name suggests, are for syncing the DMA buffer for CPU usage: based on the DMA direction flag, a cache invalidation operation is done on the CPU caches if the DMA master has written into the DMA buffer. These also come in two variants, the SG and the single variant. This slide shows a few important files in the Linux tree which we are going to use to enable coherency. If we choose to add a device into the hardware coherent domain, we mention dma-coherent in its device tree node. Once the device tree contains the entry for dma-coherent, it is read in of/device.c, where coherent = of_dma_is_coherent() is evaluated; based on this flag, the device's dma_coherent flag is set.
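To tie the streaming mapping and sync calls together, here is a hedged sketch of one transfer where the device only reads the buffer (DMA_TO_DEVICE). It assumes a hypothetical driver that has already built a scatter-gather table for the imported pages; example_transfer and the variable names are illustrative, while dma_map_sg, the dma_sync_sg_* calls and dma_unmap_sg are the real kernel APIs.

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/scatterlist.h>

static int example_transfer(struct device *dev, struct sg_table *sgt)
{
    int nents;

    /* Map the pages into the device (IOVA) address space. The mapping
     * itself already makes the buffer device-visible once. */
    nents = dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE);
    if (!nents)
        return -EIO;

    /* If the CPU writes the buffer again after mapping (as in a vb2
     * prepare callback), clean the CPU caches so the device reads the
     * latest data from memory. */
    dma_sync_sg_for_device(dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE);

    /* ... program the DMA master and wait for completion here ... */

    /* Hand the buffer back to the CPU. For DMA_FROM_DEVICE this is the
     * point where stale CPU cache lines would be invalidated. */
    dma_sync_sg_for_cpu(dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE);

    dma_unmap_sg(dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE);
    return 0;
}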
So when the memory allocation happens in the consistent API, the call to dev_is_dma_coherent() picks up this property and sets the corresponding memory attributes for the CPU side and device side pages. The CPU side pages get their memory attributes via a call to dma_pgprot(). As you can see, dma_pgprot() is called with the attribute PAGE_KERNEL. PAGE_KERNEL is nothing but the normal memory type: normal cacheable memory with the inner shareable attribute set. In the implementation of dma_pgprot() you can see that, based on dev_is_dma_coherent(), it returns the cacheable property, which is PAGE_KERNEL, if the device is coherent; otherwise it returns pgprot_dmacoherent(), which, as you can see, defines the memory attribute as MT_NORMAL_NC, normal non-cacheable (a simplified sketch of this selection logic follows this paragraph). So this slide explains the CPU mapping: it is either PAGE_KERNEL in case dma-coherent was present, or normal non-cacheable memory in case dma-coherent was not present. This slide talks about how the device side mapping is calculated. Assuming dma-coherent was provided in the device tree, of/device.c fixes up the flag and the same is set in the device's dma_coherent flag. In dma-iommu.c, while allocating the IOVA space, the memory type is taken from prot = dma_info_to_prot() with the coherent flag, and based on whether this coherent flag was set, the IOMMU_CACHE flag is set. This IOMMU_CACHE flag is then referred to in io-pgtable-arm.c, where you can see in an if statement that, if it is enabled, the device side page table attributes are made cacheable. If this flag is missing, you can see the default value will be 0, which corresponds to a non-cacheable entry. So till here we discussed the software and hardware perspectives of caching, coherency and the memory attributes. Now it is time for us to go into an actual device use case, with some real-life examples, to discuss these things; my colleague Smitha will help in explaining the same. Thank you. Thank you, Ajay. Hello everyone. I'm Smitha from Samsung and I work in the Foundry Software Group. I will be taking the MFC driver as an example to explain the effects of these coherency attributes in the next few slides. Firstly, let me brief you all about the MFC IP operation so that we can correlate the concepts better. Exynos MFC is a Samsung IP used for video encoding and decoding; it expands to Multi-Format Codec. It is used to encode and decode various codecs like HEVC, which is also known as H.265, H.264, MPEG4, VP8, etc. It has the capability of handling multiple streams at once. MFC is mainly used in the video playback pipeline, that is the decoder functionality, or the camera pipeline, that is the encoder functionality. MFC is basically an M2M device, that is, a memory-to-memory device. It has the properties of both a capture and an output node. The output node property means it is sending frames from memory to the MFC hardware; the capture node property means it is receiving the processed frames from the MFC hardware into memory. In simple terms, both the source of the data and the result of the processing are in memory. The input side is called the output plane and the output side is called the capture plane. We provide buffers to MFC via the QBUF ioctl in the VB2 framework: on the output plane side we give the filled input data, and on the capture plane side we give the empty destination buffers. Finally, we call the STREAMON ioctl for both the capture and output planes to start the codec.
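Before continuing with the MFC flow, here is a simplified paraphrase of the attribute selection Ajay described above. This is a sketch written from memory, not verbatim kernel code: the real dma_pgprot(), pgprot_dmacoherent(), dev_is_dma_coherent() and dma_info_to_prot() take extra parameters, and their exact behaviour and header locations vary across kernel versions.

#include <linux/dma-direction.h>
#include <linux/dma-mapping.h>
#include <linux/dma-noncoherent.h> /* dev_is_dma_coherent(); header varies by version */
#include <linux/iommu.h>           /* IOMMU_READ / IOMMU_WRITE / IOMMU_CACHE */
#include <linux/mm.h>

/* CPU side: keep PAGE_KERNEL (normal cacheable, inner shareable) for a
 * coherent device, otherwise downgrade the kernel mapping to normal
 * non-cacheable (MT_NORMAL_NC on arm64). */
static pgprot_t example_dma_pgprot(struct device *dev, pgprot_t prot)
{
    if (dev_is_dma_coherent(dev))
        return prot;                 /* e.g. PAGE_KERNEL */
    return pgprot_dmacoherent(prot); /* normal non-cacheable */
}

/* Device (IOMMU) side: only a coherent device gets IOMMU_CACHE, which
 * io-pgtable-arm.c then turns into cacheable IOMMU page table attributes. */
static int example_dma_info_to_prot(enum dma_data_direction dir, bool coherent)
{
    int prot = coherent ? IOMMU_CACHE : 0;

    switch (dir) {
    case DMA_BIDIRECTIONAL: return prot | IOMMU_READ | IOMMU_WRITE;
    case DMA_TO_DEVICE:     return prot | IOMMU_READ;
    case DMA_FROM_DEVICE:   return prot | IOMMU_WRITE;
    default:                return 0;
    }
}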
Using the DQBUF calls we give the filled output buffers and the used input buffers back to user space. This gives an overview of the complete stack, from the application buffers to the kernel to the MFC hardware. The application talks to the MFC hardware via the V4L2 API; this V4L2 stack is responsible for sending input buffers to MFC and giving the processed data back to the application. The memory allocation for the buffers can be done via MMAP, USERPTR or DMABUF with the help of the VB2 framework: MMAP is when the kernel allocates the memory, USERPTR is when the application allocates the memory, and DMABUF is when another driver allocates the memory and shares the buffer with our current driver. Frames are written to the device as if it were an ordinary video output device, with all the appropriate configuration done to describe the format of those frames. The driver takes the frames written to the device, runs them through the processing engine and then makes them available for reading, as if it were an ordinary video capture device. This is how an M2M device works. The user space fills the input buffers with raw data for MFC processing, MFC fills the destination buffers post encoding or post decoding, and the output buffer with the processed data is passed back to user space. This is the case of the consistent API where there is no hardware coherency, using the MFC IP. Consistent memory, as previously discussed, is memory for which a write by either the device or the processor can immediately be read by another processor or device, without having to worry about caching effects, since the kernel takes care of this internally. In order to disable hardware coherency, we remove the property dma-coherent from the DT node. We are using the MMAP way of allocation in this use case, hence the DMA buffers are created in the driver itself. For buffer allocation we call vb2_dc_alloc, which in turn calls dma_alloc_coherent. dma_alloc_coherent is a consistent DMA API which allocates a region of size bytes of consistent memory; this routine will allocate RAM for that region, so it acts similar to __get_free_pages. For MFC, dma-coherent is not defined in the DT node, and in this use case the dma_sync calls are not called in the prepare or finish functions in the VB2 framework. As you can see in the slide, the kernel side page table entry for the buffer in this experiment had the value 70F hex, where bits 2 to 4, the index, indicate the memory type: it was set to normal non-cacheable. For the user side page table entry for the buffer we got the value FCF hex, which again indicates the normal non-cacheable memory type. The device side attribute, that is, the MFC IOMMU page table entry, was F43 hex, which also indicated the non-cacheable memory type (a short decode of the CPU side values follows this paragraph). Hence we could see the effect of the DT property in the page table entries at different levels. All the #defines referred to in this presentation are as per the latest kernel code base. Considering next the consistent API with hardware coherency using the MFC IP, we will be adding the dma-coherent flag in the DT node, which turns on the hardware coherency. The VB2 memory type we are using is again MMAP, where the DMA buffer is created in the driver. vb2_dc_alloc calls the consistent DMA API dma_alloc_coherent, which returns a pointer to the allocated region in the processor's virtual address space. It also returns a DMA handle, which is given to the device as the DMA address base of the region.
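As a side note, the CPU side values quoted above for the non-coherent use case can be cross-checked against the attribute index sketch shown earlier. Assuming the Linux 5.4 arm64 index assignment (MT_NORMAL_NC = 3, MT_NORMAL = 4), the hedged snippet below just applies the same bit extraction to 70F hex and FCF hex; the device side IOMMU entries use a different attribute encoding and are not decoded here.

#include <stdint.h>
#include <stdio.h>

/* Bits [4:2] of the CPU side PTE, interpreted with the Linux 5.4 arm64
 * MT_* numbering (an assumption; later kernels renumber these). */
int main(void)
{
    uint64_t vals[] = { 0x70F, 0xFCF }; /* kernel and user side PTEs above */

    for (unsigned int i = 0; i < 2; i++) {
        unsigned int idx = (vals[i] >> 2) & 0x7;
        printf("0x%llX -> AttrIndx %u (%s)\n", (unsigned long long)vals[i],
               idx, idx == 3 ? "normal non-cacheable" : "other");
    }
    return 0;
}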
Since in the MFC DT node we have specified the dma-coherent property, the buffer pages now become normal cacheable on both the MFC and the CPU side. Hence the MFC IOMMU sends snoop requests over the cache coherent interconnect, that is, the CCI, and the CPU sends snoop responses with the cached data in this use case. Here again, the dma_sync calls are not called in the prepare or finish functions in the VB2 framework. As you can see in the slide, for the kernel side page table entry for the buffer we get the value 713 hex, which indicates the memory type is set to normal cacheable. For the user side page table entry for the buffer we got the value FD3 hex, which again indicates the normal cacheable memory type. The device side attribute, that is, the MFC IOMMU page table entry, was F47 hex, which also indicates cacheable. All are in accordance with the DT property set for MFC. All the experimental results for these two use cases with respect to the processing time will be explained in a later slide. One thing we need to know is that consistent memory can be expensive on some platforms, and the minimum allocation length may be as big as a page, so you should consolidate your requests for consistent memory as much as possible. Also, consistent DMA memory does not preclude the usage of proper memory barriers: the CPU may reorder stores to consistent memory just as it may do for normal memory. Next is the streaming API with software cache maintenance, with the MFC IP in the use case. Streaming DMA mappings are usually mapped for one DMA transfer and unmapped right after it, unless we use the dma_sync calls, and the hardware can optimize for sequential accesses. In this use case the dma-coherent property is not added in the DT node for MFC, hence hardware coherency is turned off. The VB2 memory type will be USERPTR, which means the buffer is allocated in user space via malloc. Using the calls dma_map_single or dma_map_sg, this user space memory is mapped to a device DMA address. Before we call QBUF to give the buffers to the driver, we call dma_sync_sg_for_device in the prepare callback; this flushes the CPU cache data to memory. Since the MFC IOMMU cannot snoop, as there is no hardware coherency, all reads go to memory, and it is still a coherent system since the CPU cache was flushed in the previous step. Before DQBUF is called to give the buffer back from the driver to user space, we call dma_sync_sg_for_cpu in the finish callback; this will invalidate the cached entries in the CPU. Now the CPU accesses and caches the data from memory, and the data is coherent since the invalidation has happened in the previous step. So even though there is no hardware coherency in this use case, we have managed to keep the system coherent using software cache maintenance. And as we can see in the slide, the user side page table entry value read was FD3 hex, which indicates that the buffer was of the normal cacheable memory type, but the device page table entry for the MFC IOMMU was read as F43 hex, indicating non-cacheable. The interfaces for using this type of mapping were designed in such a way that an implementation can make whatever performance optimizations the hardware allows; to this end, when using such mappings, you must be explicit about what you want to happen. In this slide we have captured all the data for the previously discussed experiments in a table. It indicates the processing time for 500 and 5000 frames for the three use cases discussed. When we consider the consistent API where there is hardware coherency, that is, snooping is involved, the processing time is much lower.
In the next case, the consistent API where there is no hardware coherency, that is, the non-cached use case, the processing time is slightly higher. In the case of the streaming API, where there is no hardware coherency but there is software cache maintenance through the dma_sync calls, the processing takes the highest time. Disabling caching sometimes gives good performance compared with software cache maintenance, but it is always better to profile before selecting which approach to use for the driver, since it will depend on a lot of factors like the cache size, the DMA buffer size, and how the CPU and the device access those buffers; so the user needs to decide which implementation to use as per their requirements in the end. Before we end: the V4L2 and VB2 references were taken from Linux 5.4. Memory ordering, that is, gathering, reordering and early acknowledgement, was not discussed in this session; Linux offers support for various memory types and barrier calls to prevent side effects when accessing such weakly ordered memory. The AXI protocol and the ACE or ACE-Lite protocols were also not discussed in this session. For a complete understanding, please refer to the specifications mentioned in this slide. Thank you.