Hello, good afternoon, everybody. I am Aditya from Sony. Today I'll be talking about the Advanced XIP File System, that is, AXFS. The agenda for today: we'll look at an AXFS overview, its profiling feature, its implementation, and its performance.

Just to give you some background on this talk: we recently pushed AXFS into the 3.4 LTSI tree. We think that AXFS is more powerful than its counterparts, like CramFS and SquashFS, in many aspects, and that it deserves a wider user base than it currently has. So in this talk I'm going to give you some insight into AXFS, so that more people can start playing with it, maybe include it in their products, and help us make it even better.

About embedded systems: as we know, the cost, power, size, and other constraints in embedded systems make a compressed, read-only root file system an ideal choice. Commonly used file systems in embedded systems, like CramFS and SquashFS, are read-only and support compression; compression and read-only operation are very important in an embedded system. SquashFS supports compression of the files in the file system. CramFS supports compression as well as XIP, but it supports execute-in-place at file granularity: a file is either entirely execute-in-place or entirely compressed. AXFS is a read-only file system that supports execute-in-place at page granularity; that's the main point about AXFS. Each page in a file can either be compressed or be an XIP page.

So we'll start with an introduction to what XIP is, what we mean by XIP. We'll look at the advantages and disadvantages of the XIP approach, then at the AXFS file system image format, and we'll take a short dive into the AXFS source code. XIP means execute in place: wherever the file system image is, we directly execute from that place.
It's only possible on memory-mappable, byte-addressable devices such as NOR flash, ROM, and system RAM. For XIP to be practical, it's very important that the speed of the device be comparable to system RAM. What happens is that the process's virtual memory is made to map directly to the physical addresses of the XIP medium on which the image resides, so we don't need to load the code pages from the medium into the page cache in system RAM.

This diagram summarizes what XIP is. When we have an executable in the file system image and we don't do it the XIP way, we load the code and the data into system RAM, and those pages are mapped into the address space of the process. The XIP way of doing things is that we load the data, of course, but the code pages, the executable pages, we map directly from the file system image on the medium.

XIP gives us some advantages. It reduces boot time and application launch time: XIP is faster because it does not load pages into memory from secondary storage, and that saves a lot of time. It also reduces cost: it reduces the requirement for system RAM, freeing up RAM that we can use for some other purpose, or, on the other hand, letting us put less RAM into products. It reduces power consumption, because less energy needs to be spent refreshing bits in the RAM, and that again helps reduce cost.

XIP also has some disadvantages. One that stands out is that XIP is not suitable for pages that are hotspots in the system, because NOR flash is generally slower than system RAM. So we really cannot use XIP for code that is accessed in a tight loop or that is critical for performance; that type of code does not go well with the XIP approach. Yes, it would be cached, but if we have a cache eviction, then it has to be fetched from NOR again.
Usually, if it's a very tight loop, then yes, it will always be accessed from cache. But if it's not a very tight loop, something intermediate, then there are chances it will get evicted and you again have to fetch it from NOR. Those kinds of things are generally not suitable for NOR. Yeah, yeah.

So, AXFS is a 64-bit file system. That's very important, because it differentiates it from CramFS and SquashFS; CramFS has size limitations, as we know. AXFS is big-endian, and it's a read-only file system. As I told you, it allows XIP at page granularity, and that's very important. It supports compression block sizes from 4 KB up to 4 GB, and it can mount directly from an MTD device.

The main advantage of AXFS is that the AXFS image can be split across multiple devices. This allows the XIP part of the image to be put on NOR flash and the non-XIP, compressed part on NAND flash. If we can split the image between two devices, then we only need as much NOR as the number of XIP pages we want in the system, and the compressed pages can go onto NAND, which is of course much cheaper than NOR. This device spanning, though, is only possible if the first device, the NOR or ROM or whatever, is memory mappable. Only if the first part is XIP is device spanning possible; it's not possible the other way round, with the start of the image on NAND and the later part on NOR. This feature, as I said, helps make products more economical. And as we can see, AXFS can interact directly with the MTD layer, or it can go through the block layer.

So what is AXFS profiling? Basically, AXFS profiling is how we decide which pages should be XIP and which pages should be compressed. That's very important from a system performance point of view. So how do we go about AXFS profiling?
The overall idea is that we run the typical applications and use cases with a non-XIP AXFS image, and we take a log of the pages that turn out to be executed during that profiling run. We feed that data into a second image-building step: the image builder takes the log, and of course the directory tree, and gives us the XIP image.

In a little more detail: we build the normal image (we give it a directory and it gives us an output image), then we run the typical application test cases, and once we have the profiling data we feed it into the image builder and get the final output image. So, as I said, AXFS profiling gives us the pages that should be XIP. For profiling to be effective, it should cover all the important use cases of the application. AXFS contains an in-built profiler for this purpose; we can turn it on with a kernel config option. After profiling, we have a log which is fed into the image builder.

Now, AXFS is a 64-bit file system, and yet it can have a very compact file system image. How is that possible? Byte tables are the scheme that lets AXFS support 64-bit offsets with very low overhead. A byte table is a sequence of bytes, and each byte table entry contains only as many bytes as are required to hold the maximum value that the table will store. For instance, if we are going to hold offsets, each entry has only as many bytes as are needed to hold the maximum value of those offsets. The number of bytes in an entry of the table is called the depth of the byte table.
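For illustration, the depth is just the number of bytes needed for the table's largest value; here is a minimal sketch (the helper name is mine, not the driver's):

```c
#include <stdint.h>

/* Bytes needed to represent values up to max_value: 0..255 fits in one
 * byte, 256..65535 in two, and so on up to eight bytes for a full
 * 64-bit value. Illustrative helper, not actual AXFS driver code. */
static int bytetable_depth(uint64_t max_value)
{
    int depth = 1;
    while (max_value >>= 8)
        depth++;
    return depth;
}
```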
So let's say we need to store numbers less than 256; then we can have a byte table of depth just one, since one byte can represent the values from 0 to 255. And say we need to store a number with a maximum of 500; then we need a byte table of depth two. As I said, the byte table scheme is the key that allows AXFS to have 64-bit offsets in the image and still be pretty light.

The following code snippet sums up how the AXFS file system driver uses byte tables. We give it the depth of the byte table, the virtual address where the table is stored, and an index into the table, and as we can see it does a pretty simple calculation and gives us the value of that byte table entry. Of course, the AXFS driver code itself deals only with 64-bit numbers: in the AXFS code we have 64-bit numbers, and in the image format we have byte tables.

This is the format of the AXFS image. We have the superblock, then what we call region descriptors, and then a list of contiguous regions. A region is a contiguous segment of the AXFS file system image. Region descriptors are the descriptors for those segments: they store the location of a segment and some of its attributes. The attributes stored include the size of the region and whether the region is compressed or XIP; and if the region contains a byte table, the descriptor also contains the byte table depth. The following is the on-media representation of a region descriptor in the AXFS image: we have the offset, which says where in the AXFS image the region is located, the size, the compressed size, and so forth. So what do regions contain? Regions contain data as well as metadata; basically anything the file system holds.
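The "pretty simple calculation" mentioned above might look roughly like this. Note that the entry layout here, depth consecutive big-endian bytes per entry (matching the image's big-endian format), is an assumption for illustration; the real on-media layout may differ:

```c
#include <stdint.h>

/* Read entry `index` from a byte table of the given depth, assembling
 * the value byte by byte. Each entry is assumed to occupy `depth`
 * consecutive big-endian bytes; names and layout are illustrative,
 * not the AXFS driver's. */
static uint64_t bytetable_get(const uint8_t *table, int depth,
                              uint64_t index)
{
    uint64_t value = 0;
    for (int i = 0; i < depth; i++)
        value = (value << 8) | table[index * depth + i];
    return value;
}
```

With depth two, the entry bytes 0x01 0xF4 come back as 500, matching the earlier example.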
Most of the regions that contain metadata hold it in the form of byte tables. The regions that contain actual data are the XIP region, the compressed region, and the byte-aligned region. The XIP and compressed regions contain the XIP pages and the compressed pages, respectively. The byte-aligned region contains a special type of data: data that does not compress well. If we don't get good compression, or maybe even get negative compression, then we don't compress that data; we keep it as it is, in the byte-aligned region. So we have three types of data region: XIP, compressed, and byte-aligned.

Within a region, the data is stored in chunks called nodes. These nodes are usually 4 kilobytes in size; they are basically the pages of data from the files. If a node is XIP, it's stored as is; if it's compressed, then of course it may have two or three blocks of data compressed together. Nodes in a region have types: either XIP, compressed, or byte-aligned.

This is the superblock structure for AXFS, with the various offsets stored in it. Now I'm going to talk a little about a few of the regions in the AXFS image: the strings region, the node index region, the node type region, and the inode array regions. The strings region contains the names of the files, and it is the only metadata not in the form of a byte table. The offset of a file's name within the strings region is pointed to by the inode name offset region. The inode array index region contains, for each file, the offset into the node index region; so the node index region has the index into the actual data, and the inode array index region points into the node index region. The following are the functions that access these regions from the file system image. As we can see, such a function takes the superblock and the index of the inode, and here it gets the offset from the inode name offset region.
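In spirit, that name lookup is just base plus offset; here is a toy sketch with illustrative names, where flat arrays stand in for the mapped regions and byte tables:

```c
#include <stdint.h>

/* Toy model of the name lookup described above: the strings region is
 * one blob of NUL-terminated file names, and a table (in the real
 * image, a byte table) gives each inode's offset into that blob.
 * Names are illustrative, not the driver's identifiers. */
static const char *inode_name(const char *strings_region,
                              const uint64_t *name_offset,
                              uint64_t ino)
{
    return strings_region + name_offset[ino];
}
```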
That offset is added to the virtual address in memory where the strings region is mapped, and the result is returned as the string.

Then we have the node type region and the node index region; this is basically how the file system driver gets the data out of the image. The node type region contains the type of the node being referenced, and the node index region contains the node's index. The index is an index into one of the data regions, and the type determines how to interpret it. For example, the index of an XIP-type node is just the page offset into the XIP region. The index of a byte-aligned-type node is an offset into a byte table which contains the actual offset into the byte-aligned region, so it's a double indirection.

The index of a compressed-type node is a little more complicated: it is an index into two separate regions, the cblock offset region and the cnode offset region. What are cblocks and cnodes? A cnode is a chunk of uncompressed data, and a cblock is a compressed block of data: we take some number of cnodes, compress them together, and that becomes the cblock. So a cblock is a compressed version of concatenated cnodes, and when we decompress a cblock we get the concatenated cnodes back. So for a node in the compressed region, we have an index to a cblock, which is taken out of the image and decompressed, and then the cnode offset we had is used to find which cnode within the cblock holds our data. This is just the snippet of code that implements this double index and gets the data.
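Before moving on, here is a toy model of the compressed-node path just described. The struct and table names are my own illustrations, not the driver's, and the identity copy below merely stands in for a real decompressor:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NODE_SIZE 4096 /* node size mentioned in the talk */

/* Illustrative in-memory view of the tables described above. */
struct cview {
    const uint64_t *cblock_offset; /* per node: where its cblock starts */
    const uint64_t *cnode_offset;  /* per node: offset of its data inside
                                      the decompressed cblock */
    const uint8_t  *cregion;       /* the compressed region itself */
};

/* Stand-in for the real decompressor: here "compression" is the
 * identity, so this just copies the cblock. */
static void decompress(const uint8_t *src, uint8_t *dst, size_t len)
{
    memcpy(dst, src, len);
}

/* Fetch one node: locate its cblock, decompress the whole cblock, then
 * pull this node's cnode out of the decompressed buffer. */
static void fetch_compressed_node(const struct cview *v, uint64_t node,
                                  size_t cblock_len, uint8_t *out)
{
    uint8_t buf[2 * NODE_SIZE]; /* big enough for this sketch */
    decompress(v->cregion + v->cblock_offset[node], buf, cblock_len);
    memcpy(out, buf + v->cnode_offset[node], NODE_SIZE);
}
```

The point of the double index is visible here: one table locates the shared cblock, the other locates the page within the decompressed result.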
We measured AXFS performance on the following hardware: the NaviEngine board, which has an NEC SoC with four ARMv6 CPU cores. We used 20 MB of memory. It has 32 KB of L1 cache, I-cache and D-cache, and no L2 cache. We used a 3.0 kernel for the performance measurements, 64 MB of NOR flash, and GCC 4.5.1 as the compiler. The file system image was mounted on the NOR flash.

These are the performance parameters we covered in our evaluation of AXFS. We measured the application launch time; the total flash used by the file system image, which is basically the size of the image itself; and, third, the RAM footprint of the file system.

Before going into the performance numbers: we also looked at other ways to improve AXFS performance. One of the techniques we played with, in combination with AXFS, was improving the code locality of the program. What we did was implement a tool which records the calling order of the functions in a program and then generates a linker script which places those functions together, in that order, to give us better performance. Improving code locality gives us fewer page faults, and that matters a lot, because when code lives in a file system such as AXFS, a fault means fetching the page from NOR flash, which is a bit costly. So this really helps reduce the number of page faults in the system; it improves application speed and reduces the total RAM used while the application is running, that is, the RAM footprint.
This is the description of the tool we implemented. It's a simple tool: we compile the application with GCC's instrumentation option, the application is run, and our tool attaches to it, traps the function calls, and spits out a log of the order in which the functions are called. That log is then given as input to a linker script generator script, which gives us a linker script, and we can now rebuild the application so that it runs faster.

Okay, coming back to AXFS performance. The application we use to measure performance is a dummy application which simulates a large number of page loads during application launch. To give you the details: we had around 500 functions called in pseudo-random order. As you know, pseudo-random numbers come out in the same sequence if you give the same initial seed, so the runs are repeatable. Each function was around 1600 bytes in size, mostly NOPs and a few memory access instructions. Because of some constraints, we used this application as the init process of the system. We measured timing by embedding calls that record the value of the clock at particular positions in the program: one call just before the kernel spawns the init process and one just after executing the 500 functions, and we take the difference between the two clock values as the measurement. The measurement uses four file system images. One is the AXFS image without the XIP profiling pass.
That is, a plain compressed AXFS image. The second is the AXFS XIP image, generated, as I explained, by feeding the profiling data back into the image builder. Then we have the code-locality-improved program on top of the compressed AXFS image, and finally the code-locality-improved program on top of the AXFS XIP image.

In the first measurement, we measured the application launch times. The application launch time is the interval from just before launching the init process to just after finishing those 500 pseudo-random functions, and we measure that duration for all four images. These are the numbers for the launch of the application. As we can see, compressed AXFS takes the longest; the locality-improved compressed AXFS takes a little less; and, showing the benefit of XIP, the XIP images take the least time, though the locality-improved XIP AXFS and the plain XIP AXFS have almost the same numbers. This shows the improvement that XIP gives in application launch time.

The next measurement was the total size of the AXFS image, that is, the flash it takes to store the image. Again we have four images: the compressed AXFS image, the AXFS XIP image, and the two locality-improved variants on top of them; the measurement is the total size of the file system image. In this dimension the compressed AXFS is the best, as it is supposed to be, because it is compressed; XIP gives us a somewhat larger image, and then come the locality-improved images.

The third performance parameter we tested was the system RAM footprint, again with the same four images.
What do we mean by system RAM footprint? The usual definition is the total memory used by the application while it is running. We mean something a little different here: the total RAM used by the whole system while the application is running, not just by the application, including all memory in use. So we measure the total memory used in the system while running our application, and that's fair, because our application is the only process created in the system during that time; since it's the only process, it's a fair measurement. As we see, the total RAM footprint is lowest for XIP AXFS and a little higher for the compressed AXFS image.

So, the key point of AXFS is that it allows XIP at page granularity, though it requires a profiling pass. Another important thing AXFS gives us is that it can span multiple devices, and we can use it over the MTD layer. It's mostly suitable for products that are boot-time and launch-time sensitive, and not very suitable for hotspots in the system. Any questions?