Hello everyone, I'm an engineer from Intel. My name is Liu Changpeng, and this is my colleague Liu Xiaodong. We have implemented an SPDK vhost-user virtio-fs target, and I'm going to present the slides about it.

Before we get into the details, let me explain why we wanted to create a vhost-user virtio-fs target. If you listened to the morning session, or if you have used containers or VMs with 9pfs, you will be familiar with the background. In the beginning, when we wanted a user-space file-sharing target, we used 9p, and we also discussed the disadvantages of 9p and talked about virtio-fs. So we shifted to virtio-fs, did some prototyping, and built a simple, compatible virtio-fs target. That is what I want to demonstrate to you today.

Before we start with this topic, let me explain what SPDK is. It is a set of user-space libraries: at the bottom are the drivers, and above them the block and storage service layers. What I want to highlight is performance: we can reach millions of I/Os per second on a single core, and we have already reached the tens-of-millions IOPS level overall. Our web page describes what we have done and also includes comparisons with other technologies.

This is the architecture of SPDK. At the bottom layer we have the NVMe driver and the QuickData (IOAT) copy engine. On top of that we have the storage services: the block device layer (bdev), logical volumes that can be used for block management, the crypto bdev, the OCF (Open CAS Framework) cache bdev, the open-channel FTL, and the compression bdev. Above that we have the storage protocols: the NVMe-oF target, which the previous talk already covered, and the vhost targets, vhost-scsi and vhost-blk; today we are talking about the vhost-user-fs target, which is a pure user-space implementation. We also have the blobstore, which is a block management and allocation layer.

This slide shows the unified SPDK vhost-user target. We support vhost-user-blk, we support vhost-user-scsi, and similarly we support vhost-user-nvme, so the guest can even use the NVMe protocol directly, and now the vhost-user-fs target; all of them are built on a common vhost-user library. For those who don't know vhost: it is defined by the virtio spec, and there are basically two implementations, the in-kernel vhost and the user-space vhost-user that our targets use. The biggest advantage of vhost-user is that it eliminates MMIO exits on the submission path, and the way we achieve that is with a polling mechanism.

If we want to share a file or a directory tree from the host to the guest, the existing protocol is 9pfs, and there are two ways to implement it, as listed here. The first one is virtio-9p: if you use the QEMU back end, all the processing happens inside the QEMU process. There is also vhost-9p; I didn't add "user" here because it is a kernel solution, and there is only a minimal implementation, so that is the in-kernel vhost-9p. Now let's look at the back end; the front end is the same, the 9p client driver in the guest kernel. For the back end, vhost-9p receives the requests over the virtqueue and then uses standard system calls to serve them.

From the front end to the back end, what we are really talking about is FUSE, Filesystem in Userspace. It consists of two parts. One is the FUSE driver: VFS is the common file-system layer in Linux, and the FUSE driver sits below it in kernel space. The other is libfuse: a user just links against libfuse and can implement a customized FUSE daemon.
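To make that split concrete, here is a minimal sketch of a FUSE daemon written against libfuse, modeled on the classic libfuse "hello" example and assuming libfuse 3. The file name and contents are purely illustrative, and directory listing is omitted for brevity.

```c
/* Minimal FUSE daemon sketch (libfuse 3, modeled on the upstream hello example).
 * It exposes a single read-only file; names and contents are illustrative. */
#define FUSE_USE_VERSION 31
#include <fuse3/fuse.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>

static const char *hello_path = "/hello";
static const char *hello_data = "hello from a FUSE daemon\n";

static int hello_getattr(const char *path, struct stat *st,
                         struct fuse_file_info *fi)
{
    (void)fi;
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;
    } else if (strcmp(path, hello_path) == 0) {
        st->st_mode = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size = strlen(hello_data);
    } else {
        return -ENOENT;
    }
    return 0;
}

static int hello_open(const char *path, struct fuse_file_info *fi)
{
    if (strcmp(path, hello_path) != 0)
        return -ENOENT;
    if ((fi->flags & O_ACCMODE) != O_RDONLY)
        return -EACCES;
    return 0;
}

static int hello_read(const char *path, char *buf, size_t size, off_t off,
                      struct fuse_file_info *fi)
{
    size_t len = strlen(hello_data);
    (void)fi;
    if (strcmp(path, hello_path) != 0)
        return -ENOENT;
    if ((size_t)off >= len)
        return 0;
    if (off + size > len)
        size = len - off;
    memcpy(buf, hello_data + off, size);
    return (int)size;
}

static const struct fuse_operations hello_ops = {
    .getattr = hello_getattr,
    .open    = hello_open,
    .read    = hello_read,
};

int main(int argc, char *argv[])
{
    /* fuse_main() mounts the filesystem and runs the FUSE request loop. */
    return fuse_main(argc, argv, &hello_ops, NULL);
}
```

Built against libfuse (for example with `gcc demo.c $(pkg-config --cflags --libs fuse3)`) and run with a mountpoint argument, the kernel FUSE driver forwards VFS operations on that mountpoint to these user-space callbacks, which is exactly the kernel-driver/user-daemon split described above.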
If we are going to use FUSE in a virtualized environment, how can we do that, and what changes are needed? That is virtio-fs. The major driver of it is Red Hat, and I see in the community that some Chinese colleagues are already submitting patches. It is a shared file system that lets a virtual machine access a directory tree on the host. It is designed to offer local file system semantics and to replace 9p, because 9p is really a network file system. Right now virtio-fs has just started and is still being developed; it is not finalized yet. It has been submitted to the community, so if you have opinions, requests for new features, new solutions, or anything else that needs to be added, you should raise them.

Virtio-fs includes four parts. The first is Linux: you need to add the virtio-fs transport layer to the guest kernel. The second is QEMU, then the FUSE daemon (virtiofsd), and then Kata Containers, where the modification has already been added in release 1.7. On the QEMU back end, a shared memory BAR has been added to present host memory to the guest, so the guest page cache can be bypassed.

This slide is about the projects we just mentioned; they are already open source. For Kata Containers it is in the community, and virtiofsd is there for you to evaluate; it is already working. We call it passthrough because even though it receives the FUSE requests, it does not implement a file system itself; it just passes them through to the existing host file system.

We already have the vhost-user library, so for our vhost-user-fs target we can receive the FUSE requests over vhost, call the BlobFS APIs, and then go through a bdev driver down to the physical device. You can see that the DAX part is still empty; so far we do not have a proposal that covers it. For DAX on the guest side, the BAR memory provides a cache window (for example a 4G window) that lets the guest bypass its own page cache.

Now about SPDK BlobFS and the blobstore. The current implementation is very simple; it is mainly for large files, and for small files it is not good enough at this moment. An application interacts with chunks of data called blobs; one blob consists of multiple chunks, and the application consumes blobs rather than raw blocks, just like RocksDB does. It is designed for fast media, so the optimizations are friendly to that. What is more important are the following points. First, it is asynchronous: there is no blocking, no queuing, and no waiting; it is fully parallel, with no locks in the I/O path. That is the blobstore; for the BlobFS APIs there are locks. On top of the blobstore, the logical volumes layer gives us snapshots and thin provisioning.

This is the layout of the blobstore. You can see different kinds of pages: the device is addressed from LBA 0 to LBA N, the space is divided into clusters, and a cluster is a larger unit made up of pages. You can regard the blobstore as an array of pages, and its metadata is stored in pages in a reserved region of the disk, which we call the metadata region. I will go through this quickly because these are fine details.

This table shows the correspondence: virtio-fs issues different kinds of FUSE requests, and BlobFS provides different kinds of APIs, so you can map them to each other. For example, for lookup, if you are going to look up a file, we use an iterator in BlobFS and walk from the first file to the next until the name matches.
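As a rough illustration of that request-to-API mapping, here is a minimal dispatch sketch. The `fuse_in_header` layout and the opcode values follow the kernel FUSE protocol (linux/fuse.h); the `blobfs_handle_*` functions are hypothetical stand-ins for the asynchronous SPDK BlobFS calls the real target would make, so only the dispatch shape is meant literally.

```c
/* Sketch: dispatching FUSE requests popped from the virtqueue to per-opcode
 * handlers. Header layout and opcodes follow the kernel FUSE protocol;
 * the blobfs_handle_* functions are illustrative placeholders. */
#include <stdint.h>
#include <stdio.h>

struct fuse_in_header {
    uint32_t len;        /* total request length, header included */
    uint32_t opcode;     /* which operation: lookup, open, read, ... */
    uint64_t unique;     /* request id, echoed back in the reply */
    uint64_t nodeid;     /* inode the request operates on */
    uint32_t uid, gid, pid;
    uint32_t padding;
};

/* A few opcode values from the FUSE protocol. */
enum { FUSE_LOOKUP = 1, FUSE_FORGET = 2, FUSE_OPEN = 14,
       FUSE_READ = 15, FUSE_WRITE = 16, FUSE_RELEASE = 18 };

/* Hypothetical handlers; the real target would issue asynchronous BlobFS calls. */
static void blobfs_handle_lookup(const struct fuse_in_header *h)
{
    printf("LOOKUP (unique %llu): iterate BlobFS files until the name matches\n",
           (unsigned long long)h->unique);
}

static void blobfs_handle_open(const struct fuse_in_header *h)
{
    printf("OPEN (nodeid %llu): prepare file resources\n",
           (unsigned long long)h->nodeid);
}

static void blobfs_handle_read(const struct fuse_in_header *h)
{
    printf("READ (nodeid %llu): asynchronous BlobFS read\n",
           (unsigned long long)h->nodeid);
}

/* Dispatch one request popped from the virtqueue. */
static void dispatch_fuse_request(const struct fuse_in_header *hdr)
{
    switch (hdr->opcode) {
    case FUSE_LOOKUP:  blobfs_handle_lookup(hdr); break;
    case FUSE_OPEN:    blobfs_handle_open(hdr);   break;
    case FUSE_READ:    blobfs_handle_read(hdr);   break;
    default:           printf("opcode %u not supported yet\n", hdr->opcode); break;
    }
}

int main(void)
{
    struct fuse_in_header lookup = { .len = sizeof(lookup), .opcode = FUSE_LOOKUP,
                                     .unique = 1, .nodeid = 1 };
    dispatch_fuse_request(&lookup);
    return 0;
}
```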
For lookup, once the name matches, we return the file's attributes. Open, release, create, delete, read, write, rename, flush: these are the common requests, and BlobFS provides APIs for them. The APIs we provide for the new target are asynchronous; compared with virtiofsd it is fully asynchronous. The asynchrony has some limitations, though: at the underlying layer you do not need a lock, but for an individual file you do need one.

Here are some more details about the virtio-fs correspondence: the requests are embedded in the vring, so general FUSE requests are also placed in the virtqueue.

This slide is about the open and close operations in virtio-fs and SPDK. For example, if the application at the top calls open, there will be two FUSE requests: one is lookup, the other is open. For lookup we use BlobFS to iterate to the file, and then we open it, which essentially prepares the resources. Close is just the opposite of open: it also produces two requests, release and forget, and those again map to the corresponding BlobFS APIs.

These are some implementation details for read and write, and how the data is carried. At the bottom the application issues a POSIX read; it is converted into a FUSE read, placed into the fuse_in_header and fuse_read_in structures, and the data buffer is attached as a separate descriptor. On the vhost-fs target side, whenever there is a new entry in the virtqueue, it fetches the FUSE request and calls the BlobFS file read API to process it; you can see the fuse_read_in, then the data buffer, and so on.

To summarize, as I have already mentioned, BlobFS currently supports only a limited set of file APIs. Another big limitation is that we can only support append writes. Another thing that is not available is DAX. DAX is very useful: if you run evaluation measurements with DAX enabled versus disabled, the performance is quite different, and that is also why we have not implemented mmap yet. The second point is the workload: we need to think about the usage scenarios, for example in containers, whether the workload is write-friendly or read-friendly; the read cache feature is very simple. The third point is that it is good for big files but not friendly to small files.

As for the next steps: a file system is very complex, and BlobFS was built for some limited scenarios. Originally it targeted RocksDB, but now the usage scenario has changed, so we are no longer limited to RocksDB. That means if we are going to extend it to FUSE we have a lot of work to do, including reworking the existing model. For the blobstore, the underlying layer is totally asynchronous and lock-free, but once the file system layer is involved, if several requests touch the same file you need to add a lock; requests on different files can still run in parallel.

As for benchmarks, virtio-fs itself was implemented only recently, so our plan is to make things more mature, or wait until the community support and BlobFS are mature, and then run benchmark tests.

At the bottom there are some useful links: virtio-fs is available from the second link, and the SPDK-related support is at the first and third links. What you need to do is prepare QEMU and the Linux guest kernel. The benefit of this design is that we only need to pay attention to the BlobFS side; QEMU, the kernel, and the virtio-fs/FUSE pieces we leave to Red Hat and the community.
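Going back to the read path described above, here is a small sketch of the request and reply layout for FUSE_READ. The `fuse_read_in` and `fuse_out_header` layouts follow the FUSE protocol; the in-memory `file_data` buffer and the `serve_read` helper are illustrative stand-ins for a BlobFS file and the target's asynchronous, callback-driven read handling.

```c
/* Sketch of serving a FUSE_READ: the guest's POSIX read() becomes a FUSE_READ
 * whose request part carries fuse_in_header + fuse_read_in and whose reply part
 * is fuse_out_header followed by the data buffer. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct fuse_read_in {              /* request body for FUSE_READ */
    uint64_t fh;                   /* file handle returned by open */
    uint64_t offset;               /* byte offset to read from */
    uint32_t size;                 /* number of bytes requested */
    uint32_t read_flags;
    uint64_t lock_owner;
    uint32_t flags;
    uint32_t padding;
};

struct fuse_out_header {           /* common reply header */
    uint32_t len;                  /* total reply length, header included */
    int32_t  error;                /* 0 or a negative errno */
    uint64_t unique;               /* copied from the request */
};

/* Stand-in for a BlobFS file. */
static const char file_data[] = "file contents served by the target";

/* Fill the reply for one read request; returns the payload length produced. */
static uint32_t serve_read(const struct fuse_read_in *in, uint64_t unique,
                           struct fuse_out_header *out, char *payload)
{
    uint32_t avail = sizeof(file_data) - 1;
    uint32_t n = 0;

    if (in->offset < avail) {
        n = avail - (uint32_t)in->offset;
        if (n > in->size)
            n = in->size;
        memcpy(payload, file_data + in->offset, n);  /* real target: BlobFS read */
    }

    out->len = (uint32_t)sizeof(*out) + n;
    out->error = 0;
    out->unique = unique;          /* lets the guest match reply to request */
    return n;
}

int main(void)
{
    struct fuse_read_in req = { .fh = 1, .offset = 5, .size = 16 };
    struct fuse_out_header out;
    char buf[64] = { 0 };

    uint32_t n = serve_read(&req, 42, &out, buf);
    printf("read %u bytes: \"%.*s\"\n", n, (int)n, buf);
    return 0;
}
```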
So we don't need to spend too much effort on the upstream pieces ourselves.

Finally, about our SPDK community: we have several channels. There is Slack, where you can find us and track tasks, plus GitHub, the Jenkins CI, and the weekly community calls. It is an open-source project and you are welcome to join. Any questions?

Hello, about DAX I still have some confusion. In our understanding, the NVDIMM address and the block address need a mapping, the mapping is saved in the page table, and then you can access the NVDIMM physical address. But on this chart I don't see how that works; how do you do the mapping? I don't fully understand it.

Actually, when you look at what virtio-fs has done for the DAX window, it doesn't mean you need a physical device. How do you simulate an NVDIMM? You can use memory. So I can tell my guest that this part of memory is persistent, and accesses to it bypass the guest page cache and go straight to that memory.

Suppose I have 2G of data here, sorry, I can't find the right slide, suppose I have 2G of data on the host.

Right: if we have requested DAX and expose it to the guest, the guest sees persistent memory with that 2G of data as a device, and on the virtio-fs side it is backed by host memory. When the guest has the address it bypasses the guest page cache. If you map a 2G file region, you can actually see it on the host.

Okay, I got it. I have another question about PMDK; I think it is also included in SPDK.

PMDK and SPDK are two different projects, but normally at the summit they are presented together.

I think the SPDK roadmap includes it.

Our roadmap says we support PMDK and we can call its APIs, but it is still run by its own community.

Okay, thank you. I'm still asking about DAX, because you haven't set up the mapping; can BlobFS support mmap as an API?

We still do not support that; BlobFS does not support random writes.

It's not really a random write; for the 2G BAR memory you need to map the fd in QEMU for your vhost-user daemon. As I understand it, BlobFS can only be accessed by the daemon instead of QEMU in a vhost solution.

Actually QEMU registers the whole guest memory with the host. QEMU is an independent process and the vhost target is another process, and they can share memory; that is why we require hugetlbfs.

But that only covers the guest memory; the DAX window is additional host memory, and QEMU would need to change for that.

It doesn't need to change much; that region is also registered on the host. I don't remember QEMU having such a mechanism. You can take this part of memory: it is reserved by QEMU, say a 2G region allocated when QEMU starts, and that includes the 2G of memory.

That means the vhost target is going to operate on that region?

Yes, we can do that, but the key question is why you would use the page cache here. Maybe you want to flush after writing to that memory, but our back end does not support random writes. Even when we support it, BlobFS can only do the flush once someone has written to the mapped region, so for now it is a read-only cache; we can only offer a read-only cache. If we could let anyone update and write directly and then flush that back, DAX could be fully achieved.

If you do that, will it be in the vhost daemon? Because the memory belongs to QEMU.
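To illustrate the point under discussion, that the DAX window is essentially host file pages mapped into guest-visible memory, here is a small host-side sketch using an ordinary mmap. The file path and window size are illustrative; the mapping only stands in for what the QEMU/virtiofsd pair does when it installs a file mapping into the shared-memory BAR, which is how the guest ends up touching host page-cache pages directly.

```c
/* Host-side illustration of the DAX idea: map a file's pages directly into the
 * address space and access them with plain loads/stores, instead of going
 * through read()/write() and a private buffer. In virtio-fs, QEMU performs a
 * similar mmap of the file's fd into the shared-memory BAR (the DAX window). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/dax-window-demo.bin";   /* illustrative backing file */
    const size_t win_size = 2UL << 20;               /* 2 MiB window for the demo */

    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, win_size) != 0) {
        perror("open/ftruncate");
        return 1;
    }

    /* MAP_SHARED: stores are visible to every process mapping the same file,
     * which is what makes the "window" shared between host and guest. */
    char *win = mmap(NULL, win_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (win == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    memcpy(win, "written through the mapping", 28);  /* plain store, no write() */
    msync(win, win_size, MS_SYNC);                   /* flush dirty pages to the file */

    printf("window contents: %s\n", win);

    munmap(win, win_size);
    close(fd);
    return 0;
}
```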
I still want to ask about DAX. For the kernel DAX path, if you use it, what is the purpose? Is it a must?

It's optional, and it is also marked as experimental. If the backing is an NVDIMM card, which is persistent memory, the data survives; if you don't have such a card and the mapping goes to DRAM, the data can be lost. Without DAX, writes go to the page cache first; when you write back and flush, the contents in memory are flushed to the disk. If you have a 2G disk region, it corresponds to the LBA space.

So which side maintains the mapping table?

If you don't have DAX, I/O goes through the page cache. If you have DAX and the backing block device is persistent, accesses go directly to the device. But if it is virtual, it's another story: we just tell the guest DAX code that the region is persistent, and it bypasses the guest page cache, so ultimately accesses still land in that part of host memory. For cached data or cached metadata, if you back it with plain memory rather than an NVDIMM, the cache contents will be lost.

Are you doing that for performance?

The purpose is to reduce the page cache footprint. For this solution it is host memory either way; from the guest's perspective it is also host memory. I don't have a physical persistent device, so the performance should be about the same. But when you have the page cache you can get cache hits, so you don't need to go to the device. Normally, without a DAX device, if you write to an SSD the data first sits in the page cache; with a DAX device that is no longer necessary.

I think the read performance will be worse, because before I could hit memory, but now we are accessing the disk, and accessing a physical disk is not as fast as the page cache.

Compared with DRAM, of course the performance will be worse. I have seen some tests comparing DAX and non-DAX access, and sometimes the DAX performance is indeed worse. Yes, thank you.