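Hello, everyone. My name is Yasunori Goto. Ruan-san and I will talk about the forefront of the development for NVDIMM on Linux.

Here is the agenda. First, I will talk about the basics of NVDIMM on Linux and the issues of filesystem DAX. DAX means direct access mode. Next, Ruan-san will talk about how to solve the issues of filesystem DAX. Finally, we will give the conclusion.

Let me introduce myself. Since 2002, I have worked on Linux and related OSS. Now I lead Linux development at Fujitsu. For the past several years, I have worked on NVDIMM, and we have made many enhancements for it, for example fault location and fault prediction.

Let's start with the basics of NVDIMM on Linux. Here are the characteristics of NVDIMM. NVDIMM is a persistent memory device that is inserted into a DIMM slot like DRAM. The CPU can access it like DRAM, but its data persists even after power-off. Its access latency is not as low as DRAM's, but it is lower than NVMe's. Use cases include in-memory databases, hierarchical storage, distributed storage, etc. The most famous product is the Data Center Persistent Memory Module.

The impact of NVDIMM is that the traditional I/O layers become redundant. First, the page cache becomes redundant. It exists for slow I/O storage, but NVDIMM is fast enough without it. Next, the sync system calls become redundant. On persistent memory, a CPU cache flush is enough to make data persistent. In addition, system calls themselves become redundant. An application can read and write NVDIMM directly, so a system call is only overhead because it switches between kernel mode and user mode. Therefore, a new interface was expected for NVDIMM.

However, NVDIMM is difficult for ordinary software to use, because there are many things to consider. Let me explain. If software uses NVDIMM directly, first, it needs to prepare for sudden power-down. The CPU cache is still volatile. If the power goes down suddenly, data still in the cache may not be stored. Next, it needs to keep the data structures on NVDIMM compatible. Data on NVDIMM does not disappear, so if a structure changes in a software update, the updated software must still handle the old data. In addition, it needs to handle data corruption. If data is broken, the software needs to detect and correct it. Finally, it needs to manage data regions. Software must not use an area that is not free, where data is already stored. In addition, the kernel must assign regions with authority checks to avoid conflicts.

These are things that filesystems already do. A filesystem covers the previous considerations: format compatibility of the filesystem, data correction, journaling or copy-on-write, region management, authority checks, etc., and it is managed by the kernel. However, as I said, the traditional filesystem path is too heavy for NVDIMM, and a CPU cache flush is enough for persistency, so filesystems needed to provide a new access method.

Based on the previous considerations, Linux provides several interfaces for NVDIMM. The first is storage access. An application uses the traditional I/O interface, as with an SSD or HDD. Existing applications can use this mode as they are. Next is filesystem DAX. The page cache is skipped when you use filesystem DAX, and an application can access the NVDIMM area directly. However, the filesystem needs to support it. So far, XFS and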
ext4 support it.

Next is device DAX. An application can access the NVDIMM area directly if it calls mmap for /dev/dax. However, it cannot use read or write. Device DAX is for innovative new applications that manage the NVDIMM area in detail by themselves, and PMDK is provided for them.

Filesystem DAX and device DAX access the NVDIMM area through device files under /dev. For filesystem DAX, /dev/pmem is created, and for device DAX, /dev/dax is created. The ndctl command can create these devices when it creates a namespace. Here is an example for filesystem DAX. The ndctl create-namespace command creates /dev/pmem0. You can make a filesystem on /dev/pmemNs or /dev/pmemN. Please note that device DAX is a character device. Since you cannot use read or write on device DAX, you cannot use the dd command for backup.

As I said, PMDK is a set of libraries and tools for filesystem DAX and device DAX. It includes many libraries, language bindings, and tools. It is not only for Linux; you can also use PMDK on Windows. I would like to introduce some of them. The first is libpmem. It is a low-layer library. It calls mmap to use NVDIMM, calls the suitable CPU cache flush instruction, and so on. libpmemobj is a high-layer library that supports transactions on objects on DAX. It is for general use cases and is the highly recommended library in PMDK, but users need to understand how to use its transactions. As I said, device DAX is a character device, so you cannot use dd for backup. The daxio command is provided instead.

I'd like to introduce two new libraries of PMDK. The first is libpmem2. It is a new low-layer library. It introduces a new concept, granularity. PMEM2_GRANULARITY_PAGE is for traditional SSDs or HDDs. PMEM2_GRANULARITY_CACHE_LINE is for persistent memory. It is for the case where a process needs to flush the CPU cache to make data persistent. The third is interesting. PMEM2_GRANULARITY_BYTE is for persistent memory too, but it is for the case where the platform supports CPU cache persistency. libpmem2 also introduces new functions to get the unsafe shutdown status and bad blocks. This library uses the library of the ndctl command internally to get this information. Unfortunately, its
interface is different from the old libpmem.

The second new library is librpma. It is a new library for RDMA. The first library for RDMA was librpmem. It has stayed in experimental status because it has no users. librpma was created from users' requirements, because there is a gap between what the old librpmem provides and what customers expect. Here is a quote from the librpma presentation. Currently, the official release of this library is planned for 2020.

I need to say that filesystem DAX is still in experimental status. It is a highly anticipated interface. The way to manage NVDIMM with it is almost the same as with a traditional filesystem. Operators can use traditional commands to manage the NVDIMM area. If an application calls mmap for a file on filesystem DAX, it can access NVDIMM directly. In contrast, device DAX requires full management by the PMDK tools. Otherwise, one piece of software needs to own the whole namespace. However, an experimental message is shown when a filesystem is mounted with the dax option, because there are some issues in the kernel layer. It has been in experimental status for a few years. So, in this talk, I would like to explain why it is still experimental and what the obstacles are.

OK, let's start with the issues of filesystem DAX: what has been solved and what the current issues are. In summary, there are two big reasons. The first is that filesystem DAX combines storage and memory characteristics. This causes corner-case issues. The second is that additional features were required, but they are, or were, difficult to make. One such feature is configuring DAX on and off for each inode, such as a directory or file. The other is coexistence with copy-on-write filesystems.

So, let me introduce the corner-case issues. The most important problem is updating the metadata of files. In filesystem DAX, we expected that an application could make data persistent with only a CPU cache flush, as I said. However, this also means that the kernel or filesystem has no chance to update metadata. As a result, the update time of a file may not be correct. If an application writes some data to
a file on filesystem DAX and a user removes some blocks of the file with the truncate system call, the kernel cannot coordinate them, and the data of the file may be lost. If data is transferred by DMA or RDMA to a page that is allocated on filesystem DAX, a similar problem may occur.

Here is the current status of this issue. For general write access, it was solved by introducing the new MAP_SYNC flag of mmap. When it is specified, a page fault occurs on write access, and the kernel updates the metadata at that time. PMDK already specifies this flag. For DMA or RDMA data access from the kernel or driver layer, it is solved: truncate can wait until the RDMA finishes. However, for the user process layer, such as InfiniBand or video cards, it is not solved. Truncate cannot wait for the completion of the transfer because it may take too long. Fortunately, there is a workaround: on-demand paging (ODP). With ODP, the driver or hardware usually does not map the pages of the DMA or RDMA area for the application in advance. It maps the pages when the application accesses them. Then the kernel or driver can coordinate the metadata at that time. Mellanox's newer cards have such a feature.

The next issue is DAX on and off for each inode. Here are the expected use cases. The first is the need for more fine-grained settings. Users may want to change the DAX mode for each file. The second is changing the DAX attribute from an application. Configuration is always painful for administrators, so if an application can detect and change it, that will be helpful for them. The third is performance tuning. Since the write latency of NVDIMM is a bit slower than RAM, users may want to use the page cache by setting DAX off. Finally, it is good as a workaround when filesystem DAX has a bug.

So, what is the problem? If a filesystem changes the DAX attribute, it needs to switch the filesystem's methods between the DAX ones and the normal file ones. But the old methods may still be executing. In addition, data in the page cache must be moved silently when the DAX attribute becomes off. This problem was very difficult.
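As an aside on the MAP_SYNC flag of mmap described above, the following C sketch shows how an application might request such a mapping itself. It is only an illustration under stated assumptions: map_for_pmem is a hypothetical helper (not a PMDK or kernel function), the fallback #define values are the ones used by the Linux UAPI headers on x86 and other generic architectures, and retrying without MAP_SYNC is a design choice of this sketch, not something from the talk.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for older libc headers (values from the generic Linux UAPI). */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/*
 * Hypothetical helper: map `len` bytes of `path` read/write.
 * It first asks for a MAP_SYNC mapping; if the kernel refuses (for example
 * because the file is not on a DAX filesystem), it retries with a plain
 * shared mapping. *is_sync reports which kind of mapping was obtained.
 * Returns NULL on failure.
 */
static void *map_for_pmem(const char *path, size_t len, int *is_sync)
{
    assert(path != NULL && len > 0 && is_sync != NULL);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;

    /* MAP_SYNC is only honored together with MAP_SHARED_VALIDATE. */
    *is_sync = 1;
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (addr == MAP_FAILED) {
        /* No synchronous mapping available: fall back to a normal shared
         * mapping, where msync() is still needed for durability. */
        *is_sync = 0;
        addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }

    close(fd); /* the mapping stays valid after the descriptor is closed */
    return addr == MAP_FAILED ? NULL : addr;
}
```

On a filesystem mounted with the dax option, is_sync should come back as 1, and a CPU cache flush is then enough to make stores persistent. Elsewhere the helper falls back to the ordinary mapping, and the caller still has to use msync.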
Fortunately, it was solved. How was it solved? The DAX attribute is changed only when the inode cache of the file is not loaded in memory. The filesystem can load the suitable methods for each attribute when it reloads the inode into memory. The page cache of the file is also dropped by then.

Users can use this feature with a new mount option, mount -o dax=inode. The DAX attribute is changed by a command like this XFS example: xfs_io -c 'chattr +x' sets DAX on, and -x sets DAX off. Please note the following. All applications that use the target file must close it to change the DAX attribute. The filesystem postpones changing the DAX attribute until the inode cache and page cache of the file are dropped. Currently, the administrator may still need to drop the caches to achieve this, with echo 3 > /proc/sys/vm/drop_caches, because inode caches may remain due to a race condition. However, this operation affects all page caches and slabs of the system. So, we are trying to solve this problem. We made patches to evict the inode and dentry caches as soon as possible when the DAX attribute is changed. They set the I_DONTCACHE flag and the DCACHE_DONTCACHE flag for this, and then sync the inode before evicting it.

The final issue is coexistence with copy-on-write filesystem features such as reflink and dedupe of XFS or Btrfs. With this feature, if there is the same data block in different files, the filesystem can merge them into one shared block. So far, it was enough if only the filesystem managed such blocks. But with filesystem DAX, that is no longer enough. The first problem is that the memory management layer also needs to understand merged blocks. Reverse mapping to multiple files is necessary. A merged block has only one struct page, but there is no way to save the offsets of multiple files in it. This is very difficult. The second is the need to enhance iomap and filesystem DAX for copy-on-write. iomap is a newer interface that replaces the block I/O layer, in other words, struct buffer_head. It can lock and submit I/O for multiple pages at once. Filesystem DAX depends on iomap, but iomap does not support copy-on-write. Though
XFS already uses iomap, Btrfs also needs to use it. Ruan-san will talk about how to solve them.

OK, it's my turn to talk about how to solve the issues of filesystem DAX. I'll do it in two parts. The first is how to support reflink/dedupe for filesystem DAX. The second is how to improve the NVDIMM-based reverse mapping.

Please allow me to introduce myself. My name is Ruan Shiyang. I'm a software engineer at Fujitsu Nanda. I used to work on embedded systems. Currently, I'm focusing on Linux filesystems and persistent memory.

Here is the background of the issues we need to solve. Currently, FSDAX is still in experimental status. This is because reflink and FSDAX cannot work together. Let's try to use them together and see what happens. First, create a new XFS filesystem with the reflink feature enabled, and try to mount it with the dax option. Then an error message is shown like this. After a little investigation, we found that dmesg tells us the reason: DAX on XFS is experimental, and it cannot be used together with reflink. So, what are reflink and FSDAX? Why can't they work together? I will explain in the next pages.

What is reflink? It is a filesystem feature that lets files share their extents for the same data blocks. For example, copy file A to file B. A normal copy takes some time to duplicate the data extents, and each file has its own data extents. As a result, the larger the file we copy, the longer it takes. But what happens with a reflink copy? We execute the cp command with the --reflink=always flag. It finishes immediately, because it does not duplicate any data extents for file B. It just maps file B's data range to file A's extents. So it costs no more time and occupies no more disk space, even if the file is very large. As we can see, reflink has two advantages: fast copy and storage savings.

Since these two files share data extents, what happens if we want to write some data to file B? To prevent both files from being modified, we need a copy-on-write mechanism here. It copies the shared data
extents before new data is written. This process is shown in the figure below. When we are going to write data to file B, the system allocates a new extent and copies the data from extent 1 to the new extent. Then it writes the user data into it. Finally, it remaps the new extent to file B. So, this is reflink.

And what is FSDAX? It is also called filesystem DAX. FSDAX is a mode of an NVDIMM namespace. In this mode, the page cache is removed from the I/O path. It also allows mmap to establish direct mappings to the persistent memory media.

So, why can't reflink and FSDAX work together? We have investigated this problem in depth and found two main issues. First, we need to enhance iomap and FSDAX for the copy-on-write mechanism. The iomap interface needs to be extended to support copy-on-write. The implementation of copy-on-write needs to be added to FSDAX. The filesystem also needs to adapt to the new iomap interface and support the coexistence of the two features. Second, the memory management layer also needs to understand merged blocks. To achieve this, we need to improve the current NVDIMM-based reverse mapping. I will explain how we solve these issues in the next few moments.

Let's start with the first issue: we need to enhance iomap and FSDAX for copy-on-write. Let's take a look at why. Look at the flowchart on the right. A normal write first loads data from disk into the page cache, then writes data to the page cache, and finally flushes the page cache to the disk. This process implies a copy-on-write mechanism. But in FSDAX mode, as mentioned before, the page cache is removed from the I/O path. Data is written without a copy. As a result, the data block in which the data is written will be incomplete.

So, to solve this issue, what must be implemented? As we know, XFS uses the iomap model. It implements iomap_begin and iomap_end. FSDAX implements the actor interface. Reflink is implemented by XFS. A new extent is allocated in iomap_begin, but it also needs to store the source extent for copy-on-write. In the actor interface, the
whole copy-on-write mechanism needs to be added: copy the source data from the source extent to the new extent allocated in iomap_begin, and write the user data into it. Finally, the new extent should be remapped to the file. Otherwise, it seems that nothing has changed when we read the file to check whether the data has been written. In summary, we need to enhance the current framework at these three levels: iomap, FSDAX, and XFS.

Let's start with iomap. The iomap model provides a structure named iomap to store the destination where the data is to be written. iomap_begin allocates the new extent and fills its properties into the iomap structure. But this is not enough for the copy-on-write mechanism. Copy-on-write also needs the source extent information, such as the start block number, which means where to copy from, the length, which means how much to copy, and maybe some others. So, we introduced a new iomap type to distinguish copy-on-write from normal write, and another iomap structure called srcmap to remember the source extent information and pass it through the whole iomap process. The code diff is shown here.

After creating the new srcmap, what we need to do is fill in its members. The iomap structure is always filled at the end of iomap_begin. The srcmap structure is filled only when the extent is shared. If it contains real data, we store the extent information in the srcmap and set the type to IOMAP_COW. In this way, copy-on-write can be executed in the write path and the mmap path.

Let's look at the write path. First of all, the DAX driver provides an interface called direct_access to translate an offset in the block device into a physical memory address in persistent memory. With this feature, reading or writing data can easily be replaced with the memcpy function. So in the write path, when IOMAP_COW is set, which indicates that copy-on-write needs to be executed, we can use this interface to obtain the address of the source extent from the srcmap. Then we memcpy the source data to the destination extent. After that, we memcpy the user data to the destination extent.

In the mmap path, we
will get a virtual address by mmapping a file. Writing to the virtual address causes a page fault. FSDAX supports PTE faults and PMD faults. The page fault handler uses the iomap model as well. So, copy-on-write for the mmap path is similar to that for the write path. We just translate the source address with direct_access from the srcmap and memcpy the source data to the destination extent. The final step is a bit different: we just associate the page with the VMA. The write itself will then be done in user space.

Dedupe is another important mechanism of reflink. It allows existing files to share the same data extents in order to save storage. It requires a compare function. In normal mode, the generic dedupe function compares data in the page cache. But that is not suitable for FSDAX mode because there is no page cache. So, we introduced a DAX compare function for FSDAX. Again, we use direct_access to get the addresses of the two data extents and then use the memcmp function to check whether they are the same. If they are, the two files can share the data extent. However, we should take care to check that the two files either both enable FSDAX or both do not. Files with different DAX flags cannot be deduped. Finally, we also need to remap the new extent to the file, or clean up if an error occurs.

So far, we have been able to make reflink and FSDAX work together in the write path and the mmap path. However, it only looks functional on the surface. Underneath, there is another issue that needs to be fixed. We usually think of NVDIMM in FSDAX mode as a block device, so we see files sharing the same data extents because of reflink. But since it is NVDIMM, we also need to think of it as a memory device. In other words, files are sharing the same memory pages. So, the memory management layer also needs to understand the shared blocks.

As a memory device, NVDIMM may fail at the hardware level. That means a page is broken and cannot be accessed anymore. The system triggers memory failure to handle this. When a memory failure occurs, the system tracks all processes associated with the broken page and then sends a signal to kill those processes. The tracking from a memory
page to a file is usually called reverse mapping. In this case, we call it NVDIMM-based reverse mapping. The current NVDIMM-based reverse mapping can only support one-page-to-one-file mapping. However, for reflink, because files share the same page, we need to improve it to support one-page-to-multiple-files mapping.

To solve this problem, I have thought of many ways. Here is the approach. Each idea is based on the results of the previous one. In idea 1, I created a dax-rmap rbtree to store more file mappings, which caused huge overhead. I reduced the overhead in idea 2 by introducing the storage_lost interface. Furthermore, in idea 3, I removed the remaining overhead caused by storage_lost and added an interface to support more than FSDAX mode.

Let's start with idea 1. To support one-to-multiple mapping, the first thought is to store more file mappings in one page. The old implementation uses page->mapping to store the file mapping and page->index to store its offset. It can store only one file mapping. So, I introduced a dax-rmap rbtree to store more file mappings. To save memory, we associate the file mapping and offset with page->mapping and page->index the first time. This is the same as the old implementation. When the page is shared by multiple files, the dax-rmap rbtree is created at the second association, and the file mapping and offset are inserted as tree nodes at the second association and later. The rbtree's root is stored in the unused member page->private at the second association. This distinguishes whether a page is associated only once or more times. If it is associated twice or more, page->index is used as the reference count.

After associating, we are ready to reverse-track in memory failure. As shown in the figure on the right, memory failure first locks the broken page. Then it iterates the file mapping's i_mmap to find the VMAs of processes and collects a list of processes that need to be killed. Then it unmaps those page mappings and kills the processes. Finally, it unlocks the page and exits. With the dax-rmap rbtree, we only need to add an iteration over it outside the process
tracking. This double iteration makes one-to-multiple reverse mapping possible. However, in some cases, every page may contain a dax-rmap rbtree, which may cause huge overhead. It should not be underestimated. Also, this per-page tracking method only works for files, because the association only supports files. Metadata cannot be handled if the memory failure hits it.

To reduce the huge overhead while still being able to build the reverse mapping, we then looked into filesystem features. XFS has a feature called rmapbt, which records the owners of each data extent. The owner can be a file, metadata, or other filesystem data. So, we introduced a filesystem interface called storage_lost that returns an owner list, used instead of the dax-rmap rbtree to reduce the huge overhead. The interface is implemented by the filesystem. First, it queries the rmapbt for the owners of the broken block one by one. Second, it returns the list of owners for memory failure to use. The association can be very simple. We just store the filesystem's superblock in page->mapping, so that it can be looked up when we execute tracking, and store the reference count in page->zone_device_data, which was not used in the memory failure case before. The tracking process is similar to the previous one. The difference is that it calls storage_lost to create an owner list. This list is used for tracking from the page to the file mappings. Then it iterates each file mapping's i_mmap to track processes. The rest is the same as before. In this way, the huge overhead is reduced.

However, it still leaves a little overhead: we need extra memory for the owner list. It is better to remove the overhead completely. Instead of creating the owner list, we handle every owner during the query. We still call storage_lost to query the owners of the broken block. Since the filesystem queries owners one by one, we can call the memory failure handler to handle each owner as soon as it is found. Then, as usual, it tracks the owner's i_mmap to find the VMAs of processes. As a result, the extra memory is no longer necessary and has been removed. The one-to-multiple reverse mapping is implemented.

Furthermore, there is
room for improvement. Currently, memory failure can only support FSDAX mode. It needs to become a common fault handler for NVDIMM. In order to support more than FSDAX mode, we dove into the driver level. We introduced a driver interface called memory_failure for each kind of NVDIMM device. For the pmem device, it is implemented by calling storage_lost. The one for the DAX device is going to be finished in the future.

The one-to-multiple reverse mapping has been implemented so far. There is no useless overhead. With the memory_failure interface, it is compatible with all NVDIMM modes. With the storage_lost interface, it is compatible with all filesystems. But currently, some difficulties remain. The first is that the pmem driver cannot obtain the filesystem from an LVM or other mapped device. We can use the get_super function to get a filesystem that is created directly on the device, but it is not suitable for mapped devices. So another method needs to be found. The second is that storage_lost requires the filesystem to have the rmapbt feature to track owners from a block. For example, the realtime device of XFS does not support it, so we cannot track for realtime devices for now. The same goes for other filesystems.

Conclusion. We talked about the following: the basics of NVDIMM on Linux and the issues of filesystem DAX, and supporting reflink and dedupe for FSDAX and fixing the NVDIMM-based reverse mapping. The community has made many enhancements for NVDIMM on Linux, and we have worked on NVDIMM to remove the experimental status of filesystem DAX. We expect it will be achieved soon. Thank you very much for listening.
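
To illustrate the last reverse-mapping idea from the talk (handling each owner of a broken block as soon as the query finds it, instead of building an owner list first), here is a small user-space model in C. It is not kernel code and not the patch set itself. The names struct rmap_record, for_each_owner, and mark_owner are hypothetical, a flat array stands in for the filesystem's reverse-mapping records, and the callback stands in for the memory failure handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One toy reverse-mapping record: blocks [start, start + len) belong to
 * `owner` (think of it as an inode number). With reflink, several records
 * may cover the same blocks. */
struct rmap_record {
    uint64_t start;
    uint64_t len;
    uint64_t owner;
};

/* Called once for every owner of the failed block. */
typedef void (*owner_fn)(uint64_t owner, uint64_t offset_in_extent, void *ctx);

/*
 * Walk the records and hand each owner of `bad_block` to `fn` as soon as it
 * is found, so no intermediate owner list has to be allocated.
 * Returns the number of owners handled.
 */
static size_t for_each_owner(const struct rmap_record *recs, size_t n,
                             uint64_t bad_block, owner_fn fn, void *ctx)
{
    assert(recs != NULL && fn != NULL);

    size_t handled = 0;
    for (size_t i = 0; i < n; i++) {
        const struct rmap_record *r = &recs[i];
        if (bad_block >= r->start && bad_block - r->start < r->len) {
            fn(r->owner, bad_block - r->start, ctx);
            handled++;
        }
    }
    return handled;
}

/* Example handler: remember which (small) owner ids were notified by
 * setting one bit per owner in a 64-bit mask. */
static void mark_owner(uint64_t owner, uint64_t offset_in_extent, void *ctx)
{
    (void)offset_in_extent;
    *(uint64_t *)ctx |= UINT64_C(1) << (owner & 63);
}
```

With records such as {100, 8, 1}, {100, 8, 2}, and {200, 4, 3}, a failure at block 103 reaches owners 1 and 2 through the callback, while a failure at block 201 reaches only owner 3. The caller never has to allocate or free a list, which is the memory saving the talk describes.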