Hello everyone. My name is Yasunori Goto. Today I will talk about the forefront of Linux kernel development for persistent memory. Here is the agenda. First, I will talk about the basics of NVDIMM. Next, the basics of DAX. Then, the issues of filesystem DAX. Finally, we will talk about supporting reflink/dedupe for FSDAX and fixing its issues with reverse mapping.

Let me introduce myself briefly. I have worked on Linux and related OSS since around 2002, mainly supporting trouble cases of the Linux kernel. Currently, I am the leader of the Fujitsu Linux kernel development team, and recently I have been working on NVDIMM. So, let's start with NVDIMM on Linux.

An NVDIMM is a persistent memory device which can be installed in DIMM slots like DRAM. The CPU can access it directly, and it keeps data persistently. Its latency, capacity, and cost characteristics fall between those of DRAM and NAND storage. Expected use cases are, for example, in-memory databases, hierarchical storage, distributed storage, key-value stores, and so on. A famous product is the Intel Data Center Persistent Memory Module.

The impact of NVDIMM on the kernel is very large, because many traditional assumptions no longer hold. For example, the page cache was introduced to mitigate slow I/O storage, but NVDIMM can be accessed directly, so the page cache can be skipped. Likewise, the sync family of system calls becomes unnecessary for persistence: an application can make its data durable with cheap CPU cache flush instructions, without entering the kernel at all, and data can be exchanged through persistent memory between processes in user mode. New interfaces are therefore expected for NVDIMM.

However, NVDIMM is difficult for ordinary software to use directly; software has to take care of many things. First, it must flush CPU caches explicitly, because if a system power-down occurs, data which still sits in the CPU cache is lost. Second, the data structures placed on NVDIMM must be designed carefully, since they survive across reboots. Third, software must verify its data, and detect and recover corrupted data after a failure. Fourth, the data areas must be managed: the kernel must assign an area to a suitable process with a permission check, otherwise conflicts occur.

These requirements are exactly what a filesystem already provides: format compatibility, data-corruption detection, journaling or copy-on-write, region management, permission checks, and so on.
So, using a filesystem looks like the right way. However, filesystems are slow: the traditional I/O stack is too fat and heavy for a device that could be reached with nothing more than CPU cache flushes. Therefore, new access interfaces were developed for NVDIMM.

Here are the access methods of NVDIMM. The green line is storage access: the application uses the traditional I/O stack interface, the same way as with SSDs or HDDs, so existing applications can use NVDIMM in this mode without modification. The blue line is filesystem DAX: the page cache is bypassed for read and write, and in addition the application can mmap a file and access the NVDIMM area directly through it. XFS and ext4 support this mode, and it is the main target for applications which are aware of NVDIMM. The red line is device DAX: the application accesses the NVDIMM area directly via the device file /dev/daxX.Y. This device file supports only open, mmap, and close; in other words, read and write cannot be used on it. It is for new applications developed specifically for NVDIMM.

In addition, the device file depends on the namespace mode: /dev/pmemNs is created for sector mode, /dev/pmemN for filesystem DAX (fsdax mode), and /dev/daxX.Y for device DAX; the ndctl command creates the device file when it creates a namespace. Here is an example of filesystem DAX: you can make a filesystem on /dev/pmemNs or /dev/pmemN. Please note that /dev/dax is a character device: device DAX cannot use read or write, so for example a backup with the dd command is not possible.

This is where PMDK comes in. PMDK provides libraries and tools for both filesystem DAX and device DAX: low-level support, transaction support, management tools, and so on. One interesting point is that it is not Linux-only; it can also be used on Windows. Let me introduce the libraries briefly. libpmem and libpmem2 are low-level libraries; libpmem2 is the newer generation with additional features. libpmemobj provides transaction support on top of them, and for general use cases it is a good starting point.

Which DAX mode should you use? Again, /dev/dax is a character device which does not accept normal I/O commands. In contrast, with filesystem DAX the management of the NVDIMM area is the same as with any filesystem: administrators can use the traditional commands to manage it, and applications can not only access the NVDIMM area directly but also keep using traditional system calls. Device DAX, on the other hand, requires dedicated management by the tools of PMDK; otherwise, your software needs to manage the whole namespace by itself.
Because of this, many users will choose filesystem DAX. However, an "experimental" message still appears when a filesystem is mounted with the DAX option, and it has stayed that way for years because of difficult problems. So I will talk about the reasons: the problems of filesystem DAX and their current status.

Roughly, there are two reasons why filesystem DAX is hard. First, filesystem DAX has to make the characteristics of storage and of memory coexist, and corner-case problems keep coming out of that combination, so filesystem DAX has always been difficult. Second, each fix turns out to need still more work, which makes it harder. The concrete problems are: updating metadata, the unbind race, DAX on/off per inode, and coexistence with copy-on-write filesystems.

The first problem is updating file metadata. On filesystem DAX, the application writes data directly with CPU instructions to get persistence, so the kernel and the filesystem cannot know when the data was actually updated; the timing of the update is invisible to them. So, if an application writes some data to a file on filesystem DAX and a user removes some blocks from the file by truncate, the kernel cannot coordinate them, and data of the file may be lost. If data is transferred by DMA or RDMA into a page which is allocated for filesystem DAX, a similar problem may occur.

Here is the current status of the metadata update problem. General write access by applications was solved by introducing the new MAP_SYNC flag of mmap: a page fault occurs on write access, and then the kernel can update the metadata. PMDK already specifies this flag. For DMA or RDMA access, if it is initiated in the kernel or driver layer, it was solved by making truncate wait until the RDMA finishes. However, if it is initiated in the user-process layer, as with InfiniBand cards, it is not solved: truncate cannot wait for the completion of the transfer, because it may take far too long. A workaround was found, though: On-Demand Paging (ODP). In ODP, the driver or hardware does not map the pages of the DMA/RDMA area for the application in advance; it maps them when the application actually accesses them, so the kernel and driver can coordinate metadata at that time. Newer Mellanox (NVIDIA) cards have this feature.
The next issue is the unbind problem. Unbind is basically a sysfs interface to disconnect or hot-remove a device, and each device driver provides its handler for it. Though an NVDIMM is not physically a hot-pluggable device, this interface can be used to disable an NVDIMM and switch the mode of a namespace: for example, to change a namespace from filesystem-DAX mode to device-DAX mode, or to let the user use NVDIMM like normal RAM. Here is an example of using a namespace like normal RAM: you write the device ID to the unbind sysfs file.

The problem is that unbind is effectively a surprise-remove interface: there is no way for unbind to fail, even if a user is still using the device, so the device must be disabled forcibly. A race condition between filesystem DAX and unbind was reported in February 2021. To solve it, filesystem DAX needs to be able to disable a range of the NVDIMM area immediately. Currently, this is not solved yet; it should become solvable after Ruan's work, which he will talk about today, is finished. His new code will help to solve it, I hope.

The next problem is DAX on/off for each inode. The expected use cases are the following. The first is the need for more fine-grained settings: changing the DAX mode depending on each file. The next is changing the DAX mode attribute from the application itself — configuration is always painful for administrators, so if applications can detect and change it, that is helpful. The third is performance tuning: since the write latency of NVDIMM is a bit higher than DRAM's, a user may prefer to go through the page cache by switching DAX off. A final reason is as a workaround for when filesystem DAX has a bug.

So, what was the problem with implementing this? If the filesystem changes the DAX attribute of a file, it needs to switch the file's operations between the DAX and the normal variants, but those operations may still be in use at that moment. Also, the data in the page cache must be moved out silently when the DAX attribute is turned off. These problems were very difficult.
Fortunately, this issue was solved. The DAX attribute is changed only when the file's inode cache is not loaded in memory; the filesystem can then load the suitable operations for each attribute when it reloads the inode into memory. The page cache of the file is also dropped. You can use this feature with the new mount option shown here: mount -o dax=inode. The DAX attribute itself is changed per file by a command. Please note the following: all applications which use the target file must close it before the DAX attribute can change; the filesystem will postpone changing the attribute and dropping the inode cache and page cache of the file until then.

The remaining issue is coexistence with copy-on-write filesystems. Such a filesystem has a copy-on-write feature: if the same data block appears in different files, the filesystem can merge it into one shared block. So far it was enough that only the filesystem managed such blocks, because a page-cache page was allocated for each file over the block, so the memory-management layer did not need to know about the sharing. In filesystem DAX it becomes a problem: a merged block is merged memory itself, and that affects the memory-failure case.

So, what is necessary? The first thing is a copy-on-write implementation for filesystem DAX. Currently, there is no implementation of reflink/dedupe for filesystem DAX. iomap, which is the new I/O block-layer interface used instead of buffer_head, has an interface for copy-on-write filesystems, and XFS filesystem DAX also uses iomap, but there is no code to use copy-on-write and DAX at the same time. The next thing is a way to chase plural files from a merged block. When a memory failure occurs, the kernel needs to kill the processes which use that memory, and to achieve that, it needs to find all processes from the merged page or block. But a merged page has only one struct page, and there is no space for plural files in it. Ruan will now talk about how to solve these.

Hi everyone. I am going to show you what we did to solve the issues of filesystem DAX. It is in two parts: the first is how to support reflink/dedupe for FSDAX; the second is how to improve the NVDIMM-based reverse mapping.
My name is Ruan Shiyang. I am a software engineer at Fujitsu Nanda. I used to work in embedded development; currently, I am focusing on the Linux filesystem and persistent memory.

Here is the background of the issues we need to solve. FSDAX is still an experimental feature on the XFS filesystem, because reflink and FSDAX cannot work together. We can try to use them together and see what happens. First, create a new XFS filesystem with the reflink feature enabled, and then try to mount it with the DAX option. We then see an error, and the more detailed reason is shown in the kernel message: DAX on XFS is experimental, and DAX and reflink cannot be used together.

So, what are FSDAX and reflink, and why can they not work together? I will explain them in the next pages. First, what is reflink? It is a filesystem feature by which files can share their extents for the same data blocks. The figure on the right shows a comparison between a normal copy and a reflink copy. The upper part is the normal copy: it costs time and storage space to duplicate the data extents. The lower part is the reflink copy: it does not actually duplicate any data extents; instead, it just remaps the original data extents into the new file. So, without data duplication, reflink has these two advantages: fast copy and saved storage. Since the two files are sharing data extents, to prevent both of them from being modified together, we need a copy-on-write mechanism: it copies the shared data extents to a new destination before user data is written. This is what reflink means.

Then, what is FSDAX, also called filesystem DAX? It is a mode of an NVDIMM namespace. In this mode, the page cache is removed from the I/O path, which allows mmap to establish direct mappings to the persistent memory media.

So, why on earth can reflink and FSDAX not work together? We have investigated this problem in depth and found two main issues. The first one is that we need to support the copy-on-write and dedupe mechanisms in FSDAX: the mmap interface needs to be extended to support copy-on-write.
The implementation of copy-on-write and dedupe has to be added to FSDAX itself. The other issue is that we need to improve the current NVDIMM-based reverse mapping by supporting a one-to-N reverse mapping for NVDIMM. I will explain how we solved these issues in the next few minutes.

Let's start with the first issue: supporting reflink/dedupe for FSDAX. First, I would like to compare normal buffered I/O with FSDAX. Here is a simplified write process of buffered I/O; the main purpose is to describe what it does in the iomap framework. A write is initiated from user space and comes into the iomap framework. We get the destination from iomap_begin in XFS; in the buffered I/O case, it allocates delayed extents. Then, in the actor, the destination data is read from disk into the page cache, the user data is written from user space into the page cache, and the page cache is marked dirty. At the end, there is some cleanup work to do in iomap_end in case of error. The dirty page cache will be synced to disk later, and that job includes remapping the newly allocated extents. As the figure shows, this process of going through the page cache amounts to a copy-on-write mechanism.

But this is quite different in FSDAX, even though it uses the iomap framework as well. In the FSDAX case, iomap_begin allocates immediate extents, and the actor does quite different things: it gets the NVDIMM address by calling the direct-access interface and writes the user data directly to NVDIMM, without any page cache involved. What's more, there is no extra work in iomap_end: since there is no longer any need for sync, there is no opportunity to remap the new extents. Comparing this with buffered I/O, we can see that the copy-on-write mechanism is missing in FSDAX.

So, to solve the issue, what must be implemented? Looking into the iomap framework: we need to allocate new extents for copy-on-write use, and store the source extent information somewhere — so we introduce a srcmap to store the source extent. Then we copy the source extent's data into the new extent and write the user data into it. This is the copy-on-write operation, and it is needed in both the write and mmap paths.
A remap is also necessary after a copy-on-write, and iomap_end is a good place to do that. Finally, we still have to implement a DAX-specific dedupe method. Let's start implementing them.

The first necessary thing is the source extent information. The iomap framework only uses a structure named iomap to tell the actor the destination: where, and how long, the data to be written is. That is not enough for a copy-on-write mechanism; the source extent information is also needed, including its start block number (where to copy from), its length (how much to copy), and flags if needed. Kernel developer Goldwyn has introduced another structure, named srcmap, to remember and pass the source extent information.

The next necessary thing is to fill the members of the srcmap. XFS only fills the iomap at the end of iomap_begin; we need to let the srcmap be filled too. As shown in the flowchart, when a write starts, we first find the destination extent. If it is a shared extent, which means it needs copy-on-write, we allocate a new extent. The destination we found should then be treated as the source data extent, because all changes will be made in the newly allocated extent. So the srcmap is filled with the original destination extent, the iomap is filled with the newly allocated extent, and the IOMAP_F_SHARED flag is set for the actor's use.

In the DAX actor, adding the copy-on-write operation is necessary. The current actor only writes user data to the destination through the DAX-specific interface called direct access, which is used to translate an iomap into an NVDIMM address. So, before the user data is written, we need a pre-copy. Let's see the flowchart: we add a copy-on-write branch which gets the source address from the srcmap and copies the source data to the destination address we got at the beginning. After that, we write the user data to the destination address. In this way, copy-on-write is executed in the write path.
In the mmap path, adding copy-on-write is also necessary. Different from a normal page fault, FSDAX has its own specific PTE fault and PMD fault handlers, for 4KB pages and 2MB pages. They use the iomap framework too, but for now they only find the destination page and associate it with the VMA. To support copy-on-write, we need a pre-copy before associating. We use direct access to get the destination address and, in addition, the PFN, which is needed for the association. Then, as before, we get the source address from the srcmap and copy the source data to the destination address. After that, we associate the VMA with the PFN we got at the beginning. At this point, the copy-on-write mechanism has been added to FSDAX.

Since FSDAX is synchronous, we need to remap the extents we changed right away; otherwise, because the newly allocated extent is not mapped, the metadata has not been updated, and as a result the file would not contain the copy-on-write extents. iomap_end is a perfect place to do this job. In addition, if something went wrong in the actor, it is also a perfect place to clean up those extents. Of course, if it is not a copy-on-write, nothing needs to be done there.

Besides the copy-on-write mechanism, deduplication also has to be adapted to FSDAX. It is used to reduce redundant data and storage costs. The core function compares a range of data from two files to check whether they are the same, byte by byte. There is a generic function for normal files which compares the data already read into the page cache. However, FSDAX has no page cache, so that generic function is not suitable for it; we need a new DAX compare function which compares the data by accessing it directly from NVDIMM. As shown in the flowchart: direct-access the two files to get their NVDIMM addresses, then compare the data with a memory compare to get the result, same or not. If they are the same, the range of the two files can be deduplicated to share the same extents. However, we must pay attention to checking whether the two files both have FSDAX enabled or not: files with different DAX flags cannot be supported.

By now, we are able to make reflink and FSDAX work together in the write and mmap paths. However, it only looks functional on the surface; in depth, there is another issue that needs to be fixed.
As a block device, an NVDIMM lets files on it share the same data extents thanks to reflink. But since it is an NVDIMM, we also have to think of it as a memory device: in other words, the files are sharing the same memory pages, and the memory-management layer needs to understand that too. On the next pages, I will show you how we solved it.

As a memory device, memory pages may fail at the hardware level, meaning the page can no longer be accessed. The kernel triggers memory failure handling for this. When a memory failure occurs, the system tracks down all processes associated with the broken page and then sends signals to kill those processes. The tracking from a memory page back to a file is usually called reverse mapping; in this case, it is the NVDIMM-based reverse mapping. The NVDIMM-based reverse mapping can only support a one-page-to-one-file mapping. For reflink, however, files are sharing the same page, so we need to improve it to support a one-page-to-multiple-files mapping.

To achieve this, I thought of many approaches. The first idea is described in the figure on the right; it was simple to implement and it worked, but it is not a good idea because of its huge overhead. After that, I tried several other ways, but none of them was perfect. The current strategy is to look inside the filesystem to find the one-to-N relationship, but there are still some difficulties. For example, memory-failure information is basically in page units, but we need to find where that page is inside the filesystem; and the filesystem may be created on a partition, on LVM, or on something else, which changes the relative offset inside the filesystem.

So, on the next page, I will show you how the memory-failure signal travels through the associated layers. As we can see in the middle of the figure, two processes are sharing one DAX file, and a machine check is triggered because a shared page inside the file is broken. Memory failure handling takes over this exception and initiates a reverse mapping from the bottom to the top: from the MM layer through the device driver, the block device, the filesystem, and the files, finally reaching all processes using the broken page.
After that, it sends signals to those processes to kill them with SIGBUS. So, this enhanced reverse mapping is the key to solving the problem. Since it spans many layers, we need to implement the reverse mapping on each layer. The first part is from the NVDIMM driver to the DAX device. The second part is from the DAX device to the filesystem; this is the most complicated one, and we need to introduce a DAX holder registration mechanism to cope with the different ways a storage device can be used. The third part is from the filesystem to the files, which requires the rmapbt feature. The last but not least thing is compatibility for filesystems without reflink or rmapbt — for example, the ext4 filesystem.

We start from the first part. Memory failure handling always accepts a pfn as its argument: the page frame number in system memory. So, we first need to translate it into an offset inside the NVDIMM. Then, according to the mode in use, the offset needs a further translation by each driver. For fsdax mode, the pmem driver translates the offset linearly. For devdax mode, the DAX driver needs to calculate the offset according to the dax ranges inside the device. In this session, we only take fsdax mode into consideration.

So now we have the offset inside the pmem device. pmem is also a block device, so it can be used in as many ways as any other block device: making a filesystem on it directly, splitting it into many partitions, creating LVM volumes that combine many pmem devices, or even creating nested partitions and mapped devices. To make the reverse mapping suitable for every kind of usage, we introduce the DAX holder registration mechanism, which abstracts them into one behavior. The holder represents the inner layer of a pmem device and is registered when that holder is mounted or initialized. The one behavior shared by every kind of usage is notifying the failure into the inner layer: each holder has to implement a notify-failure interface.

Let's start from the easiest case: the inner layer is a filesystem. This case is created by making a filesystem directly on a pmem device, with no partitions inside. Here the reverse-mapping translation only needs to subtract the fixed block-device header length.
The second case is that the inner layer is a partition, created by partitioning tools; there could be one or more partitions inside. We need to find which partition the broken page is located in, so that we can get the offset inside the inner filesystem; the translation subtracts the start offset of the partition we located.

The mapped-device case is a bit more complex. It is created by LVM tools or others, and it consists of many dm targets. The dm targets can themselves be used in many ways, such as linear targets, RAID, stripe, and so on. So, before introducing the translation method, we first need to introduce a reverse mapping for each kind of dm target; this reverse mapping is the reverse of the existing map interface of dm targets. With its help, the reverse mapping for a mapped device becomes achievable: we iterate over all the dm targets in a mapped device to find which target contains the broken page, then subtract its start offset. After that translation, we need to handle the layers inside the mapped device, and here is an interesting thing: the inner layer could itself be a filesystem, or even another partition. But thanks to the holder registration mechanism, the inner layer is also a holder, so the reverse mapping can simply go on.

Finally, it comes to the filesystem. The reverse mapping from the filesystem to files requires an rmap feature: given an offset and a length, we are able to search for the extents containing it. Fortunately, XFS provides this query API. The result can be file data or filesystem metadata. In the former case, we need to send signals to the processes using that file and try to recover the file data; in the latter case, it is hard to recover the filesystem online, so we just shut down the filesystem and report the error.

Now that it is possible to find all the files with the one-to-N reverse mapping, the original page-based process collection and killing functions have to be modified to be file-based.
As we saw before, the ext4 filesystem supports neither reflink nor rmapbt, so we need to keep compatibility for filesystems like it. First, we keep the original one-page-to-one-file reverse mapping. That relationship is created by associating the page's mapping and offset with a file's mapping and offset; it should be associated only once, and an error is reported if something tries to associate it more than once. To make this compatible with the one-to-N reverse mapping, we add a restriction to avoid that error: the association is made only once, the first time. Second, we keep the original reverse-mapping routine. With the support of the first compatibility measure, the original page-based reverse mapping still works, so we keep it and fall back to it if the one-to-N reverse-mapping routine returns an operation-not-supported error.

So, the one-to-N reverse mapping has been implemented: it is compatible with all NVDIMM modes, compatible with every usage of pmem, and compatible with all filesystems. We still have some future work to do, such as fixing the race condition against unbind; with the help of the one-to-N reverse mapping, that can be fixed, and we hope this code can help such cases. Thanks.

We talked about these topics today. The community has made many enhancements for NVDIMM on Linux, and we have been working to remove the experimental status of filesystem DAX. We hope it will be achieved as soon as possible. Thank you very much for listening.