Morning everyone. I think we're right on time, so we can start. My name is Maciej Maciejewski; my friend Krzysztof, sitting next to me, will be hosting a different session later today, so let me drive this one. With persistent memory, I will try to explain what it is actually that we're talking about, how we can see it in Linux (or other operating systems as well), and what it is good for, so how we can actually use it. Then, the nicest part: I will be showing the outcomes of using it, what kind of benefits your applications can expect. Not the Linux kernel itself yet, because we haven't started to modify the kernel; it's probably a years-long effort to actually make it work well on persistent memory. But we are already doing some work around applications, converting them to see: is it good, is it bad, what kind of benefits can we possibly get?

So, persistent memory. We all know how a DIMM looks, right? A DIMM form factor, a small piece that goes into the motherboard. Normally it hosts DRAM memory. In this case we are talking about DRAM-like chips that can also retain data across power cycles. So if your machine goes down, crashes, whatever happens, you take the DIMM out, a non-volatile DIMM, and it still contains all your data; it doesn't lose anything. Hardware like this has actually been on the market for some time. The thing with it is that it's mostly based on regular DRAM plus a battery plus flash, with some logic to dump the data, and therefore it's pricier than regular DRAM, and the number of use cases is very limited because of that: only some really specialized software or use cases can utilize it. The reason we are taking another look at it is the announcement we made as Intel last year, and this is my last marketing slide, first and last: with a new type of materials that can give you latencies close to DRAM, and capacity much higher than DRAM, it fits perfectly into this idea of a non-volatile DIMM.
When we are talking about capacity out of one socket, one CPU: normally with DRAM we can get like half a terabyte, maybe one terabyte, which is very expensive. With the kind of product we will be releasing, we can get up to three terabytes of non-volatile memory, with endurance similar to DRAM, so we don't worry that it wears out like regular flash drives, and with similar latencies, so we can use it pretty much as DRAM. Which is kind of interesting.

How have we approached this problem? On one hand we have hardware that is evolving, right, the materials and so on. On the other, it's software. We need software to utilize the hardware, we need software to make programming easier for everyone and to actually make the product a success. So a couple of years ago SNIA opened a working group around non-volatile memory programming, and they came up with the programming model, which is standardized by SNIA, and this is the base of all of our implementation. The programming model is on the one hand complicated, on the other hand pretty straightforward, but it gives you a couple of options. Basically, we have the hardware, we have the driver on top of it, and the driver can expose the device as a raw-access character device; it can expose it as a block device, so you can put a regular file system on top of it and use it as a disk. But we also have a path here, which I will talk about a bit later, that gives your application direct mapping of the addresses. So when you have the system physical addresses here, we put on top of a regular file system the DAX (direct access) extensions, so that when a file over here gets memory-mapped, we get a direct mapping from the virtual address space of your process, through the page tables, to the system physical addresses on the NVDIMM.
We don't have a page cache, so it allows for direct load/store operations from your application, and that's actually the fastest way to go to the NVDIMM. It omits all the kernel space; it's a pure user-space operation, right? We don't touch the kernel at all, we don't touch the file system. So basically the entire maintenance of the data here, and all the bookkeeping around it, needs to be handled by your application, which is quite complicated. We'll get back to this model again and again, but these are the basics of what we are talking about.

So how does this hardware look to the operating system? At first we have the ACPI NFIT tables, which are standardized, and from what is exposed to the kernel we can already see some of the capabilities. We have the basic system physical address ranges exposed; we additionally have information about interleave structures, because we have interleaving, right, this is a regular memory controller, it has DIMMs, it can interleave the ranges. There are also some block-window control regions that allow you to use it as block storage, to provide atomicity at the block level. And how does it look in Linux? When you launch your machine, launch the kernel, you can see it in dmesg. This is the memory map, E820, and there are regions here that are presented as non-volatile DIMMs. So this is a persistent type of memory (the type may differ), but we have regular memory ranges, with the exception that those are actually reserved. It's not marked here, but those are not directly usable by your kernel or applications; they are not visible. They are visible to the driver and exposed by the driver. So the driver needs to do something with them, and what it does is expose them as a device, right, a block device mostly. And when we have a block device like this, what can we do with it? Looking at it as a huge capacity of DRAM-like memory plugged in, we can use it as memory.
We might not care about the persistency of the data. We might have some temporary big sparse matrices that we want to just put there, calculate something on, and then dump; we don't care about them afterwards. So the first usage model is just to use it as memory, and for this we actually have a couple of approaches. We actually couldn't do it in the straightforward way that, with the E820 memory map table, you could imagine: okay, let's not mark it as reserved but as regular memory, so applications can see it and allocate within that space. The problem is that it differs in its parameters. You want some differentiation between the somewhat faster DRAM, this memory, maybe high-bandwidth memory, and you would lose that. You could think about hiding it behind NUMA nodes, but that's a hacky thing to do in this context. So what we have done is based on the SNIA programming model, which is file based. As a user you have a regular file system, you create a file, you memory-map it, and then you do whatever you want within the address range of that file. In this case we are only allocating objects: when the program terminates we kill the file, and when the machine crashes, the next time the program starts it kills the file. There are a couple of approaches here as well; we have some libraries written for it. The first one, libvmmalloc, transparently replaces all your malloc calls with the ones that we provide, and all your allocations go to persistent memory, which is a bit slower than working on DRAM, but maybe this is what you want.
It's the easiest way to deploy, actually, because it requires no changes to the application. A bit more complicated is when you want to actually tune your application and have some of the data on DRAM and some on persistent memory, which can give you some advantages, but you need to change your application, of course, and you then have separate APIs to allocate the memory. Another one, very similar to the previous, is the memkind library, which is also open source (all of this is open source, of course). This one actually comes from the high-performance computing world, where the guys have high-bandwidth memory on their Xeon Phi machines, on-die memory, and they want to differentiate: okay, for some allocations we want this super high-bandwidth memory, for some regular DRAM, for some non-volatile memory, and their applications are actually tuned to use it like this. So you specify the kind of memory and the size, and memkind takes care of the rest.

So this is the volatile usage. The next one would be the easy way: block storage. You've seen that in the ACPI NFIT there are some structures that allow hardware to handle block atomicity. You can also do it on the software side, in the driver, and within our library we also have support for block storage. It has one huge advantage: it doesn't require your application to change at all, right? It operates on files; it's still the same files used in the same manner, and we guarantee the same consistency of the data within the files. The drawback of block storage is this.
This is the storage stack diagram for Linux, right? I won't go through it, because it's not my domain, but when you're working with files, each time you want to change or update something you need to go through this stack, which costs you time, and which also costs you CPU cycles. It didn't matter that much so far, because we were working on relatively slow media; SSDs are most popular now, but they are still slow compared to the NVDIMM. So it's usable, but not really the point of interest, to me at least.

The most interesting thing is to actually use all that the medium gives us, which is byte-addressable persistency. We have this DIMM hardware, which is byte addressable, right? This is very important. It's not split into blocks; it's byte addressable, and it's persistent. To actually use it, there are a couple of things that we need to remember, that we need to address. To make the data persistent, we need to have the data go all the way to the medium. Normally when we operate on some data, it stays in the registers of the CPU, in the caches, and it goes to the underlying storage or RAM when the time comes; the CPU basically manages the cache rotation on its own. In this case, when we update something, we want it to be safe already, and to do that we need some sort of cache flush instruction that we invoke; those on the slide are the examples that we can use. The caches of the processor are one thing; the second one is the memory controller itself. It has its own buffers.
We need to make sure the memory controller buffers are clear of the data as well. Only when we are within the DIMM, the NVDIMM power-failure-safe domain, are we good to go, right? Then we can be sure that the next time we restart the system after a crash, or whatever, the data will be there. So this is a fairly easy problem to solve.

The next one is atomicity. With regular programming, with flushes to the drives and memory-mapped files, we basically copy some data, then we invoke msync or fdatasync or something like this, and the data is persistent on the drive. In our libraries, which I will talk about a bit later, we have replaced the call to msync with our own, which does this cache flushing in the proper manner, in the proper order, with proper fencing, and with an example like this it will work; we're safe. But what happens when we have a complex structure, like the "hello world" here, and while we are flushing the caches we have a power failure? What will be the outcome? All ideas are good, right? We don't know, because the ordering of the instructions can change, and with a power failure that can happen before any cache flush, after all of them, or after a partial flush, we have no clue what the outcome will be. And that outcome is already stored; the next time our application starts up, it will expect that this data is valid. We have no way to determine that the data is invalid. We would need to store additional tons of metadata to check whether the data has been written or not, which would be complex and not easy. So this is the second challenge, and this is only for a singular object; we will also be talking a lot about transactions, how to give software developers, the programmers, tools for transaction support on the NVDIMM, and my friend Tomek will talk later about concepts around that.

The third challenge that we have is around positioning of the data. So we have files, right? We open the application.
We give it a file handle, and then we let the application decide where to put its objects in the file. It's a lot of work, actually, because this is a malloc-like operation: when we memory-map a file to get this direct path, we only have the entry point, the entry address, and we can use an sbrk-like call to navigate through it. We don't have a regular malloc function, and this is something that we need to provide in some sort of library, because I cannot see all application developers writing their own mallocs. Then there is some handling around the files, joining those files: the application grows, you want to allocate more data, okay, what now, do we give it a second file we created? We maybe want a continuous address space within the application, right? We need to handle this. And every pointer that we allocate, every result of an allocation API call, is actually a persistent pointer. When you run your application on DRAM, all you care about is: you do a malloc call, you get an address, and the next time you run your application this address will be different. Not always, but pretty certainly. In this case it's the contrary: you want these addresses to be the same, because the data is already there; it cannot change, right? And this persistent pointer gives us problems like fragmentation. We are allocating within this file all the time, we are freeing resources, and we cannot defragment the data; it needs to stay where it was originally. So there are a lot of challenges in actually making this work for applications in a good way, and those challenges are addressed by the libraries that we are open sourcing. They are actually already production quality.
It's included in some of the distributions, like SUSE and Fedora, and others, I believe. And as far as I know, this is the only incarnation of a library like this which is production quality. There are some other approaches to persistent memory programming, other incarnations of this persistent memory programming model, but this one is production quality already, and fully open source on GitHub; you're welcome to take a look. It's actually not a single library, it's a set of libraries. We have support for the basic persistency handling, like the flushing in the proper order, and detecting whether we are working on non-volatile memory or whether the file that was given as a starting point is maybe on a regular SSD. We can work with that as well; we will just invoke msync, it's not a problem. The most important, base one is libpmemobj, which gives you a transactional object store; for me, as an application architect, that's the most important thing. There is some management of the pools with libpmempool, and basic pool replication functionality that we are still working on. It's pretty well documented; we have plenty of blog posts that describe the basic functionality.

Now let's get to the crux, the applications, because this is something that I'm working on. How do we want to modify our applications? We can take a look at the allocation. Our data, whether it's temporary data or our storage within the application: where do we put it? Do we know where to put it? Sometimes we know, sometimes not; it's a decision where, as application architects, we need to understand how it works. Where is actually the bottleneck of the application, where do we want the data
to be processed, faster or slower? Does all of the data need to be persistent? Surely not, right? But we can still put it on persistent memory, because it's huge, and maybe we want our temporary data, which is huge, to still be faster than being paged out through swap. And when to guarantee persistence is also a big question: with every store, with every update of an object, do we need to flush everything or not? It needs some thinking.

But this was the easy part, the storage. Then we have the application engine itself, and the question whether we want our application to resume after a failure. It might be an application failure, it might be a system crash, a kernel panic, it might be a power failure. Do we actually want to resume it? It depends. This is actually a good use case for high-performance computing, where we have big chunks of data that require a couple of hours of computation; there it could be quite beneficial to introduce some resume operation. They are doing it today through checkpointing, which takes a lot of time; if we put all this data on persistent memory, then this time is cut off. Let's take a look at how complicated it is. So this is some pseudo-code; don't look at whether it would actually run or not. We have some iterative calculation that runs over some number of iterations, and when we want to shift it to persistent memory, okay,
let's move it all. The first thing we notice is that this iterator cannot be zeroed in the loop; we need to zero it at the first run of the application, but only the first, right? Because the next time we run it, it might be after a crash, and we need to detect whether it was running before or not, and zero it accordingly or not. Then it gets simple: you use our transactional macros that give you transactional allocations, you put the singular iteration in the transactional part of the code, and then, after everything is finished, you zero your iterator. Who thinks this will work? Can you raise your hands? Will this work, or does someone maybe see a flaw here? Yeah, yeah, exactly, very nice. So the flaw is here: the increment of the iterator is not persistent, so when we have a crash over here in the code, more or less, then the results are already updated for this iteration, but the iterator is not, so we calculate the same iteration a second time. And if we instead put the close of the transaction all the way over here, it doesn't make sense anymore, because then we don't have resume at all. So it actually requires a lot of architectural thinking, even on small pieces of code, to get the best out of persistent memory.

And I have some examples of what we have done with it, and what kind of benefits we can actually expect. The first thing we looked at is an in-memory database, a key-value store: Redis.
I think everyone knows Redis; it's more or less quite simple, and we thought that, okay, since we have this large memory, the most compelling use case is to have an in-memory database use persistent memory. One note here: this is all on some developer machine, not real hardware, because we haven't announced the performance results from the real hardware yet. So all the values here are relative, just to show the benefits. And this Redis is actually open source as well, on our GitHub; we actually have two versions of Redis with our changes in them.

The obvious advantage is startup time. When you put all the data in persistent memory and work with it directly, we don't have any memory copy, we don't have any loading from disk to memory; the startup is instant. And we can see that it's actually not, right? It's one second. Let me shortly explain the other modes, RDB and AOF. Redis is an in-memory database, but it also has persistency modes: RDB and AOF. RDB is a snapshot, which at the highest frequency is taken once a second, and AOF is the append-only file, which is journaling, configured here to sync on every update; PM is persistent memory. So with the native persistency modes of Redis, where we are loading all the data from an NVMe drive in this case (we were working with the fastest NVMe on the market), we still get, for this relatively small data set, something like 20 to 30 seconds, and we heard yesterday, I think, that it takes even hours for large data sets. And with persistent memory we are thinking of large data sets, like a couple of terabytes, right? So it's quite compelling to bring this down to zero. Why don't we have zero?
It's our choice. This is an architectural choice, to leave the dictionary in RAM, and this one second is due to rebuilding this dictionary, putting it in RAM from the persistent memory data. We could also put the dictionary in persistent memory and have it really be zero, but this is our choice: if you shift it to persistent memory, the Redis operation overall would be a bit slower, but the startup time is zero. So this is your choice.

The other straightforward one is DRAM usage: we are not using it at all. On all these slides, on the horizontal axis you have the different object sizes, but we have the same number of objects, half a million, with varying sizes. So we have like six, seven gigabytes of RAM allocated for the data plus metadata in DRAM with the original Redis, and with persistent memory all of it is actually on persistent memory. Only the dictionary is in RAM, and it's not visible here because it's like 10 megabytes. So you can basically cut the DRAM from your machine, more or less.

Now let's look at the ones that are more useful, but not that easy to see. What happens when you're operating Redis, or any other application, and your data size exceeds the physical size of the RAM in your machine? You're hitting swap (unless you configure it not to, but normally you're hitting swap), and that basically kills Redis; the performance drops close to zero with the original version of Redis. When you're actually doing the snapshotting, you die at half of the RAM, because snapshotting is a fork with copy-on-write; you need double the memory. And these are all update operations, so all of this data comes from updates; it's not a mixed get/set workload.
It's the worst-case scenario, all updates, so you die sooner. With journaling it's a bit better, but still not that good. The problem behind it, when you're running the application in production as a devop, is that it's very hard to detect. It's very hard to detect how much more data I can actually allocate on my Redis instance before it dies. You need to literally turn off swap and look for the errors from the allocator, which will probably kill your Redis anyway. With persistent memory we don't have that; we don't use RAM, so we don't have this problem. When we run out of persistent memory, the allocator from the library will tell you that it couldn't allocate or change your new object, and then you're pretty much safe; you can handle it much more easily. The drop you can see here, don't worry about it; these are old results. This chart is actually quite complicated to generate, it involves a lot of hand work, and we made it back in December, when we were still on a beta version of NVML. Right now we have fixed some allocator bugs and the line is more flat here, so similar performance.

Now to the performance, because so far you have seen only a snapshot of it. The performance is best seen in the IOPS, the updates per second that we can actually get out of Redis, and we will compare only two modes: one is the persistent-memory operation, and the second is the journaling, because when you configure Redis to work with snapshots, the highest frequency of snapshots is one second, which means that when a power failure comes you always lose some data. There's no way around it.
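For reference, the two native Redis persistency modes being compared map to redis.conf settings roughly like this (the values are the ones implied in the talk: snapshotting at its highest frequency, and the append-only file syncing after every write):

```conf
# RDB snapshotting: dump after 1 second if at least 1 key changed
save 1 1

# AOF journaling: append every write and fsync after each one
appendonly yes
appendfsync always
```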
You lose like 100,000 of your data updates on this machine, so it's not really comparable to persistent memory, because persistent memory gives you full consistency of the data. That is also what the append-only-file journaling gives you, with the option set to sync after each update: the file gets updated, fsync gets called. And the persistent memory performance is actually pretty comparable, as you can see; it's more or less the same, we could say, at least on this emulated environment. So is it better in this case? How to say, right? Probably yes, because you have the larger capacity of persistent memory, while for Redis you still need a large chunk of RAM. But what happens when we actually stuff the machine further? The 1x, the first two bars, means that there is one process of Redis receiving updates. What happens when we give the same socket, the same CPU, more data? So we run four instances of Redis on the same socket, and make each one of them receive the same amount of updates, as much as it can consume. With persistent memory we're flat with the performance, and this is performance per instance; with the journaling you can see a big drop. And why is that? Because with journaling you need to go through the file system, right?
You have fsync, and each update goes through the kernel and goes to the disk. Going further, we can see that we drop off on this machine at around six to eight instances, because this is an eight-core machine. The last measurement we took was with ten instances, and there we are around 15 times faster in the overall operation of a single Redis instance. Which means that basically, when you run on persistent memory, you save a lot of CPU time, maybe for other databases, maybe for more complicated ones, or for some other applications or VMs that you can run on the same machine. The same chart, if we represent it as total data throughput, so we multiply those ten instances by the number of operations and by the data size: what we can see easily is that with this NVMe drive operating under Redis on this machine, we hit a cutoff at around here, less than 200 megabytes per second of real data updates in the key-value store. This is not fio numbers, pure data transfer; this is real logic behind the database. And with persistent memory we don't have this cutoff. It's probably somewhere up there, at the bandwidth of the medium, but we are not using the CPU that much, so it doesn't limit us; this here is actually a CPU cutoff. That gives you a lot more headroom for your applications.

Maybe the second example; we still have some time, so we can go through it. This is the HPC world again. This is a math library, which is also open source and widely used for certain computations. What we have taken is an operation on a sparse matrix. It's quite interesting, because a sparse matrix is a huge amount of data; it's then compressed, so overall it doesn't take that much space, but it can, and it's used everywhere: finite element calculations, all of engineering, mechanical engineering, climate; it's used everywhere.
So we changed this algorithm to work with persistent memory, and what can we see? We weren't expecting much, because, okay, we are changing the load time to zero, which is visible here. But if you look at it, the load time takes only 1.6 seconds, and we are changing that to zero, or close to zero. For the calculations we were expecting: okay, we are working out of persistent memory, we will need a bit more time. But it's not that bad, actually; most of the time is spent on the CPU calculations within the caches, so the computations are only around two percent slower. Overall we do not notice a large drawback from using persistent memory, but we don't need any DRAM, which is a good thing, because we can process larger data matrices on a single node. The problem behind this is that it's a lot of work to shift the algorithm to persistent memory.

The second example we have gives you a bit more on the architectural decisions. This is a solver for sparse matrices; the previous one was simple multiplication, now we have a solver. The original implementation, in C++, takes less than 50 seconds and consumes half a gig of DRAM. When we decided to change it gradually, okay, let's first shift the data to persistent memory and leave the temporary data of the computations in DRAM, we get this small drawback; it's still less than 50 seconds, but the DRAM occupation dropped in half. It shows us that our input/output data is actually half of the DRAM, with the other half used for the temporary stores. The third approach is to shift everything, to go deep into the algorithm and have all the allocations made from persistent memory, and here you can notice that it gives you some performance drawback.
It's like 50% slower than operating on RAM or in the mixed mode, but then your DRAM requirement has dropped to zero, or close to zero. So this is what it comes down to: programming persistent memory involves a lot of architectural decisions, a lot of rethinking of how your applications work. Will it actually have some benefits? For sure it will, but is it realistic to make such vast changes, or should you maybe start slower? You can start slow, right, shift some of the data. There is no universal recipe for this; you can approach it from different directions, and there is a lot of work for the architects, not only the software developers, but the people that understand all the operations of the applications and what the market requirement behind them actually is. For the developers there's a huge amount of coding, but it's fun, right? And we have plenty of topics that still need to be addressed, that are being addressed right now, work in progress, because persistent memory in an easily and widely usable form is a new thing. It will require plenty of time for adoption; the programming paradigms have not changed in decades, I can say. All we have experienced from the market is faster drives, faster memory, faster CPUs; maybe we had a big change with the number of cores growing, but on the storage side, not really. And this involves a lot of changes for the applications, and rethinking them, which is a nice thing, and we are hoping that this will actually revolutionize the computing that we have right now. Thank you, that's it from us. So we have time for a couple of questions. Yes? There's a mic, if you can use it.

"Thanks. I want to know, if we want to use this persistent memory on other architectures, like ARM or PPC, what kind of hardware changes are needed?"

So, can we use it on different architectures, not only x86? As an Intel company, right,
we are focused on x86, but others are surely possible. What we have done in the instruction set architecture to support this is essentially optimized cache flushing. So it depends on how PowerPC or ARM deals with the caches, and how it allows the user to flush all the caches to the medium; that's one thing you need to think about. The second one is the memory controller. We actually had a separate instruction to flush the buffers of the memory controller, which was deprecated before it hit the market, hit real hardware, because we decided, okay, we also have a motherboard feature to deal with this challenge. So it's only those two things, right? It's not much.

"Is there any research on usage by other architectures?"

Is there any research, or some preparation in the works? I haven't actually dug deep to see another architecture using it. One thing we can see is that we have Google Groups behind our library, the one that gives you this programming model, and there are questions. Just yesterday or today we had a question: okay, I tried to run it on PowerPC and it doesn't work, because it doesn't have those CPU instructions; can you assist? So those questions are starting to pop up. I think there was also something from ARM a couple of months ago, but I'm not sure. So people are actually looking into it. Okay, last question.

"Currently, I guess some CPUs can use this persistent memory, but some may not. Can you give me a list of what types of CPU currently support persistent memory?"

So, for the matrix of which CPUs can support persistent memory, you would need to go to the marketing guys, right? They have all the matrices. For the current persistent memory that, I think, is on the market,
this NVDIMM type F or type N, most CPUs should be able to use it, whichever support the ADR platform feature. For the ones that are based on 3D XPoint, we haven't announced it yet, so it's not publicly disclosed. "Okay, thanks." Thank you. Any other questions? Ah, yes.

"Would it be possible, through the memory controller, to declare the RAM as non-cached, and in that case would we not have to do the flush?"

Yes, it's possible to not use the cache at all, and in some cases we are doing it; we are, for example, pushing all the data through non-temporal stores, and there are AVX instructions to make it faster. But normally, when the CPU computes on the data, it always uses the cache, and in the current x86 architecture I don't think declaring it non-cached is possible; maybe on others. I think that's it. Thank you.