Welcome to this presentation about persistent memory in Kubernetes. My name is Patrick Ohly, I'm a software developer at Intel, working on Kubernetes and storage-related technologies. Most recently I've been leading a project called PMEM-CSI, which is a storage driver that makes persistent memory available in Kubernetes. Usually I give presentations in front of an actual audience, so doing it in front of a computer now is a bit lonely. Therefore I brought my faithful companion, the rubber duck. She is very good at debugging, and who knows, perhaps she will even have questions about this presentation. So let's get started. Before we get to PMEM-CSI, let's talk about persistent memory in general. As the name implies, persistent memory stores data persistently, in contrast to DRAM, where data is lost when you power down the machine. It's also addressable like memory: you can do reads and writes with your CPU and immediately get at the actual data, in contrast to SSDs, where you first need to load an entire sector into DRAM before you can do anything with the data. In terms of performance and capacity, it also sits in the middle between DRAM and SSDs. It has read performance that's almost as good as DRAM, write performance that is a little lower, and capacity that is closer to actual storage, much higher than DRAM. Persistent memory is a standard: an industry consortium defines how to use persistent memory and how to attach it to a system. But it's also an actual product from Intel. In 2019, Intel introduced a product called Intel Optane Persistent Memory. It's a DIMM that you put into the same slots as DRAM, and it can then be configured in different ways, as we will see on the following slides. In 2020, there was a refresh, and the enhanced variant has higher bandwidth.
A key point to remember for this presentation is that you can build machines, individual boxes in your server rack, that have much higher memory capacity than systems built with DRAM alone. It's possible to build machines with up to 6 terabytes of PMEM in a two-socket system. Together with the byte-addressable, memory-like characteristics, that makes it an ideal solution for applications like in-memory databases or caching systems like memcached or Redis, where traditionally you would have used DRAM but were limited by the amount of DRAM that you can put into an individual machine. So how do you use PMEM in a system? There are different modes, set up by the administrator. The first mode is called memory mode. That's the one that's completely transparent to the application and to the operating system. It's configured in the BIOS. In this mode, the total memory capacity of the system is determined by the PMEM, and the DRAM just acts as a cache; it's not addressable separately. You do lose some capacity that way, but you can use it with legacy systems that don't know about PMEM at all, because the software itself doesn't need to know about it; it's all handled in hardware. For me as a software developer, the more interesting mode is the so-called app-direct mode. Here, PMEM and DRAM are exposed to the operating system separately, and the operating system and applications get to decide what they do with the PMEM. The operating system typically makes it available as a file system; that's how persistency is handled, via the usual file operations. But applications can then take a large file and memory-map it into their address space, and that address space is byte-addressable. That way they get the full advantage of PMEM, and with special CPU instructions they can ensure that individual cache lines are flushed, so the data is persistent and available even after a sudden power loss.
Applications can use that characteristic to implement things like a warm cache, or to speed up restarting the application after rebooting the machine. One application that does that is memcached. Traditionally, it has been used for in-memory caching, and that leads to a problem: if you need to update memcached and restart it, it starts up with an empty cache, and performance is worse while it repopulates that cache. With its data in PMEM, it immediately has access to the same data that it had before the restart. Finally, you can also do traditional file I/O over a file system that is mounted on an app-direct namespace. On Linux, XFS and ext4 have been specifically enhanced to support PMEM. They handle some operations such that data is stored directly in PMEM, completely bypassing the Linux kernel page cache, so you get very high-performance I/O. The downside is that data movement is done by the CPU core, so you don't get DMA. This is perhaps not the best way of using PMEM, but it's available if you want it. Some resources that you may want to know about: from Intel, there is a tool that configures the individual DIMMs, called ipmctl. One level higher, and vendor-independent, we have ndctl. That deals with the concepts defined by the NVDIMM standards, the so-called regions and namespaces. Regions can be created on a single DIMM or combine multiple DIMMs in interleaved mode, and then regions get split up into individual namespaces, perhaps for different purposes and different applications. These namespaces are handled by the Linux kernel and exposed as a /dev device where you mount your file system. For developers who want to write app-direct-enabled applications, there is pmem.io, a website that collects information and offers different tools that simplify the task of a software developer.
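As a rough sketch of the flow just described, assuming a machine with real or emulated NVDIMMs and an existing region called region0 (device names and mount points will differ on your system, and the commands need root privileges, so treat this as an illustration rather than a recipe):

```shell
# Turn part of region0 into a namespace in fsdax mode.
ndctl create-namespace --region=region0 --mode=fsdax
# The namespace shows up as a block device, e.g. /dev/pmem0.
mkfs.ext4 /dev/pmem0                # XFS works as well
mount -o dax /dev/pmem0 /mnt/pmem   # -o dax bypasses the page cache
```

Files on /mnt/pmem can then be memory-mapped by applications for direct, byte-addressable access.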
Persistency in particular depends on doing certain things correctly, like flushing data when needed, and that can be hard to program manually. The Persistent Memory Development Kit (PMDK) helps with that by implementing higher-level data structures that work efficiently on PMEM. And for applications like memcached that need to place data intelligently, there's libmemkind, a malloc replacement that can give you memory chunks either in PMEM or in DRAM, depending on what you want. Oh, the rubber duck is getting a bit impatient. She wants to know how all of that works in Kubernetes. So let's get to that part. For memory mode, you don't really need to do anything on the software side. You just run Kubernetes, and you have access to PMEM, with DRAM as a cache for it. But for app-direct mode, you need some software that actually manages your PMEM. That's where PMEM-CSI comes in. It is a Container Storage Interface (CSI) driver that manages your local storage. We have two different ways of creating volumes dynamically. One works at the level of ndctl, using a library that comes from the same source code. Here, each volume is a separate namespace. The advantage is that, in theory, you can use the PMEM for something else in parallel. But this mode suffers from fragmentation issues: if you create multiple namespaces and then have one namespace left in the middle of your region, the remaining space before and after it can only be used separately; you can't create one large namespace that combines both parts. With LVM, that isn't a problem. We first allocate a certain amount of PMEM and create a volume group, and then that volume group can be split up arbitrarily into logical volumes. The downside is that you need to set aside PMEM permanently for use by PMEM-CSI. That may be okay, and it's actually the recommended mode at the moment.
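Once PMEM-CSI is deployed, applications request volumes the usual Kubernetes way, through a storage class that points at the driver. A minimal sketch, assuming the driver is registered under its usual name pmem-csi.intel.com; the class name here is made up, and the PMEM-CSI repository ships ready-made examples that you should prefer:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pmem-csi-example-sc        # hypothetical name
provisioner: pmem-csi.intel.com    # the PMEM-CSI driver
volumeBindingMode: WaitForFirstConsumer
parameters:
  csi.storage.k8s.io/fstype: ext4  # or xfs
```

A PersistentVolumeClaim that references this class then triggers dynamic provisioning on a node with free PMEM.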
Then, once the volume has been created and a pod wants to use it, PMEM-CSI also makes sure that there is a file system on it. We support ext4 and XFS, mounted with the suitable options that enable PMEM support. Or we can provide a volume as a raw block device, if an application wants to do that part itself. This slide gives an overview of the different PMEM-CSI releases. The exact timeline isn't that important; I'd rather use the opportunity to talk about the features that we've added over time. Since the very beginning, since the first public release in August 2019, we've had Docker images available on Docker Hub, and together with the YAML files in our repository, it was and is possible to deploy PMEM-CSI. At the end of 2019, we added support for raw block volumes, the ones I already mentioned. Perhaps more importantly, we also added support for CSI ephemeral inline volumes. Assume for a second that your application doesn't need persistency. You are just creating pods using some higher-level controller, like a Deployment, and now you want to add PMEM to that. By specifying inline in the pod spec that you want storage, the Kubernetes controllers will automatically create and destroy that storage for you, and while the pod runs, it has PMEM. That's ideal for local, non-persistent scratch space, and it's much easier to manage: less work than for persistent volumes. Later on, we figured that YAML files perhaps aren't ideal when it comes to updating a deployment. An operator is a much more elegant way of doing the same thing, so we implemented one. It is on OperatorHub, and once you have the PMEM-CSI operator installed in the cluster, you can use it to create the actual PMEM-CSI deployment. The operator will then also take care of updating PMEM-CSI when we do new releases.
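An inline ephemeral volume of this kind might look as follows. This is a sketch based on the examples in the PMEM-CSI repository; the volume attribute names are driver-specific, so check the repository for the exact syntax supported by your release:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scratch-space-demo
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "100000"]
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch              # created when the pod starts, deleted when it stops
      csi:
        driver: pmem-csi.intel.com
        fsType: ext4
        volumeAttributes:
          size: 2Gi              # driver-specific attribute
```

The kubelet asks the driver directly for this volume; no PersistentVolumeClaim is involved.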
One problem that we ran into, with local storage in particular, is that the Kubernetes scheduler doesn't really know where storage is available. We've solved that in PMEM-CSI by implementing scheduler extensions. When a pod needs to be scheduled, Kubernetes asks PMEM-CSI which nodes are suitable; PMEM-CSI queries its own database and figures out where the pod may run, and then Kubernetes tries to start the pod there. It's not perfect. I'm working on something that hopefully will work more reliably, but for now, it's good enough. We also added support for Kata Containers. For those of you who don't know, Kata Containers is a secure environment where applications run inside a VM. That posed a challenge for mounting the PMEM file system. We've solved it such that you get full native performance, because the PMEM volume really gets memory-mapped inside the virtual machine, so the performance is the same as if you were running without Kata Containers. Finally, just released this month, we decided to call the core features of PMEM-CSI production-ready. We've made sure that upgrades and downgrades work seamlessly. We are testing, as always, on all of the currently supported Kubernetes versions, and we've also added version skew testing. As we will see on the next slide, PMEM-CSI actually has different components; during an upgrade, some of them might be from an old release and some from a new one, and all of that has been tested now and works. For the administrators among us, we also added metrics support, so you get some insight into where storage is available and how many volumes you have. Things like that can be monitored with Prometheus. So how does PMEM-CSI really work? The biggest challenge that we encountered, the conceptual issue, was that Kubernetes doesn't really know how to deal with local storage, or at least not well.
The normal dynamic provisioning of volumes assumes that you have a central component that interacts with the control plane, that works with the API server. When you create a persistent volume claim, the Kubernetes CSI sidecar, the so-called external-provisioner, will see that request, and it wants to talk to a CSI component, the controller part, and ask it to create a volume. We do have that in PMEM-CSI: we have a component that runs alongside the external-provisioner, once per cluster. That part of PMEM-CSI knows about all of the different node instances of PMEM-CSI, because those register with the central controller when they start, and when a volume needs to be created, the controller reaches out to the individual nodes. It's actually using CSI calls for that, but it's almost like a custom protocol: we find a node, create the volume, and once that's done, we report back to the external-provisioner, with topology information, that the volume now exists. The topology information ensures that Kubernetes becomes aware of the fact that the volume is only available on a certain node, and it will then make sure that the pod using that volume runs on the node where we created the volume. If all of that looks a little complicated, then you've got the right impression. It is a system that is a little fragile; we found race conditions, for example, that we still need to address. So my long-term cunning plan is to move all of that complexity into Kubernetes. In Kubernetes itself, I managed to get two new features into 1.19. They are in alpha, so not always enabled, but they are available. One takes care of storage capacity tracking. It's an API that allows a CSI driver to publish information about its capacity, and then the Kubernetes scheduler can query that information directly by talking to the API server. There's no need anymore to implement a driver-specific scheduler extension.
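In Kubernetes 1.19 this is opt-in: with the alpha feature gates enabled, a driver deployment declares in its CSIDriver object that it supports capacity tracking, and the external-provisioner then publishes CSIStorageCapacity objects that the scheduler can consult. A sketch of that opt-in, using the PMEM-CSI driver name as an example:

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: pmem-csi.intel.com
spec:
  # Alpha field in 1.19, behind the CSIStorageCapacity feature gate.
  storageCapacity: true
```

The scheduler then filters out nodes whose published capacity is too small for the requested volume.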
That's much nicer, because those scheduler extensions are unfortunately kind of hard to set up; how to do it depends a lot on how your cluster is configured. Storage capacity tracking, on the other hand, will be built into Kubernetes itself and will work the same way everywhere. The other feature was motivated by some limitations of the existing support for ephemeral volumes. The current approach is that the kubelet on a node asks a local CSI driver to create a volume, but only after a pod has already been scheduled to the node. And if it then turns out that the node doesn't have enough storage, which can happen because even with storage capacity tracking, pod scheduling might have used outdated information, the driver really can't do much, and Kubernetes itself will not recover from that situation either. The pod will basically get stuck. That limitation doesn't exist with dynamic volume provisioning, because in that case volumes get created before pod scheduling, and if volume creation on a certain node fails, we can try another node. That's the idea behind generic ephemeral volumes. They rely on the normal volume provisioning mechanism: a new controller creates volume claims automatically for a pod, and then a completely unmodified CSI driver can be used to create the volume. That's the other big advantage for CSI driver developers: they don't need to do anything special for this feature to work; it's all in Kubernetes. And because we are using the normal volume provisioning mechanism, we also get support for all of the other features that come with it, like restoring a snapshot or a clone that you took earlier, so the volume doesn't even need to be empty anymore. Last but not least, the central component of PMEM-CSI: that is what I want to eliminate, by deploying the external-provisioner on each node alongside the local PMEM-CSI part. That's currently under investigation; there is some preliminary code.
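A generic ephemeral volume is declared directly in the pod spec as a claim template. The following sketch assumes a cluster with the 1.19 GenericEphemeralVolume alpha feature gate enabled; the pod and storage class names are made up:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: generic-ephemeral-demo
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "100000"]
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch
      ephemeral:
        volumeClaimTemplate:     # a PVC is created with the pod and deleted with it
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: some-storage-class   # hypothetical
            resources:
              requests:
                storage: 2Gi
```

Because the PVC goes through normal dynamic provisioning, scheduling, capacity tracking, snapshots, and clones all work exactly as they do for persistent volumes.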
The biggest open question is how to do volume provisioning when the volume may be created on different nodes; we have to coordinate between the different external-provisioner instances. But there are some ideas, and perhaps that'll be something that actually works in practice. We'll see. And with that, I'd like to conclude my presentation. There are things you can do right away. You don't even need to buy PMEM hardware. Of course, if you can, please go ahead; Intel certainly won't mind. But if you just want to try it out, it is possible to bring up a QEMU cluster where PMEM is emulated. You'll find instructions and scripts for that in the PMEM-CSI repository. The PMEM-CSI repository also has deployment examples, deployment files for memcached, so you can even bring up a real application on that virtual cluster. With my Kubernetes SIG Storage hat on, we would also love to get feedback on the new Kubernetes alpha features, because without feedback we'll not be able to make them stable, we'll not be able to move them towards beta; that really depends on feedback from actual users. You can reach me personally via my Intel email address, or you can hop onto the Kubernetes Slack, where you'll usually find me and all of the other Kubernetes CSI developers on the CSI channel or the SIG Storage channel. And with that, the rubber duck and I would like to say thank you for listening, and I hope you have questions after the talk, or later on after watching this video. Thank you.