Hello, everyone. My name is Qingcai Wang from Alibaba Cloud. Today, my colleague Junbao and I will give a talk about the large-scale practice of persistent memory on Kubernetes in Alibaba Cloud. This is our profile; you can contact us through GitHub or email.

This is the agenda of today's topic. In the first part, I will introduce the system architecture of the PMEM stack. In the second part, we will introduce the practice of the PMEM stack in our company. The third part is the demo, and the fourth is other related work on PMEM.

So why persistent memory? Persistent memory is a high-performance, byte-addressable memory device. Residing on the memory bus alongside DRAM, it provides DRAM-like access to data. Persistent memory has four features: data persistence, high performance, large capacity, and low price. And why do we use PMEM in our company? First, cache databases need a large-capacity memory space. Second, we want to save cost.

Next, let's see why it is necessary to use PMEM in Kubernetes. PMEM is typically provided to applications as a file system volume, which can be provisioned as a persistent volume with CSI. First, containerization: more and more applications are containerized today, and many workloads deployed on Kubernetes platforms use storage resources as volumes, so using PMEM in containers is becoming more and more of a requirement. Second, automation: Kubernetes manages PMEM devices automatically with customized controllers. PMEM can be initialized according to a defined policy, and PMEM devices can be provided as namespaces and file systems automatically. Third, resource limits: PMEM can provide more than a terabyte of capacity on one node, so resource sharing between different users is necessary, and resource limits are a requirement in production. Fourth, capacity awareness: capacity on a single node is limited, so we need to report the resource capacity of each node to the scheduler. The scheduler determines the placement of a new pod based on the PMEM capacity in the entire cluster.

This is the architecture of the PMEM stack; the architecture diagram is on the right. At the bottom of the graph is the node resource manager, which is responsible for automatically initializing the PMEM device. It also formats the device and mounts it as a file system, which can be accessed directly by applications. The node resource manager is deployed as a DaemonSet and works on each node in the cluster. It also reports the node resource capacity, which can be used by the scheduler.

Above it is the CSI layer. As we mentioned, PMEM can be used as storage by applications, so we can manage it with a CSI plugin in Kubernetes. PMEM is used as a persistent volume and can be managed by the Kubernetes volume system. Through the CSI components, PMEM is formatted as an ext4 file system with the project quota feature enabled. Volume provisioning, volume mounting, volume capacity resizing, and volume monitoring are implemented as CSI features.

Above the CSI layer, we provide a scheduler plugin and an auto-resizer component. With the auto-resizer, a PMEM volume can be extended according to its capacity usage, and with the scheduler extender, we realize capacity awareness of PMEM resources. At the top of the graph, there are some CRDs and ConfigMaps, which are used to record capacity information and define the PMEM resource topology.
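To make the topology definition more concrete, here is a minimal sketch of what such a ConfigMap could look like. The field names (name, key, operator, value, topology) follow the description in this talk, but the exact schema, ConfigMap name, label values, and device paths are my assumptions for illustration, not the authoritative format of the node resource manager.

```yaml
# Hypothetical sketch of a node resource topology ConfigMap.
# Field names follow the talk's description; the real
# node-resource-manager schema may differ.
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-resource-topo        # assumed name
  namespace: kube-system
data:
  quotapath: |-
    quotapath:
    - name: /mnt/path1            # directory where the file system is mounted
      key: kubernetes.io/hostname # node label key used to match target nodes
      operator: In
      value: pmem-node-1          # placeholder node name
      topology:
        type: device              # resource type
        fstype: ext4              # format option
        options: prjquota,dax     # enable project quota and DAX
        devices:
        - /dev/pmem0              # PMEM namespace device on this node
```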
Next, let's look at the detailed implementation of the node resource manager.

The node resource manager runs on every node of the cluster and watches the ConfigMap with the defined PMEM resource topology. If the ConfigMap contains a PMEM definition for this node, the node resource manager checks the PMEM device and guarantees that the target namespace is generated. The node resource manager treats the device as storage and formats it with a file system. Applications use PMEM through a project-quota file system, so the node resource manager enables the quota feature on the ext4 file system and also enables the DAX feature.

The node resource manager records the node resource capacity in a NodeResource CRD; a NodeResource CR is created for every node in the cluster. If there is more than one project-quota file system on one node, they are recorded in the same NodeResource CR.

The node resource topology defines the PMEM resource planning on each node. There is an example of the definition: "name" means the file system directory on the node, and the key, operator, and value are used to match the target nodes. If a node matches the key and value, it follows the defined PMEM policy. The topology defines the resource planning: the resource type, the format options, and the PMEM regions. With the node resource manager, the PMEM stack implements automatic resource initialization and resource capacity reporting.

Next, let's introduce the PMEM CSI. The PMEM stack implements four CSI features for PMEM-type volumes. Unlike a distributed file system, PMEM is a local resource on the node, so the provisioning operations are executed through gRPC. Volume provisioning: the CSI controller is responsible for creating volumes; it makes a gRPC call to the remote server on the CSI agent side. Volume provisioning is only supported with lazy binding: we set the volume binding mode to WaitForFirstConsumer. Volume deletion: with a gRPC call, the CSI controller removes all the files under the subpath related to the PMEM volume. If you don't want to remove the data, you can set the reclaim policy to Retain. Volume resizing: if the volume capacity does not meet your requirements, you will want to expand the capacity. With the PMEM CSI, you just resize the PVC and the system expands the capacity automatically.

Next, let me introduce the volume auto-resize subsystem. We introduced online volume resizing in the PMEM CSI part, but online resizing alone is not enough in a production cluster: a volume's capacity should be resized automatically when the current usage reaches an emergency level. So we developed the auto-resize controller. First, let me introduce the auto-resize policy. It is designed as a CRD which defines the resize policy for PVCs. It defines which volumes a policy applies to and when a volume needs to expand its capacity; the policy supports thresholds expressed as a critical volume size or as a percentage.

Then let me introduce how we implement scheduling to use PMEM more reasonably and efficiently. Our implementation is based on the scheduling framework. The scheduling framework is a pluggable architecture for the Kubernetes scheduler: it adds a new set of plugin APIs to the existing scheduler and defines a number of extension points. At the same time, we have also implemented some strategies for big data and AI through the scheduling framework, such as gang scheduling and elastic quota (capacity scheduling). These functions are open source in the scheduler-plugins project on GitHub; welcome to try them.
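As an illustration of how plugins are wired into the scheduling framework, here is a minimal sketch of a KubeSchedulerConfiguration that enables a custom plugin at several extension points. The plugin name `PmemCapacity` is a placeholder of mine, not the actual plugin name from our project, and the API version depends on your Kubernetes release.

```yaml
# Minimal sketch: enabling a hypothetical PMEM plugin at the
# Filter/Score/Reserve/PreBind extension points of the scheduling framework.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    filter:
      enabled:
      - name: PmemCapacity   # placeholder plugin name
    score:
      enabled:
      - name: PmemCapacity
    reserve:
      enabled:
      - name: PmemCapacity
    preBind:
      enabled:
      - name: PmemCapacity
```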
Then let me introduce the plugins for capacity scheduling. There are two selection policies in capacity scheduling. First, we allocate a pod's request to a single region based on capacity. Second, when we choose the region, we use a bin-packing strategy to give priority to the region with the least remaining amount, so that we can still satisfy pods with large resource requests in the future.

For capacity scheduling, we implement the following four extension points. Filter: filter out the nodes that can't meet the PMEM requirement. Score: select the optimal PMEM scheduling result by algorithm. Reserve: reserve the optimal PMEM scheduling result to prevent reallocation to other pods; if a failure occurs in the binding cycle, it cleans up the PMEM scheduling result. PreBind: update the PMEM scheduling result to the annotation of the PVC.

Next, we will introduce NUMA-aware scheduling. Because CPUs located in different NUMA nodes have different latencies when accessing the PMEM of different sockets, we apply the following scheduling policies based on NUMA. First, try to allocate CPUs from the same NUMA node to reduce the switching of applications across NUMA nodes. Second, select the combination of PMEM and CPU with the shortest distance.

Similar to capacity scheduling, we also implement four extension points in NUMA-aware scheduling. Filter: filter out the nodes that can't meet the requirement for PMEM or CPU. Score: select the combination of PMEM and CPU with the shortest distance. Reserve: reserve the optimal CPU scheduling result to prevent reallocation to other pods; if a failure occurs, it also cleans up the result. The fourth is PreBind: it creates the pod's cgroups request, a CRD used to store the CPU scheduling result. If you want to know more about the CPU scheduling, you can click the link below, which introduces it in more detail.

So this is all of the cloud-native PMEM stack introduction. The following content will be introduced by my colleague, Junbao. Please welcome him.

In the second part, I will introduce the practice of PMEM devices in Alibaba Cloud. Let's take a look at how Redis is used on the persistent memory stack. As the left graph shows, a persistent memory device is formatted as a DAX file system, and Redis accesses the memory device directly with mmap system calls. For the write operation, Redis writes data to DRAM and also writes the data to persistent memory, and the write operation returns once both writes succeed. For the read operation, different processing logic is applied to business data and metadata. For metadata, Redis reads from DRAM; if there is no matching data, Redis reads from the persistent memory device and caches the data in DRAM as hot data. For business data, Redis reads directly from the persistent memory device.

The right graph is the overall architecture. The bottom layer is the persistent memory bare-metal hardware resource provided by Alibaba Cloud. On top of the hardware is the Alibaba Cloud Linux OS, and the next layer is the cloud-native Kubernetes platform, which provides a number of extension plugins such as the auto-resizer, the scheduler, the CSI plugin, and the node resource manager. At the top of the system are the customized Redis engine and the Redis applications. Based on the PMEM stack, the Redis service achieves containerized deployment and automatic operations.
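As a rough illustration of what such a containerized deployment can look like, here is a minimal sketch of a StatefulSet that claims a PMEM-backed volume through a volumeClaimTemplate; the demo in the next part creates something similar. The names, image, StorageClass name, and size are placeholders of mine, not the actual manifests of the Alibaba Cloud Redis service.

```yaml
# Hypothetical sketch: a stateful workload claiming a PMEM-backed volume.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-pmem-demo           # placeholder name
spec:
  serviceName: redis-pmem-demo    # assumes a matching headless Service exists
  replicas: 1
  selector:
    matchLabels:
      app: redis-pmem-demo
  template:
    metadata:
      labels:
        app: redis-pmem-demo
    spec:
      containers:
      - name: redis
        image: redis:6            # placeholder image
        volumeMounts:
        - name: data
          mountPath: /data        # PMEM-backed path inside the container
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: pmem-quota   # placeholder for the PMEM StorageClass
      resources:
        requests:
          storage: 3Gi            # matches the 3GB volume used in the demo
```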
There are several benefits of using the persistent memory stack for the Redis service. In the old Redis system, because of the limitation of memory capacity, the Redis service couldn't meet the requirement of a large-capacity cache system, and because of the high cost of memory, the Redis service cost remained high. In the software architecture, the old system used both cache and storage resources. In the new system, persistent memory resources can provide large Redis instances, which is important in large-database service scenarios.

We ran our Redis instances online and got the following statistics. In terms of performance, compared to DRAM, persistent memory achieves 90% of the performance; in terms of cost, compared to DRAM, persistent memory achieves 70% of the cost. So persistent memory gives a good performance and cost benefit. As for the software architecture, in the old system the architecture had three layers, which contained the application, the Redis cache, and the persistent volume. In the new system, the architecture has only two layers, which contain the applications and the Redis cluster; there is no need for a persistent volume, since the Redis cluster itself provides both caching and data persistence.

The third part is the demo. First, we create a Kubernetes cluster from the Alibaba Cloud ACK service and add a PMEM node. We can see this cluster has four nodes, and one node is of the PMEM type. We can log in to the PMEM node and find the mount information for the PMEM. You can see there is no path mounted from the PMEM device; the PMEM is under the /dev path, and there are two PMEM namespace devices.

We then install the node resource manager and check that the node resource manager pod is running. OK, our pod is running. Then we apply the ConfigMap and check the ConfigMap definition. The ConfigMap defines the target node as the QuotaPath type, which means the device will be formatted as an ext4 file system with the project quota feature enabled, and it is mounted to the path /mnt/path1. The key, operator, and value select the PMEM node, which will follow this policy. Now we can check that the path is mounted from the PMEM device on the node. OK, the PMEM device is mounted at the path /mnt/path1.

In step 3, we check that the CSI plugin is installed and install the StorageClass. The CSI plugin is installed and the pod is running. Then we check the StorageClass. The StorageClass defines the volume type as QuotaPath, which is the same as the ConfigMap, and the root path as /mnt/path1, which is also the same as the ConfigMap. And we apply the StorageClass.

In the next step, we create a StatefulSet application with a PMEM volume. In the StatefulSet, you can see the StorageClass name is defined as the Alibaba Cloud PMEM quota class, which is the one we just created. Now we check the pod, which is being created. Now we check the PVC: the PVC has just been created, and the PV is also created. We log in to the pod and check the mount point: we find the target path /data is mounted from the PMEM device, and the size is 3GB, which is what we expected.

Next, we resize the PVC with the application online. We change the size from 3G to 4G, and we can watch the PVC and PV sizes. The PV has changed from 3G to 4G, and the PVC follows a little more slowly.

In the last step, we check the auto-resize controller features. We can see the auto-resize controller is installed in our cluster and the pod is running. OK, the PVC resize has finished from 3G to 4G. And we can check the auto-resize policy.
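Here is a minimal sketch of what such an auto-resize policy object could look like, based only on the fields described next (a PVC selector, a trigger condition, and expand actions). The CRD group, kind, and exact field names are my assumptions for illustration, not the real schema of the auto-resize controller.

```yaml
# Hypothetical sketch of an auto-resize policy; the real CRD schema may differ.
apiVersion: storage.alibabacloud.com/v1alpha1   # assumed group/version
kind: VolumeAutoResizePolicy                    # assumed kind
metadata:
  name: pmem-expand-policy
spec:
  pvcSelector:                 # which PVCs this policy watches
    matchLabels:
      app: nginx-quota         # placeholder label value
  conditions:                  # when to trigger the actions
  - type: FreeSize
    threshold: 2Gi             # trigger when free space drops below 2GB
  actions:                     # what to do when triggered
  - type: VolumeExpand
    step: 2Gi                  # expand by 2GB each time
    limit: 20Gi                # never grow beyond 20GB
```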
The PVC selector defines which PVCs will be watched by this policy. For example, this policy watches the PVCs which have the label key app and the value nginx-quota. The conditions define the volume conditions which can trigger the actions; it supports a free-size type and a percentage type. Here it means that if the free size is less than 2GB, it will trigger the actions. The actions define the volume expansion actions. A policy can contain multiple actions, and here it contains only one. This action defines the type as volume expansion: it will expand the volume by 2GB every time, and the maximum size is limited to 20GB.

OK, we apply the auto-resize policy. Then we copy data to the target path continually and check the capacity change. This copies the files to the target path, and we watch the volume capacity change. The capacity trend will look like this graph: as the used capacity grows, the volume is continually expanded. OK, this is the demo.

The last part is a summary of other works on persistent memory. In addition to being used as a storage device, persistent memory can also be used directly as a memory device. We work on the memory mode of persistent memory in the following two scenarios.

First, persistent memory is used through tmpfs as a local volume resource. We format the persistent memory device as the devdax type and add it as a NUMA node. Then the CSI plugin mounts the persistent memory NUMA node as a local volume via tmpfs and provides it to applications with a PVC and PV. The tmpfs volume can be dynamically provisioned. In this scenario, the persistent memory device is used as high-performance local storage.

Secondly, persistent memory can be used as a memory cache. In this scenario, we run online applications on DRAM, and if an offline application is stopped, the system can move the memory data of the offline application to persistent memory. When the offline application comes back up, the system can import the memory data from persistent memory back to DRAM immediately and start the application. In this scenario, the persistent memory device is used as memory to cache the application data and speed up application stop and start.

Last, I will introduce the future works related to persistent memory devices. We look forward to finding more business scenarios for persistent memory devices, for example, in the big data area. We also look forward to combined resource scheduling to enhance device performance and optimize the resource paths between PMEM, DRAM, and CPU. In the field of memory pooling, persistent memory can be used as a cheap memory resource, and we expect to implement dynamic memory provisioning in cloud-native environments; then we can attach memory instances while the pod is running. That's all of my topics. Thank you very much.