Hello, everyone. Welcome to KubeCon Valencia. We're here to talk about cloud native storage and the CNCF Storage TAG. We're recording this, and there are going to be some virtual questions potentially as well, and Lisa Marine-Afi over here is moderating the discussion. So, a little bit of introductions. My name is Alex Chircop. I'm the co-chair of the CNCF Storage TAG. And this is my co-presenter, Raffaele Spazzoli. I work for Red Hat as an architect, an OpenShift architect in consulting. We also had Xing Yang, who was supposed to join us, but unfortunately she couldn't travel at the last moment. We do have a little video clip from her later on, though, so that's cool.

So we're going to talk about a few things today. We're going to give a little overview of the TAG and, if you're interested, how you can join and help the community. We're going to talk a little bit about cloud native storage, what it is and why it's important. Then we're going to cover some of the documents and materials that we've been working on in the TAG, like our landscape document, the performance and benchmarking work, and also disaster recovery. And finally, we're going to finish off with a quick view of some of the projects in the CNCF landscape for storage.

So, by way of introduction, the CNCF Storage TAG. TAG stands for Technical Advisory Group. It was set up a couple of years back, and originally it was called a SIG, but then we realized that there was too much confusion between Kubernetes SIGs and CNCF SIGs, so we renamed the CNCF SIGs to TAGs. We meet every two weeks, twice a month, at 8 a.m. Pacific, so 5 p.m. European time, and all of our calls are online and everything is public, so please feel free to join. The TAG is a group of diverse people. We have individual contributors and we have representatives from different storage vendors, but fundamentally they all provide expertise in the storage space. We have a number of co-chairs and tech leads, and we liaise with the TOC, which is the Technical Oversight Committee. One of the things that the TAG is designed to do is effectively to help scale the CNCF. As the CNCF keeps on scaling, with hundreds of members and end user organizations, we work with the TOC to help review projects and provide expertise in our particular area, and we help review and do the due diligence on storage projects that are going through incubation or graduation within the CNCF. And finally, we also prepare content to help end users understand cloud native storage and what it does in the environment.

So, I'm just gonna spend a couple of minutes on why cloud native storage is important. I guess you're all here because you know a little bit about it, but I'll go out and say this: there is no such thing as a stateless architecture. All applications everywhere store state somewhere, whether it's a database, a file, a key value store, an object store, or a file system. So storage is an integral part of every part of the environment. And whilst in some of the cloud native space, stateless was the focus for a number of years, stateful applications are a thing, and all the database providers, key value stores, object stores, et cetera, provide a plethora of solutions for this. So, a little bit about what cloud native storage is. A few of the key things are that it's declarative and composable.
So, in much the same way that developers should be able to define CPU, memory, networking, load balancers, et cetera, for their environment and for their application, they should also be able to define the requirements that they need from storage and from the data services for high availability and disaster recovery, and we'll touch on a little bit of that later. The other key important thing is that cloud native storage is application-centric. And what do I mean by that? Today, a lot of storage solutions are server-centric, and we think of storage in a very server or operating system way, where, say, volumes are presented to a particular server or databases are installed on a specific server. But what we're looking at with cloud native storage is the portability of the storage environment and the data services therein, right? So whether you're accessing storage by an API, or with a database, or whether you're accessing storage volumes, for example, you want the storage to be portable and move with the application as your application scales and fails over and gets automated by things like Kubernetes. And I'll touch on this last point here as well, which is a bit of a cliché, and some people roll their eyes, but I also think that cloud native storage should be agile. And what do I mean by that? In cloud native environments, you have lots of moving parts: nodes come and go and get rebuilt, clusters scale up and scale down, and we really need to think about how these systems perform and how security and availability are maintained in that ever-changing environment.

So, in the CNCF Storage TAG, one of the things we did to help end users with this was to create a storage white paper, which describes some of the attributes that we look at in terms of cloud native storage. This was supposed to be presented by Xing, so I'm going to play a quick five-minute video of the presentation here. This is the audio; there's no audio from the video.

Storage systems have several storage attributes: availability, scalability, performance, consistency, and durability. Availability defines the ability to access the data during degraded conditions. Scalability can be measured by the ability to scale the number of clients, the throughput or number of operations, the capacity, and the number of components. Performance can be measured against latency, the number of operations per second, and the throughput. Consistency refers to the ability to access newly created data or updates after they have been committed; a system can either be eventually consistent or strongly consistent. Durability is affected by the data protection layers, levels of redundancy, the endurance of the storage media, and the ability to detect corruption and recover the data. There are several storage layers that can impact the storage attributes. For example, rather than directly accessing resources, a hypervisor can provide access to resources, which could add access overhead. Storage topology describes the arrangement of storage and compute resources and the data links between them; this includes centralized, distributed, sharded, and hyperconverged topologies. Storage systems usually have a data protection layer, which adds redundancy; this refers to RAID, erasure coding, and replicas. Storage systems usually provide data services in addition to the core storage functions, including replication, snapshots, clones, and so on.
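To make the "declarative and composable" idea from a few minutes ago concrete, here is a minimal sketch, assuming a Kubernetes cluster and the client-go library, of a developer declaring a storage requirement the same way they declare CPU and memory. The StorageClass name, namespace, and claim name are hypothetical, and exact field names can vary slightly between client-go versions.

```go
// Illustrative only: a "declarative" storage request on Kubernetes using client-go.
// The developer states what is needed (capacity, access mode, class of service);
// the platform decides how and where to satisfy it.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: running outside the cluster with the default kubeconfig.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	sc := "replicated-fast" // hypothetical StorageClass exposing the data services we want
	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "orders-db-data"},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			StorageClassName: &sc,
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("100Gi"),
				},
			},
		},
	}

	created, err := client.CoreV1().PersistentVolumeClaims("demo").
		Create(context.TODO(), pvc, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("declared claim:", created.Name)
}
```

In practice the same request is usually written as a YAML manifest; the point is that the requirement lives with the application's declaration rather than with any particular server, so it can follow the application as it scales or fails over.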
The storage system ultimately persists data on a physical storage layer, which is usually non-volatile; this has an impact on the overall performance and the long-term durability. In this diagram, we can see that workloads consume storage through different data access interfaces. There are two categories of data access interfaces here; we call them volumes and APIs. The container orchestration system has interfaces for volumes, which support both block and file systems. Under APIs, we have the object store API that stores or retrieves objects. Note that there is a Kubernetes SIG Storage subproject called COSI, the Container Object Storage Interface, which introduces Kubernetes APIs to support orchestration of object store operations for Kubernetes workloads. It also introduces a set of gRPC interfaces so that an object storage vendor can write a driver for provisioning and accessing object stores. This is targeting alpha in the Kubernetes 1.25 release. Under APIs, we also have key value stores and databases. Now, let's take a look at the orchestration and management interfaces. The control plane interfaces here refer to storage interfaces directly supported by container orchestration systems; this includes the Container Storage Interface, CSI, and the Docker volume driver interface. This orange box here is an extension of the control plane interfaces. For application APIs, including key value stores and databases, container orchestration systems currently don't have direct interfaces yet, but we could have operators to support key value stores or databases working in Kubernetes. That's all I have for the storage landscape white paper. Now, I will hand it over to Alex to talk about the performance white paper.

So we'll move on. The storage white paper effectively covers the attributes and the different layers of the storage system and how they interact with each other, which is so important to understand nowadays because so many of the systems that we use are formed of composite layers. For example, you have file systems made up of object stores, and databases built on key value stores, et cetera, and therefore they inherit attributes around the way they scale and the way they perform based on those different layers. We then took the next step and asked what we were going to do after the storage white paper, and we decided to pick two attributes, performance, and recovery and availability, to do a bit more of a deep dive into how we can get to the bottom of some of these things. So we put together a white paper on performance, and it's still open for contributions, by the way, so anybody who's interested in contributing is very welcome. We're covering a number of different concepts. We're looking at how to benchmark databases and volumes, primarily, as two of the first main things. And what we're covering is the basics. For example, what do we want to look at in terms of operations versus throughput? Sometimes it's more important to measure the number of operations per second, for example if you're talking about databases and transactions, and sometimes it's much more important to have, say, sequential throughput capabilities for things like analytics, for maybe something like Elasticsearch. Those sorts of systems can have very different compromises in terms of how they're put together.
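As a rough illustration of the operations-versus-throughput distinction just described, here is a minimal sketch, not a replacement for a proper tool like fio or the methodology in the paper, that times small random writes against large sequential writes on the same file. The file name, sizes, and counts are arbitrary, and the Sync() calls are there so the measurement is not simply the page cache.

```go
// Minimal sketch: "ops per second" and "throughput" are different measurements.
package main

import (
	"fmt"
	"math/rand"
	"os"
	"time"
)

func main() {
	f, err := os.Create("bench.dat")
	if err != nil {
		panic(err)
	}
	defer os.Remove("bench.dat")
	defer f.Close()

	const fileSize = 256 << 20 // 256 MiB working set
	if err := f.Truncate(fileSize); err != nil {
		panic(err)
	}

	// Random 4 KiB writes: report operations per second (an IOPS-style number).
	small := make([]byte, 4<<10)
	const ops = 2000
	start := time.Now()
	for i := 0; i < ops; i++ {
		off := rand.Int63n(fileSize - int64(len(small)))
		if _, err := f.WriteAt(small, off); err != nil {
			panic(err)
		}
	}
	f.Sync() // flush so the device, not the cache, dominates the measurement
	elapsed := time.Since(start).Seconds()
	fmt.Printf("random 4KiB writes: %.0f ops/s\n", ops/elapsed)

	// Sequential 1 MiB writes: report MiB per second (a throughput-style number).
	big := make([]byte, 1<<20)
	const chunks = 256
	start = time.Now()
	for i := 0; i < chunks; i++ {
		if _, err := f.WriteAt(big, int64(i)*int64(len(big))); err != nil {
			panic(err)
		}
	}
	f.Sync()
	elapsed = time.Since(start).Seconds()
	fmt.Printf("sequential 1MiB writes: %.0f MiB/s\n", float64(chunks)/elapsed)
}
```

On a system with a fast page cache and a slow disk, dropping the Sync() calls will typically inflate both numbers dramatically, which is exactly the cache pitfall discussed a little later.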
And then we're trying to understand and measure how things like the topology, the data protection, the data reduction, and encryption affect the overall performance of the system, both in terms of adding additional latency or affecting throughput, and also in terms of the additional levels of complexity or the topology differences that you see, say, between hyperconverged or disaggregated systems or remote systems, for example. In many of the discussions, latency tends to be a really big factor in determining a lot of these things. So we'll see compromises, or different pros and cons, between different systems: the data protection, things like encryption, and also the topology affect latency, and that tends to directly affect things like transactions per second in an environment. So latency is often one of those key things, but so is concurrency. One of the things we want to measure is how applications scale in these environments, and concurrency is often one of the key factors there, where we're talking about how many clients can connect in parallel, how many parallel threads, and how many parallel queues can operate. And of course, in all of these systems, caching happens at multiple layers: at the operating system layer, at volume manager layers, at the file system layers, page caches, device layers. So it's important to really understand what you're measuring and how much of an impact the cache is having in those environments to actually get a real understanding of the system. And of course, we also want to set a level playing field, right? That means understanding how to manage the environment where you're doing performance testing or benchmarking and making sure that you have the right headroom both in the environment, whether it's bare metal or cloud, and in the client, which needs headroom to actually maximize the performance testing. One of the key takeaways, though, is that it's really important to focus on testing your own applications in your own environments. It's really, really hard to compare published results without a deep understanding of the test conditions. And in fact, a lot of the paper is actually dedicated to the pitfalls and the common issues that people encounter when doing performance benchmarking. I can't tell you how many benchmarks I've seen where somebody has published, for example, "I just got two gigabytes per second in my file system," and then you ask them what they're running on and they're running on a hard drive that can only give you 200 megabytes per second. So they were really only testing the speed of their cache rather than the speed of their file system, for example. So those are the things that we need to look at. The next thing that we covered was disaster recovery, which Raffaele is going to take us through.

Thank you, Alex. Can you hear me? Okay. In the cloud native disaster recovery white paper, we examine a question, which is: what should disaster recovery look like in the cloud? And we propose an approach which, surprise, we call cloud native disaster recovery. It's an approach that you should know about. We don't think you should always use it, but I think it's good to know that it exists, that it's an option, and that we have done some studies around it. Now, to compare and contrast cloud native disaster recovery, the approach that we describe in the white paper, with traditional disaster recovery, let's use this table and look at the main differences.
So first of all, the deployment architecture that we normally see in big organizations, in large organizations, for traditional disaster recovery is active-passive, meaning we have an active data center or an active cloud region, and when something goes wrong, the workload is moved somehow to a passive location. In cloud native disaster recovery, we want to do active-active deployments. Then the trigger for the disaster: the decision that we are experiencing a disaster situation and we have to start moving the workload or reacting to it. In traditional disaster recovery, it's typically a human-based decision. Something goes wrong, all the alerts fire, people meet, and then they decide, okay, we cannot recover, we need to trigger the disaster recovery procedure. In cloud native disaster recovery, we want autonomous decisions made by the system. So an autonomous decision, and then also, as we see in the next line, an automated response to the disaster. The procedure itself: normally what we see in large organizations is a mix of automated and manual tasks. The better the organization is, the more automation there is, but the trigger, like we said, is often human, there are human tasks, and then there are some other things that are typically done manually. In cloud native, again, we want a fully automated recovery procedure.

And then with regard to the two main metrics that you can use to measure how well you're doing disaster recovery: the recovery time objective, which is how long it takes before you're up and running again, and the recovery point objective, which is how many transactions you have lost because of the outage when you recover. In traditional DR, the RTO usually ranges from zero, if it's a very good organization, to more likely hours. In cloud native DR, we are close to zero, and it's really seconds: it's the time for the health checks to react and understand that there's a disaster and for the global load balancer to swing the traffic. And for the RPO, it's again zero to hours depending on what approach is used to persist data. But in cloud native disaster recovery, it's exactly zero, so I never lose data, if I'm using a strongly consistent approach. And it's theoretically unbounded, but for practical deployments close to zero, if I'm using an eventually consistent deployment.

And then, looking more at the organization side, the owner of the disaster recovery procedure in traditional DR is actually the storage team. Formally, the application teams have to build their own business continuity document or process, but what they usually do is turn to the storage team and ask, what is your disaster recovery procedure? What are your SLAs? And then they adopt whatever the storage team is doing in that organization. So it's really the storage team driving disaster recovery for the entire company. But in cloud native DR, the application teams are the owners of the disaster recovery procedure. That's because in cloud native DR, DR is really a responsibility of the middleware, so it's gonna be a responsibility of the database, of the queue system, of the cache, and those middleware products are now owned by the application teams. And then in terms of capability, this is a finding that we made by actually implementing these architectures.
In terms of capability, for traditional DR, we usually rely on storage capabilities in order to implement these architectures and the disaster recovery procedures. So we use things like backup and restore or volume replication, synchronous or asynchronous. But in cloud native DR, what we really need are capabilities from networking, more than from storage. We need the ability to communicate east-west between our regions, because remember they are active-active, so traffic is flowing through all of the locations. And we need a good global load balancer, something that not only spreads the traffic but also has some intelligence to understand which locations are active and can swing the traffic automatically when there is a problem. Okay?

So in the white paper, what you can find is the definition that I just gave you, then some other more technical definitions about what a failure domain is and what HA and DR are. We cover the CAP theorem a little bit, and of course, if you're interested, you should go and read about it in much more depth, but all of this new generation of middleware that can be deployed the way I described is really built around the concepts of the CAP theorem, so it's something that we should know. And then we talk about the anatomy of these distributed stateful workloads, with shards and replicas, and I'm gonna show a little bit of that. And then we talk about the consensus protocols that are needed to coordinate all of the instances, because obviously we have a multi-instance deployment for these stateful workloads. And then we look at some reference implementations, or reference architectures, for both strongly consistent deployments and eventually consistent deployments.

So I'm gonna pick a few of these things just to give you a little bit of an overview of what you can find in this white paper. I find this one interesting: this is the anatomy of the stateful application. If you abstract this stateful middleware enough, you will find that they all look the same. Whether it's a cache, a queue system, a SQL database, or a NoSQL database, they always have a similar structure. They have replicas, which is how they achieve availability, and then they have partitions, which is what they use to achieve scalability, right? And then each instance has layers, at least conceptual layers: they have a storage layer, which is the code that communicates with the disk, with the actual volumes; then they have a coordination layer, which is what helps them coordinate with the rest of the instances; and then they have an API layer, which really defines the identity of the type of workload. So a queue system has a different API than a SQL system or a NoSQL system. And in the coordination layer we can identify two kinds of coordination. There is inter-replica coordination, to make sure that every replica is doing the same thing and they're always aligned on the same state, the same view of the state. And then we have inter-partition coordination, which is needed when the software supports inter-partition transactions. So, for example, if you need to put two messages in Kafka into different partitions with a single transaction, that will be an inter-partition transaction. So then what we did is take some of this new generation of middleware and analyze it in terms of the coordination protocols that it uses.
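One way to picture the anatomy Raffaele just walked through is the following purely illustrative set of Go types, not any real product's API: replicas for availability, partitions (shards) for scalability, and per-instance storage, coordination, and API layers, with the two kinds of coordination called out.

```go
// Purely illustrative types for the anatomy of a stateful workload described above.
package anatomy

// StorageLayer is the code path that talks to the local volume or disk.
type StorageLayer interface {
	Append(entry []byte) error // persist a state change locally
}

// CoordinationLayer keeps instances agreed on a single view of the state.
type CoordinationLayer interface {
	ReplicateToPeers(entry []byte) error  // inter-replica coordination (e.g. Raft or Paxos)
	CommitAcrossShards(txID string) error // inter-partition coordination (e.g. two-phase commit)
}

// APILayer is what gives the workload its identity: a SQL endpoint,
// a queue protocol, a key value API, and so on.
type APILayer interface {
	Serve() error
}

// Instance is one replica of one partition, composed of the three layers.
type Instance struct {
	Storage      StorageLayer
	Coordination CoordinationLayer
	API          APILayer
}

// Partition scales the dataset horizontally; Replicas keep each partition available.
type Partition struct {
	ID       int
	Replicas []Instance
}

// Workload is the whole stateful middleware: a set of partitions
// spread across failure domains.
type Workload struct {
	Partitions []Partition
}
```

Real systems collapse or split these layers in different ways; the value of the abstraction is simply to show where inter-replica and inter-partition coordination sit, which is where the next point about consensus protocols comes in.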
And what we found out is that Raft and Paxos are obviously the two most common consensus protocols for the replica coordination, Raft now being the preferred one because it's easier to implement. And on the shard consensus protocol, or partition consensus protocol, side, two-phase commit or derivations of two-phase commit are actually what's being used. And then I'm gonna close with a little overview of the reference architecture for a deployment of these kinds of workloads on Kubernetes. So you can see we have three failure domains here; this is a strongly consistent deployment, so we need at least three failure domains. This is because of the CAP theorem. And so we have the stateful workload deployed in each of these failure domains, and the yellow arrows indicate the fact that they need to be able to talk to each other even across failure domains. That's the east-west network capability that I was mentioning before. So if you're on Kubernetes, you need a way to have inter-cluster communication, and there are several ways to do it. And then we probably have a front end in front of our database or stateful workload. And then we have a global load balancer that is supposed to do some health checking and decide when to swing the traffic.

The other picture here is another analysis that we do, because you should not just analyze what happens when you lose an entire region, but also what happens when a region is network partitioned but is still available, still communicating with its clients, and still able to communicate with the global load balancer. In this case, with strongly consistent workloads, because the instances that are partitioned don't have a strict majority, they cannot form quorum. They will put themselves offline, and the global load balancer will be able to see that that location is not available and will swing the traffic to the available locations. So even in this situation, you will get the RPO and RTO that we were discussing before: close to seconds of unavailability and zero data loss. And with that, I'm gonna give it back to Alex, or Xing actually, I think.

Okay, great. So we've talked about disaster recovery, thank you, Raffaele, and we've talked about performance and the white paper. One of the common things that we get asked in the TAG is how we work with the TOC in terms of the projects that get approved and join the foundation. So Xing's gonna talk a little bit about that now, and hopefully I can get this going. Here we go.

CNCF projects have three stages. Sandbox is the earliest stage, meaning the project is still experimental; donating it to the CNCF will help build a stronger community around the project. Incubation is the second stage. It means the project has been used successfully in production and has a healthy number of committers. Moving from sandbox to incubation is supposed to be difficult; it needs to go through the due diligence review. And graduation is the highest stage. It means the project has mainstream production use, has passed security audits, and has committers from multiple organizations. I'm going to talk about graduated and incubating CNCF storage projects. Rook is a graduated project. It's a cloud-native storage orchestrator for Kubernetes. It has stable support for Ceph and alpha support for NFS. Vitess is a graduated project. It is a database clustering system for horizontal scaling of MySQL. etcd is a graduated project. It's a distributed key value store; all Kubernetes clusters use etcd as the primary data store. And TiKV is a graduated project.
TiKV is a distributed key value database built in Rust. Harbor is a graduated project. It's a cloud-native registry project. Dragonfly is an incubating project. It's a P2P-based cloud-native image and file distribution system. Longhorn just became an incubating project. It's a distributed block storage system for Kubernetes. And finally, we have CubeFS, previously ChubaoFS, which is newly incubating. It's a distributed file system and object store for cloud native apps. This slide shows a list of other storage projects in the CNCF; there are a few more sandbox projects shown here. That's all we have for CNCF projects.

Awesome, thank you, Xing. I just wanted to cover off some of this process, because we sometimes get questions from maintainers on this, and we wanted to clarify that sandbox projects tend to have a low bar to entry; those projects are specifically there to help build a community. The incubating projects have the most due diligence performed on them, and those projects get there once they have production users and a number of maintainers within the environment. And graduated projects are then the final step, where they go through a process to perform security audits and additional checks and have formal governance and distributed maintainers to ensure the longevity of the project. Okay, so with that, I'll be ending the talk, but before we end, I just wanted to again invite everybody to join the community and join our TAG. We'd love to hear from you if you're able to contribute to any of the white papers that we're building or help us with the due diligence processes that we're working on for projects. And of course, we're happy to take questions from anybody.

This is more to do with the automated DR part of the recovery process that you showed us. A lot of the workloads that are being run on Kubernetes are now databases, and in that case, just a CSI snapshot wouldn't guarantee a consistent backup of the database itself. There's specific database tooling needed to get that sort of automated RPO going. Do you feel there are some synergies there that could happen between databases and basic storage projects?

Yeah, we use an extended definition of storage. For us, databases, queue systems, and caches are also storage, not just file systems, object storage, and block storage. So yes, to build those architectures, you need stateful middleware that you can deploy that way. I call it this new generation of stateful middleware that was built around the CAP theorem: not monolithic, but inherently distributed middleware. And when you do that, as I was trying to show with the picture, let me see if I can... yeah, you see the synchronization happens at the stateful middleware layer, not at the storage layer. So these disks here, these volumes, are completely unaware of each other. You don't need to take backups or volume snapshots and restore them; all the synchronization happens at the transaction level and the stateful workload level. So the key is choosing a product that can actually do that.

And just another point: it's about how all of the different layers work together, right? Like we discussed in the white paper, you get certain attributes of availability or scalability through those different layers, and they need to integrate well. There was also one question online that I will mention, which was: is the talk gonna be recorded? And the answer is yes, of course, all the talks are recorded and they are posted very quickly.
So, to the person online, that is the answer to that question.

Hi everybody, hi guys. Thanks a lot, great presentation. I was wondering: there's a lot of movement around use cases that need to achieve low latency, where we move the workloads to the edge. Are you considering these sorts of edge setups in your analysis and white papers?

Your microphone hasn't magically started working yet, Alex. Okay, I'm not gonna need to go for a run today. So the question was about low latency in Kubernetes environments. I think what we're seeing, and certainly what I'm seeing with both the community that we work with and the customers that I work with, is that the latency comes down to two factors primarily: the physical media and the overall latency through the storage stack, but also the networking, right? So some of the discussions that we were having earlier around the different attributes and where you make compromises are exactly the sort of things that affect latency. For example, if you employ replicas, you tend to get lower latency, but if you employ erasure coding, you get higher latency. If you do replication across regions like Raffaele was discussing, you could get higher latency, but you could also use eventual consistency in some of those scenarios to improve latency. We're certainly seeing environments doing extremely low latency, well below a millisecond for database transactions, and it is definitely possible to achieve extremely low latency in Kubernetes environments. We're seeing, both in cloud environments and on-prem, the availability of NVMe disks, for example, that support hundreds of thousands of IOPS, and 10, 40 and 100 gig connections on cloud instances, which can help with all of those situations. I don't think... yeah, no, it's not on.

Yeah, I think there was an edge component in the question, right? We don't explicitly talk about the edge in this; I don't think we have analyzed that scenario much. But, and I'm not an expert, from the little that I've done with edge deployments, I think in those cases you are willing to sacrifice consistency for latency. So you have a quicker local response, and then when the network is available you synchronize with the central data center. Yeah, you have to pick the right middleware for that kind of workload.

Please correct me if I'm wrong, but in your comparison table between traditional deployments and Kubernetes deployments, isn't there some kind of mismatch between HA and DR? For example, in a traditional deployment, if I want to provide HA, I deploy several instances in the same data center or in different data centers and provide the east-west communication between them. It can be active-active or active-passive, so it's very close to your example in Kubernetes. On the other hand, in Kubernetes maybe we need to provide some backup as well. Okay, we deploy in different data centers, but we will maybe require backup and restore too. So, I don't understand. Yeah, go ahead, yeah.

So, you're absolutely right, and when we were putting that table together we had quite a lot of debate as to what each of the terms should mean. What we decided was that it's impossible to account for all of the different possible scenarios and all of the different edge cases, so we decided to put together something that represented the most informed and most common scenarios that we were coming across. In the actual document itself we then cover a lot more detail on the variations of those scenarios.
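As an aside on the replicas-versus-erasure-coding trade-off mentioned in the latency answer above, here is a back-of-the-envelope sketch of the capacity side of that trade-off. The 3-way replication and 4+2 erasure coding schemes are just common illustrative examples, not a recommendation.

```go
// Back-of-the-envelope capacity arithmetic for two data protection schemes
// with the same fault tolerance (two component failures).
package main

import "fmt"

// overhead returns how much raw capacity is needed per byte of usable data.
func overhead(dataShards, redundantShards int) float64 {
	return float64(dataShards+redundantShards) / float64(dataShards)
}

func main() {
	// 3-way replication: 1 data copy plus 2 redundant copies.
	fmt.Printf("3x replication : %.1fx raw capacity, tolerates 2 failures\n", overhead(1, 2))

	// 4+2 erasure coding: 4 data shards plus 2 parity shards.
	fmt.Printf("4+2 erasure    : %.1fx raw capacity, tolerates 2 failures\n", overhead(4, 2))
}
```

The erasure-coded layout needs half the raw capacity of 3-way replication for the same fault tolerance, but every write touches more components and more network hops, which is one source of the extra latency mentioned in the answer above.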
Thank you very much. Thank you.