Hello, welcome everyone. This session is about the CNCF Storage TAG, which deals with all things storage in the CNCF landscape. My name is Alex Chircop; I'm a product architect at Akamai, and these are my two colleagues.

Hi, I'm Xing Yang. I work at VMware in the cloud native storage team. I also co-chair TAG Storage with Alex, and I co-chair Kubernetes SIG Storage.

And Raffaele Spazzoli, consulting architect at Red Hat.

Great. So thank you all for joining us. We want to use the session to talk a little bit about the TAG and what we're doing, some background around cloud native storage, and then we want to go through some of the documents and community work that we're doing around the projects, as well as the white papers we've been working on. We'd like to finish off this presentation with maybe a few volunteers who might want to help with our growing community.

So, a little bit about TAG Storage. TAG stands for Technical Advisory Group; TAGs used to be called CNCF SIGs a couple of years back when they were first created. Our calls are held twice a month, on the second and fourth Wednesday, and they're all open to the public, so please join if you're interested in this space. All of our minutes, recordings, and the mailing list are also public. Again, please join if you want to join the community.

Who are we? Each TAG has a number of co-chairs and tech leads, as well as a broader set of vendors and end users who join the community, and we're supported by our TOC liaison. The TOC is the group that makes the technical decisions within the CNCF, and at the moment our liaison is Erin Boyd.

So what do we do? The concept behind the TAGs was to help the CNCF scale: we have an ever-growing list of projects and technologies in the ecosystem, and we're here to provide subject matter expertise, specifically in the storage space, to the CNCF. Effectively, we do a number of different things. One of the key things is to help educate and provide content for end users looking to adopt storage in a cloud native way. We also help the TOC in reviewing projects and going through the due diligence process that allows projects to move from sandbox to incubation and finally to graduation. We work with the community in sessions like this and in our calls. Finally, we provide that subject matter expertise in the storage space for the TOC.

There are a number of projects which form part of the storage landscape within the CNCF. Some of these are very big projects which you may already be using on a day-to-day basis: things like etcd; Rook, which provides an object store; Vitess, a distributed database; Harbor, a container registry; and TiKV, a distributed key-value store. We also have a number of incubating projects, like Dragonfly, CubeFS, and Longhorn, which are going through the process towards graduation. On top of that, there's a whole suite of sandbox projects; I think in total there are maybe 80 sandbox projects today. It would be good to have a look at the sandbox projects folder on that link to see some of the projects which are looking for a community and for end users to join them.

So we spoke about sandbox, incubation, and graduation. Some of you might be familiar with this, but I thought it would be useful to go through those levels and explain what we mean by them.
So sandbox projects are the projects with the lowest bar for entry. The idea here is to use the resources of the CNCF to help foster those communities and projects. Projects in their early stages can look to build adoption, build their governance, and get help with their IP, licensing, and things like that.

From sandbox, we move into incubation. Incubation is probably almost the hardest level to achieve among the CNCF project stages, because it has quite a high bar: a project in incubation is in use in production and is stable, and there are a lot of requirements around that. Therefore the bulk of the due diligence that we do is on projects moving into incubation. We look at things like successful use in production; we look at end users who are using the project and interview them; and we look at healthy project metrics, as in the growth and the governance of those projects.

Graduation is the last, final step. That's for projects like Kubernetes and etcd, for example, which also go through additional security audits and have committers from multiple organizations to help demonstrate that final level of maturity.

So we're talking about cloud native storage. Why should we be thinking about cloud native storage? Isn't it the case that containers are meant to be stateless? Well, there's no such thing as a stateless application. Every application is going to be storing state somewhere, whether it's a database, a file system, or a key-value store. And then the question is: do you run those stateful parts of your workload, those stateful parts of your application, together with the rest of your cloud native application, or do you run them outside of it? Of course, we argue that having the storage and the stateful workloads as part of your cloud native workflow gives you the automation, the scale, the performance, and the health checking and failover that are built into all of the patterns and architectures you've been using for cloud native. Why go through the process of having runbooks, workflows, CI/CD, and security scanning for your cloud native applications and a different set of processes for your databases, key-value stores, message queues, and so on?

And the reality is that there is now an incredibly broad ecosystem with CSI support and standards for implementing volumes of all types, and stateful storage of all types, within your cloud native environments. There are possibly 150 different CSI drivers; I think there might even be more. Apart from that, what we're now seeing is a very healthy and mature ecosystem around operators for databases, message queues, and everything else. So if you want to run Postgres, Mongo, RabbitMQ, or a variety of other systems, all of those are simple options which can now be implemented declaratively, with automated upgrades, failovers, and day-2 operations within your cloud native environments (a small sketch of that declarative style follows below).
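To make "declarative" concrete, here is a minimal sketch of requesting a CSI-provisioned volume through the official `kubernetes` Python client. The StorageClass name `fast-ssd` is hypothetical; it stands in for whichever CSI-backed class your cluster actually exposes.

```python
# A minimal sketch of declarative volume provisioning through the
# official `kubernetes` Python client (pip install kubernetes).
from kubernetes import client, config

config.load_kube_config()  # assumes a working kubeconfig

pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "demo-data"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "fast-ssd",      # hypothetical CSI-backed class
        "resources": {"requests": {"storage": "10Gi"}},
    },
}

# We only declare the desired state; the CSI driver behind the
# StorageClass does the actual provisioning on any conformant cluster.
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="default", body=pvc)
print("PVC submitted; the cluster reconciles it into a real volume")
```

The same pattern scales up to operators: you declare, say, a Postgres cluster as a custom resource, and the operator reconciles it, upgrades and failover included.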
So, to help people understand all of this, we put together our first storage white paper a couple of years ago, and Xing is going to take us through some of that.

Thanks, Alex. So I will talk about our storage landscape white paper. In this white paper we talked about storage system attributes, about the different layers, or stacks, of a storage solution and how they impact those attributes, and about the definitions of data access interfaces and management interfaces.

Storage systems have attributes: here we have availability, scalability, performance, consistency, and durability. Availability refers to the ability to access data during failure conditions; this can be measured by uptime as a percentage. Scalability can be measured by the ability to scale in the number of clients, the number of operations, the throughput, the capacity, and the number of components. Performance can be measured in terms of latency, operations per second, and throughput. Consistency refers to the ability to access newly created data, or updates that have already been committed; it can be eventual or strong. Strong consistency is synchronous; eventual consistency is asynchronous (there's a small sketch of this distinction below). And durability is affected by the data protection layer, the level of redundancy, the endurance of the storage media, and the ability to detect corruption and recover from it.

We also have different storage layers that affect those storage attributes. These include the host and operating system, the storage topology, the data protection layer, additional data services provided by a storage solution, and finally the physical non-volatile layer. That's all I want to cover for the storage landscape white paper.

Now I want to move on to a new initiative that we are working on. We are collaborating with the Data on Kubernetes community to write a white paper describing the patterns for running data on Kubernetes. In the first version of the white paper we are focusing on databases; however, most of what we describe in the paper applies to other types of data workloads as well. In the paper we talk about how the storage system attributes affect running data on Kubernetes, we compare running data inside versus outside of Kubernetes, and we describe some of the common patterns used when running data on Kubernetes. We have a draft paper out; please help review it. We have a link there.

I said earlier that there are storage system attributes, and those apply to running data on Kubernetes as well; here we added a couple more attributes, observability and elasticity. In a cloud native environment we have many different microservices running in a distributed fashion, and if something happens it's hard to determine which component is causing the problem, so it is even more important to have a comprehensive observability system built in, so that we can detect problems early and prevent failures from happening. Elasticity refers to the ability to quickly scale up and down; this can be thought of as on-demand infrastructure. It can refer to the ability to release resources quickly when they are no longer needed, or to storage tiering, where we move data across different storage tiers based on how often the data is accessed.

I also mentioned that there are different storage stacks that affect the storage system attributes, and that is also true for running data on Kubernetes. Regarding disaster recovery, Raffaele will talk about that later.

We have options to run data inside or outside of Kubernetes. Deploying and operating databases without proper automation is an anti-pattern and is not recommended, so there are mainly two options left: managed database services provided by most cloud providers, or running data inside Kubernetes.
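Before moving on to operators: to make the synchronous-versus-asynchronous consistency point above concrete, here is a small, purely illustrative Python toy (ours, not from the white paper). A strongly consistent write is acknowledged only after every replica has applied it; an eventually consistent write is acknowledged immediately and replicated in the background, so stale reads are possible in the meantime.

```python
import random
from collections import deque

class Replica:
    def __init__(self):
        self.data = {}

class ToyStore:
    """A toy key-value store with N replicas, in strong or eventual mode."""
    def __init__(self, n_replicas=3, strong=True):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.strong = strong
        self.pending = deque()  # replication backlog (eventual mode only)

    def write(self, key, value):
        if self.strong:
            # Synchronous: acknowledge only once every replica has
            # applied the write, so any later read will see it.
            for r in self.replicas:
                r.data[key] = value
        else:
            # Asynchronous: acknowledge immediately, replicate later.
            self.pending.append((key, value))

    def replicate_one(self):
        # One background replication step (eventual mode).
        if self.pending:
            key, value = self.pending.popleft()
            for r in self.replicas:
                r.data[key] = value

    def read(self, key):
        # Reads may land on any replica.
        return random.choice(self.replicas).data.get(key)

strong = ToyStore(strong=True)
strong.write("x", 1)
print("strong read:", strong.read("x"))                    # always 1

eventual = ToyStore(strong=False)
eventual.write("x", 1)
print("eventual read (before sync):", eventual.read("x"))  # may be None
eventual.replicate_one()
print("eventual read (after sync):", eventual.read("x"))   # 1
```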
You can use a Kubernetes operator to facilitate running data on Kubernetes, and you get the benefit of multi-cloud support and cross-cloud portability. The operator uses a declarative API that reconciles the actual state with the desired state, and it automates operations such as upgrades, backup and restore, data migration, and so on. We can also use other tools, such as Prometheus and Grafana, for monitoring.

There are common patterns and features used when running data on Kubernetes; typically we use an operator to facilitate that, and there are things we should consider when writing one. We should look at what kinds of configuration parameters we want to expose to users; it's not necessarily "the more the better", it's a trade-off between flexibility and complexity. An operator should support non-destructive upgrades and should manage different versions of its CRDs. It should also support periodic operations, like re-indexing or backup and restore. An operator typically uses persistent volumes to store data, and persistent volumes are typically provisioned by a CSI driver; a CSI driver basically allows storage to be consumed by the containers running in Kubernetes. That's all I'm going to cover for the DoK paper. I'm going to hand over to Alex to talk about the performance white paper.

Thanks. So the other thing we wanted to share as part of that white paper is an understanding of how each of the different systems has pros and cons: they have different facilities and different optimizations, where they can optimize for availability, or for performance, or for consistency, and each of those choices affects the others in different ways. Systems that are optimized for the lowest latency might not be suitable for the highest throughput; similarly, systems that are optimized for synchronous consistency and availability might have lower performance, and vice versa. So one of the things we decided, after describing these attributes and the trade-offs between them, was to go into more depth on some of these areas, and so we came up with a performance white paper, and Raffaele is going to talk about cloud native disaster recovery, because some of this stuff is complex and layered throughout the storage system.

In our white paper we go through the definitions of some of the performance concepts: common concepts like how to actually measure performance and how to benchmark things like volumes and databases. More important than that, we also defined a lot of the criteria to help users avoid the common pitfalls people run into when measuring performance: understanding the difference between operations or requests per second versus throughput in megabytes or gigabytes per second; understanding the effects of topology, data protection, data reduction, and encryption, and how those affect things like latency and concurrency; and understanding how caching operates at so many layers in a cloud native stack, all the way from the operating system to the application, and how that impacts how you benchmark and what you're actually benchmarking. One of the most important takeaways was that it is incredibly hard to do apples-to-apples comparisons in any of these environments, so published results from vendors can be very difficult to evaluate.
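As a tiny illustration of the ops-versus-throughput pitfall (this arithmetic is ours, not from the paper): IOPS and MB/s describe the same device through the lens of the I/O size, so a single headline number means little without the block size attached.

```python
# IOPS and MB/s are two views of the same workload, linked by I/O size.

def throughput_mb_s(iops: float, block_size_kb: float) -> float:
    """Convert an IOPS figure to MB/s for a given block size."""
    return iops * block_size_kb / 1024

# A hypothetical volume rated at 100,000 IOPS:
for bs_kb in (4, 64, 1024):
    print(f"{bs_kb:>5} KiB blocks -> "
          f"{throughput_mb_s(100_000, bs_kb):>8.1f} MB/s")

# 4 KiB random I/O and 1 MiB sequential I/O give wildly different MB/s
# from the same IOPS rating, which is why a headline number on one
# vendor sheet is rarely comparable to a headline number on another.
```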
You really just need to understand what your application objectives are when you're looking at these performance requirements, and running your own tests, in your own environments, with your own applications, is typically the best solution. I've lost track of the number of times I've seen, for example, a benchmark claiming 2 gigabytes per second on a volume, and then you realize that the volume is a fraction of the size of the system memory, and what's actually being tested is how fast the memory performs (a quick sanity check for that pitfall is sketched below).
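A rough, purely illustrative way to catch that mistake, assuming a POSIX system (the `os.sysconf` keys used here are not available on Windows, and the factor of two is just a heuristic of ours):

```python
# Sanity check for a common benchmarking pitfall: if the test file
# fits in RAM, you are mostly measuring the page cache, not the disk.
import os

def looks_cache_bound(test_file_gb: float, ram_gb: float,
                      safety_factor: float = 2.0) -> bool:
    """Heuristic: the working set should be well above RAM size
    before a read benchmark says much about the storage itself."""
    return test_file_gb < ram_gb * safety_factor

# Total physical memory in GiB (POSIX only).
ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3

for size_gb in (8, 64, 512):
    if looks_cache_bound(size_gb, ram_gb):
        verdict = "probably cache-bound"
    else:
        verdict = "plausibly exercising the storage"
    print(f"{size_gb:>4} GiB test file on a {ram_gb:.0f} GiB host: {verdict}")
```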
So understanding these things in your environment, with your sizing, your applications, and your types of server instances, in the cloud or on-prem or wherever, is the most important thing, and we try to help you do that. And now I'm going to pass over to Raffaele to talk about disaster recovery.

Thank you. Okay, so what can you find in the disaster recovery white paper? You can find an approach to disaster recovery which we call, of course, cloud native disaster recovery, and the message here is that it's an approach you should know about. We don't claim that it's always the best approach, or that you should use it every time, but we think you should know it and consider it the next time you design your disaster recovery procedures.

So how does it compare to traditional disaster recovery? We prepared a table to help with that comparison. For example, the type of deployment that traditional organizations use for disaster recovery is usually active-passive, in particular for stateful workloads. For cloud native disaster recovery we want active-active all the time, so every failure domain can receive writes; by failure domain here we mean a data center or a cloud region.

Then, what or who detects that there is a disaster and starts the recovery procedure? In traditional disaster recovery it's usually a human: there is an outage, a committee of people meets, and after trying to resolve the problem they decide that it's too hard or would take too long, and they press the button to start the recovery procedure. For cloud native disaster recovery we want that decision to be autonomous: the system needs to realize that something is wrong and start the recovery by itself.

Then, for the execution of the recovery procedure itself: in traditional DR, what we see in many organizations is a mix of manual and automated tasks (if the organization is good, they probably have more automated tasks, otherwise more manual ones, but it's always a mix). For cloud native DR, instead, it must be fully automated.

Then, with regard to RTO and RPO, which are the main metrics for disaster recovery: RTO, just as a reminder, is how long it takes for the application to come back online when there is a disaster, and RPO is the extent of transactions, in time, that we lose because of the disaster. For traditional DR, RTO ranges from zero to hours, again depending on how good the automation is. For cloud native DR, RTO is close to zero, essentially just a few seconds: the time for the health checks to realize that something is wrong and for the load balancer to swing the traffic automatically to the healthy failure domains. For RPO in traditional DR, depending on the kind of storage replication you have, it could be zero if you have synchronous replication, or it could be hours if you have scheduled backups. For cloud native disaster recovery it's going to be exactly zero if you use a strongly consistent middleware, and theoretically unbounded, but from a practical standpoint close to zero, if you use an eventually consistent middleware.

Then, less technical and more on the organizational side: the owners of the disaster recovery process are traditionally the storage teams, because they have the technology to replicate the storage. Even if, conceptually, the owner of the business continuity plan is the application owner, they always turn around to the storage team, ask what SLA it can guarantee, and that becomes the SLA of the solution. For cloud native DR, instead, the owner of the disaster recovery procedure is squarely the application team, because they are the ones who choose the stateful middleware, and we no longer rely on the storage team for that.

And then, on implementing these architectures: traditionally we rely, as I said, on storage capabilities to build our disaster recovery procedures, so we have things like backup and restore or volume replication. For cloud native disaster recovery we rely more on networking capabilities: in particular, good east-west communication between failure domains, which is needed for the stateful middleware to coordinate, and a smart global load balancer that runs health checks and can identify when a failure domain is down (a toy model of that behavior follows below).

Okay, so with this premise, here is a little bit of the content of the white paper. You will find some definitions, including the CAP theorem, which determines a lot of what is possible in terms of behavior for distributed stateful workloads; then a description of the anatomy of most stateful applications; then a description of the consensus protocols that people can choose from; and finally some reference architectures.
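Here is the toy model of the smart global load balancer mentioned above: a minimal, illustrative Python sketch (the domain names are hypothetical) in which traffic is only steered to failure domains whose health checks pass, so the failover decision is autonomous rather than made by a human.

```python
class FailureDomain:
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def health_check(self) -> bool:
        # In reality this would be an HTTP probe with a timeout.
        return self.healthy

class GlobalLoadBalancer:
    def __init__(self, domains):
        self.domains = domains
        self.counter = 0

    def route(self) -> str:
        # Only failure domains that pass their health check get traffic.
        healthy = [d for d in self.domains if d.health_check()]
        if not healthy:
            raise RuntimeError("no healthy failure domain left")
        self.counter += 1
        return healthy[self.counter % len(healthy)].name

domains = [FailureDomain(n) for n in ("dc-east", "dc-west", "dc-central")]
glb = GlobalLoadBalancer(domains)
print([glb.route() for _ in range(3)])  # traffic spread across all three

domains[0].healthy = False              # dc-east suffers a disaster
print([glb.route() for _ in range(3)])  # traffic swings to the survivors
```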
Just as an example of some of this content, consider the anatomy of modern distributed stateful workloads. What I found out doing my research is that, if you look into them, they are all built on the concept of shards and the concept of replicas. Shards are used to achieve the ability to scale horizontally, indefinitely: shards can process requests in parallel, because they break the data space into multiple shards, or partitions. Replicas, instead, help achieve resilience, so that if a failure domain goes down, we have a replica of the shards somewhere else and we can still process requests. What's interesting to understand here is that there are protocols to keep the replicas in sync and the shards coordinated when that is needed. So, for example, if you're doing a software selection and trying to decide which storage product to pick, one of the lenses you can use is disaster recovery: looking at the protocols, and the consensus protocols, that a product uses to coordinate shards and replicas gives you an immediate feeling for what it can actually do. The paper summarizes what some of these modern stateful workloads use for that purpose (and a small sketch of the shard-plus-replica idea follows below).
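As a minimal sketch of that shard-plus-replica anatomy (ours, purely illustrative; the domain names and placement rule are hypothetical): keys are hashed into shards for parallelism, and each shard is placed on several failure domains for resilience.

```python
import hashlib

DOMAINS = ["dc-east", "dc-west", "dc-central"]  # hypothetical names
N_SHARDS = 8
N_REPLICAS = 2  # each shard lives in this many distinct domains

def shard_for(key: str) -> int:
    """Hash the key into one of N_SHARDS partitions."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % N_SHARDS

def placement(shard: int) -> list[str]:
    """Place a shard's replicas on consecutive failure domains."""
    return [DOMAINS[(shard + i) % len(DOMAINS)] for i in range(N_REPLICAS)]

for key in ("order-42", "user-7", "invoice-99"):
    s = shard_for(key)
    print(f"{key}: shard {s} -> replicas on {placement(s)}")

# If one domain goes down, every shard still has a replica elsewhere,
# so requests can continue; a consensus protocol would keep those
# replicas in sync, which is what the white paper surveys.
```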
Finally, we have some reference architectures; here I'm just showcasing one. It's a stateful workload deployed on Kubernetes and distributed across three data centers. We assume that this workload is strongly consistent, so when the workload comes up, all the instances can communicate with each other and decide how to distribute the replicas and the partitions. They can do that because there is an east-west communication pathway: that yellow arrow where it says "state sync". As you can see, the synchronization doesn't happen at the volume layer; it happens at the application layer. In front of that we may have a front-end application and some load balancers, but in front of everything there is a global load balancer that decides which failure domain your traffic should go to. That must be a smart one, so that if we lose, for example, one of those data centers, the global load balancer realizes it and starts steering traffic to the healthy ones. Okay, so that summarizes what is in the DR white paper.

Great, thanks, Raffaele. So I'm going to finish off with a very simple call to action. We're looking to continually expand the community, and we're looking for more and more input from end users who have problems or use cases they would like to discuss. We're also very keen to hear from projects who might be interested in joining the CNCF at sandbox or incubation level. On that note, we have lots of presentations from current and new projects covering everything related to storage; and just for the avoidance of doubt, storage isn't just volumes, it's also file systems, object stores, key-value stores, databases, message queues, and anything else that can persist data. As mentioned, we've produced a number of different documents; if you have expertise in this area, or if you're just an end user with an interesting use case, we'd love to hear from you, so please feel free to contribute to the drafts or to join the calls where we discuss these documents. We're also always looking for people to take additional roles in the TAG, so if you're interested in working more with the CNCF and would like a tech lead role in this function, or even eventually to join as a co-chair or join the TOC, working with the TAGs is a really good way of doing that; please consider taking a role within the community and helping contribute to the projects and to this information. We do this through a wide selection of individual contributors, vendors, customers, and users in the CNCF community, and it's always better when we can crowdsource the information from as many people as possible. So we'd love to hear from you; come join our calls. And with that, I'll hand over for a couple of minutes of questions. Please don't be shy; somebody stick your hand up and ask a question, or it will be very embarrassing. No questions? Okay, so you all get a couple of minutes back... unless there's one. Oh, fantastic.

I'm just curious, and perhaps there's some misunderstanding. My name is Wase; I'm from Red Hat. The cloud native disaster recovery white paper: is that meant to have native Kube disaster recovery built in, and to replace third-party disaster recovery for cloud native apps?

So it's a general concept that doesn't require Kubernetes, but in the paper we have some reference architectures, and there you would build it on Kubernetes. I think the most important concept is that we don't try to recover the clusters; we just worry about the applications running on the cluster, and so it's the application that needs to be able to do the things we've discussed. Did I answer the question?

I think so; I might come up afterwards.

I think we have a question here in the front.

I have a question regarding the Kubernetes storage attributes. CSI was stated as the protocol; is that because the first version is database-related, and thus related to block and file, or is there also something coming for COSI, the Container Object Storage Interface?

The question is about COSI, the Container Object Storage Interface, right? So that project right now is at alpha stage, so if you are interested in that project and want to see where it is going, we also need more contributors. We are actually waiting for more vendors who have object storage to write drivers; right now we just have one for Azure, but we would like to have more. We have bi-weekly meetings, and eventually, of course, the plan is to move from alpha to beta, but we are not there yet. There are still meetings going on, there are contributors, but we do need more.

Thank you. Is there work on data warehousing in a cloud native manner? MapReduce or machine learning?
Not currently, but that sounds like an interesting area to take up. Today we are focusing on the Data on Kubernetes white paper, which includes a number of different cloud native use cases, but with a strong focus on databases. We would love to hear more about things like data warehousing and machine learning, and those sorts of things, Apache Spark and Hadoop and so on, on Kubernetes too. We do actually have some sandbox projects working in that part of the ecosystem, and there should be recordings of those project presentations online; but if you are interested in working on that, or helping us work on it, we would love to hear from you.

I just wanted to add that we are working with the Data on Kubernetes community on the white paper I mentioned earlier. In the first version we are focusing on databases, but the Data on Kubernetes community also has other use cases; there are actually a lot of people in that community working on machine learning and data analytics workloads. So that's something we could maybe do in a v2 of the paper, over time.

So thank you all for joining our presentation, and we will be around if anybody has any more questions. Thank you.