Okay, hello and welcome. It's amazing to see so many people here looking to find out a bit more about cloud native storage and the CNCF Storage TAG. It's going to be a fairly packed agenda, so there's a lot to fit into the 30 minutes. We hope to have some time for questions at the end, but we can always take questions outside in the hallway. What we're going to talk about today is an overview of the TAG and what the TAG does in the CNCF, a quick overview of cloud native storage, and then we wanted to share some of the white papers and content that the TAG and its different members have been producing. A little bit about the TAG: the CNCF TAGs are Technical Advisory Groups. We work with the TOC, the Technical Oversight Committee, which is the group that makes the technical decisions for the CNCF and works with the projects through their maturity cycles. We are an open community and we'd love to have you at our calls. We meet twice a month, on the second and fourth Wednesday of the month. You can find the links to our repo on the CNCF GitHub, and all of our calls and membership are open, so turn up, participate, learn, and so on. So who are we? It's a wide variety of people, everything from leaders in the space to people who are just experimenting: end users, project maintainers, vendors, and independent contributors. Overall, the important thing is that we are an open community where you can learn, get advice, contribute to the cloud native ecosystem, and work with the projects in the CNCF. There are a number of co-chairs and tech leads and a number of other individual contributors who work on the different initiatives we have. Obviously, feel free to reach out.
We're on the CNCF Slack and we've got our mailing list as well. I keep getting lots of questions like: okay, so what is the TAG, and why does the CNCF have TAGs? Ultimately, the TOC works with projects, and its mission is to make cloud native computing ubiquitous. Part of what we do is help the TOC scale. What do we mean by scale? Over the years the number of projects has skyrocketed; as you can see, there are something like a hundred and eighty different projects today, covering many different parts of the lifecycle, all the way from sandbox to graduated. So as part of the TAG, we do four main things. We educate, by creating white papers and information that allows end users to understand the ecosystem and the projects and how best to use cloud native storage in their environments. We help the TOC with the review of projects and the management of the due diligence and reviews. Of course, we work with the user community, especially when it comes to onboarding new projects and working with the community to understand new projects and new needs that come up, because the cloud native environment keeps evolving. And we're there to provide subject matter expertise, because the TOC can't be an expert in everything, and we're there to help. So when we talk about cloud native storage, why should you think about this? Let me come out and maybe say something a bit controversial: we don't think there is any such thing as a stateless architecture.
Every application needs to store state somewhere, whether it's a database, a key-value store, or some object storage. The point about cloud native storage is to say: look, we can use the same patterns that we've developed in the cloud native world, in the Kubernetes world, and apply the declarative, API-driven structure to storage, to enable the same amount of automation, the scale, the performance, the automated failover, the auto-healing capabilities, all the patterns that we've learned in Kubernetes. At this stage there's such a broad ecosystem, with CSI support, COSI support, and so on, to enable just about every type of database, every type of key-value store, every type of system or service to interoperate here, and we have lots of operators that manage the deployment, the automation, and the day-two operations of all of these platforms. So you might be asking yourself what types of projects there are. Some of the main projects are already graduated or incubating: projects like Rook, which is an operator for Ceph that provides block, file, and object storage; Vitess, which is a scale-out, clustered MySQL database for very large-scale deployments; Harbor, which is a scalable container registry; etcd, which of course you're all familiar with, because you use Kubernetes, so of course you are; and TiKV, which takes the key-value store and distributes and shards it to give extreme scale as well.
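To make the "distributes and shards it" idea concrete, here is a toy hash-sharded key-value store. This is only an illustration of the general pattern, not any project's actual design (TiKV, for instance, splits the keyspace into contiguous ranges rather than hashing), and every name below is made up:

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a stable hash (toy model)."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_shards


class ShardedKV:
    """A key-value store split across independent shards; in a real
    system each shard would live on a different node."""

    def __init__(self, num_shards: int = 4):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedKV()
for i in range(100):
    store.put(f"user:{i}", i)

# The keys end up spread across the shards, so capacity and
# throughput scale by adding shards rather than growing one node.
shard_sizes = [len(s) for s in store.shards]
```

Because the hash is stable, reads always land on the shard that took the write; real systems add replication per shard on top of this, which the DR section later touches on.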
And then we have a number of incubating projects, like Dragonfly, which provides acceleration for images; CubeFS, which is a scalable file system; and Longhorn, which is a distributed block storage solution that runs in Kubernetes. Last but not least, have a look at the list on the CNCF website for both the incubating projects and the sandbox projects. There are many projects, too many to list in a session this size, but there are lots of interesting initiatives and projects coming up that we're discussing in the TAG as well, and that are going through the process too. I often get asked what the difference is between sandbox, incubation, and graduation. Sandbox is the earliest stage of a project within the CNCF. It has a low barrier to entry, and it allows projects to join the CNCF, grow their community, build out their IP policy and licensing, and establish their maintainers and governance. That then moves on to incubation. Incubation is actually where most of the due diligence happens: this is where we make sure that the projects are used in production, we speak to end users and find out about real-life use cases, and there has to be a healthy number of committers and a healthy number of maintainers to get to that incubation stage. And then once you get to incubation, you can move to graduation. Graduation adds additional layers on top, so there will be things like security audits and code quality checks, and we make sure that there are multiple organizations maintaining the project. We talked about some of the things that the TAG has been working on, and we've been working on a number of different white papers. The first one that we started working on is the storage white paper; we're now at version two of the storage white paper.
We do a number of things in it. We define the attributes of a storage system and the various layers that make up the storage environment: how we get access to those interfaces, whether it's volumes or different APIs, but also the management interfaces and how it all integrates with the orchestration layer and with Kubernetes. Also, breaking news: we're working on version three of the white paper, so it would be lovely for anybody who's interested to participate and contribute more content into version three, because obviously there are new things happening all the time. A few things about the white paper. We talk about the storage attributes and why they are important. We define five attributes which cover a variety of different metrics related to the storage system, and they're there to allow you to understand what your application needs from the storage system. Different attributes involve different compromises with each other; for example, having very high performance might come at the expense of strong consistency. Those are the sorts of things that you need to understand for your application, so this gives you a way of guide-posting and matching the application's needs to the storage system. Of course, the storage system has many layers: starting from the physical layer, to the various data services that provide replication, to the data protection layers like erasure coding and replicas, to the topology, whether it's centralized, distributed, or hyperconverged. All of those things have an effect on each of those attributes. And why is that important?
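One of those layer trade-offs, replication versus erasure coding in the data-protection layer, is easy to put in numbers. A minimal sketch; the function names and the 8+3 layout are illustrative choices, not taken from the white paper:

```python
def usable_fraction_replication(replicas: int) -> float:
    """N-way replication stores every byte N times, so 1/N of the
    raw capacity is usable."""
    return 1.0 / replicas


def usable_fraction_erasure(data_chunks: int, parity_chunks: int) -> float:
    """k+m erasure coding stores k data chunks plus m parity chunks,
    so k/(k+m) of the raw capacity is usable."""
    return data_chunks / (data_chunks + parity_chunks)


# 3-way replication: ~33% of raw capacity is usable, but reads and
# writes are simple and rebuilds only copy whole replicas.
rep = usable_fraction_replication(3)

# 8+3 erasure coding: ~73% usable and it still survives 3 chunk
# losses, at the cost of encode/decode CPU and higher rebuild
# traffic, i.e. a latency trade-off against the capacity win.
ec = usable_fraction_erasure(8, 3)
```

The same style of back-of-the-envelope comparison works for the other attributes: pick the metric your application cares about, then check what each layer choice does to it.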
We'll see how these attributes play off each other for a few different use cases. For example, in a hyperconverged environment, where storage and compute share the same nodes and the same resources, we get a much better shared environment from a performance point of view, but obviously we now have fault and change management domains that are converged onto the same nodes. With block volumes, we get scalability benefits by being able to disaggregate the block storage from the actual application and compute; we get availability by being able to move storage from failed nodes and mount it on other nodes; and we typically get good performance, typically lower latency, but that obviously depends on the connectivity between the nodes. We have things like shared file systems, which allow you to scale by accessing the file system from multiple nodes at the same time, and this might be perfect, for example, for machine learning or data lakes. But obviously, when you're accessing the same data on the same file system at the same time, you get into the complexity of cache coherency and distributed locks and so on. And as with all of these things, we come back to the layers, because a lot of these systems are built on underlying storage systems. Very often you might have, for example, a file system that's built on an object store, so it has the sharing attributes of a file system but the latency of an object store, and understanding those layers is very important. Finally, we look at object stores, which have almost infinite capacity and throughput, because they allow you to scale in parallel across a lot of endpoints. But latency is higher, because you're accessing them through an API, and every API call is expensive; performance is typically not bounded by throughput, but by the metadata operations, and the requests per second become the limiting factor. So hopefully that gives you a little flavor of what goes into a storage system and some of the things that we covered in the white paper. Of course, have a look at the white paper; there's a lot more in its 70-odd pages. And from that we'll move on to the next white paper, which covers data on Kubernetes. Thanks, Alex. So, in the 2022 survey by the Data on Kubernetes community, more and more stateful workloads are moving to Kubernetes. There are different types of data workloads, as shown here: database workloads have the highest percentage, followed by data analytics, AI and machine learning, streaming and messaging, and CI/CD, and the underlying storage could be block, file, or object. Stateful workloads move to Kubernetes to take advantage of Kubernetes' self-healing ability, portability, scalability, and so on. We collaborated with the Data on Kubernetes community on a white paper to describe database patterns; the paper is complete and published. In the paper we described the patterns of running data on Kubernetes, we described the attributes of a storage system and how they affect running data on Kubernetes, and we compared running data inside versus outside of Kubernetes, along with some of the common patterns and features being used. The paper focuses on database patterns, but a lot of what we described there also applies to other types of workloads. A storage system has attributes, as described in the landscape white paper. In a cloud native environment, the kind of backing store used and the number of replicas all have an impact on the storage attributes such as availability, durability, and so on. We also added a couple of new storage attributes here: observability and elasticity. In a cloud native environment, there are typically a lot of microservices running in a distributed fashion, so when something happens, it's even harder to detect what is causing the problem. So it is even more important to have a comprehensive observability system built in, so that we can detect problems
early and prevent failures from happening. Elasticity refers to the ability to scale up and down quickly. This is the on-demand infrastructure, where you have the ability to release resources when they are no longer needed. This also relates to storage tiering, where you can move your data across different storage tiers depending on how often the data is accessed. Regarding disaster recovery, Rafael will talk about that later. We have the option to run data inside versus outside of Kubernetes. Deploying and managing databases manually, without proper automation, is not a recommended pattern, so we have mainly two alternatives: we can either use managed database services, which are provided by most cloud providers, or we can run data inside Kubernetes. Running data inside Kubernetes typically means leveraging an operator, which uses the Kubernetes declarative approach: the operator reconciles the desired state and the actual state. The operator can also automate data operations such as backup, restore, migration, upgrade, and so on, and it can also use other tools such as Prometheus, Grafana, cert-manager, and so on. So here we see operators managing different types of database clusters. A database cluster is typically defined by a custom resource. The custom resource describes what type of cluster the user wants, which is the desired state, and the operator reconciles the actual state of the database cluster against that desired state. In this example, we have a StatefulSet with three replicas. Each replica is a pod that uses a persistent volume. The persistent volume is provisioned by a CSI driver. CSI defines a set of common interfaces so that a storage vendor can write a driver and have their system consumed by containers running in Kubernetes. According to the DoK survey, an organization typically uses more than one operator, as shown here. On OperatorHub, there are more than 300 operators, and more than 40 of them are database operators, including operators for etcd and Vitess, two of the graduated CNCF projects. There are nine PostgreSQL operators, including CloudNativePG, an open source operator that manages PostgreSQL clusters running in a primary-standby architecture, and there are other operators not listed here. Although operators are used widely when running data in Kubernetes, there are a lot of challenges, including a lack of standards. That's why the DoK community is working on an operator feature matrix, trying to come up with standardized, vendor-neutral ways to describe those operators, so that it will be easier for a user to choose an operator. So there are common patterns and features used when running data on Kubernetes. We already discussed Kubernetes operators, CSI, and the workload APIs, and there's also topology-aware scheduling, which uses node labels. The label key is the topology key, and the scheduler can use that information to spread the pods across different failure domains. Together with topology-aware dynamic provisioning, you can also have your persistent volumes provisioned in the failure domains where your pod is scheduled. As I mentioned earlier, we have our first Data on Kubernetes white paper published, focusing on database patterns, and our next white paper will focus on AI and machine-learning workloads. In that next paper, we will describe the characteristics of AI and machine-learning workloads, how they impact data storage, and what changes have been happening in data storage to meet those different requirements.
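The topology-aware spreading just described can be sketched with a toy placement function. This is a round-robin over zone labels; the real Kubernetes scheduler scores nodes against topology spread constraints rather than round-robining, and all names here are made up:

```python
from collections import defaultdict
from itertools import cycle


def spread_replicas(replicas: int, zones: list) -> dict:
    """Assign database replicas round-robin across failure domains
    (toy model of topology-aware scheduling)."""
    placement = defaultdict(list)
    zone_iter = cycle(zones)
    for i in range(replicas):
        # Each replica's persistent volume would also be provisioned
        # in the zone its pod lands in (topology-aware provisioning).
        placement[next(zone_iter)].append(f"db-{i}")
    return dict(placement)


placement = spread_replicas(3, ["zone-a", "zone-b", "zone-c"])
# One replica per zone: losing any single zone still leaves two
# replicas, i.e. a majority, running.
```

The point of the pattern is visible even in the toy: with the replicas spread one-per-zone, a zone outage takes out at most one member of the database cluster.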
So stay tuned for our future updates. Now, let me hand it over to Alex to talk about the performance white paper. I nearly forgot about the order. So, another one of the white papers that we've been working on, which is just about to be finalized, is our performance white paper. As you can tell, we're delving into some of the different aspects of the storage system: we're talking about performance here, and Rafael will also talk about disaster recovery as well. What we wanted to do here is try to break down what is a fairly complex topic, defining some of the common concepts around performance and benchmarking for both volumes and databases. What became very apparent as we worked through the document was that in real life there are a lot of pitfalls and a lot of considerations. It's extremely hard to do proper apples-to-apples comparisons between different systems, but certainly there are a few things to be aware of. It's important to understand what your application needs, whether it's high operations or high throughput. If you're doing machine learning, for example, you definitely want high throughput; if you're running a transactional database, it's much more important to have a high level of operations. The way the storage system is implemented from a topology point of view, in terms of the number of replicas, or erasure coding, or encryption, or compression, really affects performance. And often, latency is probably the single biggest determinant of how much you're going to get out of your system. The other important thing is obviously to figure out the concurrency and the parallelism that you can get out of the storage system, in terms of both the number of clients and the number of queues and back ends that the storage system supports. But it's also really important to match the workload with the cache of the system, because caching happens at almost every layer in the stack that we described earlier. I've lost track of the number of times I've seen a report on Twitter or a blog saying, oh, we got five gigabytes a second out of this storage system, and then you look at the definition of the test, and maybe they were testing a one-gigabyte workload and the whole thing was running in memory; really, all you're benchmarking is how fast your cache is, rather than the speed of the storage system. So always be very critical, always question, and always be wary of results that look too good to be true. The important takeaway is: don't rely on published results. Always test your own applications on your own systems in your own environments, because there are too many variables, from your compute to your networking to your storage systems, and it's always important to run your own tests to be able to do apples-to-apples comparisons in your environment. And with that, I'll hand over to Rafael, who's going to talk about disaster recovery. Okay, the evergreen topic of disaster recovery. So yes, we have a white paper on that, where we try to imagine how to do disaster recovery in a cloud native world.
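Before we get into the DR details, the cache-skew pitfall Alex just described can be made concrete with a toy model. The linear hit-ratio assumption and all the latency numbers below are illustrative inventions, not measurements from the white paper:

```python
def effective_latency_us(working_set_gib: float, cache_gib: float,
                         cache_lat_us: float = 10.0,
                         media_lat_us: float = 500.0) -> float:
    """Average read latency under a crude model where the cache hit
    ratio is cache_size / working_set_size, capped at 1."""
    hit_ratio = min(1.0, cache_gib / working_set_gib)
    return hit_ratio * cache_lat_us + (1 - hit_ratio) * media_lat_us


# A 1 GiB test working set against a 4 GiB cache: every read hits
# the cache, so the "benchmark" only measures the cache.
in_cache = effective_latency_us(working_set_gib=1, cache_gib=4)

# A 100 GiB working set against the same cache: ~4% hits, and the
# average latency is dominated by the backing media.
realistic = effective_latency_us(working_set_gib=100, cache_gib=4)
```

The two numbers differ by well over an order of magnitude for the same storage system, which is exactly why a benchmark whose working set fits in memory tells you almost nothing about production behavior.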
But before we dive into that, let's take a look at this slide, in which we try to capture the archetypal approaches to disaster recovery in an IT organization. Things are going to be much more complicated than this slide, mostly for two reasons: first, you will have applications using different approaches, and second, you will have thousands or hundreds of thousands of applications, possibly depending on each other. So it's going to be very hard to understand the disaster recovery process of an individual IT organization. But to simplify this topic and make it tractable, here we try to identify these archetypal approaches. The idea is to understand what they are, and then what capabilities we need in our cloud native systems to support them. From left to right, we ordered them from least performant to most performant in terms of RPO and RTO, which are the two main KPIs used to measure how a disaster recovery process works. RTO is the length of the outage that you take during a disaster, and RPO is the measure of the data loss that you take, in terms of missed or lost transactions. Backup and restore doesn't need an explanation, but maybe just one comment: it's the least favored approach, and I see companies trying to move away from it as an approach for disaster recovery. However, you still need backup and restore for other data protection use cases, so we will need to have that capability anyway. Then the second one here is volume replication. This is still a storage-level capability, and you can replicate volumes either synchronously or asynchronously. Notice that to execute the disaster recovery procedure we still need a global load balancer in front of our applications, in every scenario really. Then we move to transaction replication. When we cross this imaginary line, it's the first time the responsibility for disaster recovery moves from the storage team to a middleware team or an application team. On the left side of the line, you can ask the storage team to take care of disaster recovery for the entire organization, because everybody's using storage; they can be the centralized team that takes care of disaster recovery. When you cross this line, you're starting to give the responsibility to the application teams, or maybe the middleware team. So the next one is transaction replication, where you have a system in master-slave, or primary-secondary, mode. It's still an active-passive setup, where only one side takes the writes, the transactions, but it has a way to replicate them; as you can see, you need a way to replicate the transactions from east to west, what I call east-west traffic. And if you imagine this running in Kubernetes, because in Kubernetes you have the SDN, you will need a way to create a tunnel between the SDNs of two Kubernetes clusters and allow that traffic. And then the last one is where you have a fully distributed stateful workload, where each instance is active and can take write transactions. For that you need to pick a specific middleware that can work in that way, so there is a new generation of middleware, and that's the requirement to get there.
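One way to see why the approaches are ordered left to right is a back-of-the-envelope RPO bound: with periodic backups the worst-case data loss is the backup interval, with asynchronous replication it's the replication lag, and with synchronous replication committed data isn't lost. A minimal sketch; the function name and the sample numbers are illustrative only:

```python
def worst_case_rpo_seconds(replication_lag_s, snapshot_interval_s=None):
    """Worst-case data-loss window (RPO). With continuous async
    replication the bound is the replication lag; with periodic
    snapshots/backups it is the interval between them, whichever
    is worse."""
    if snapshot_interval_s is not None:
        return max(replication_lag_s, snapshot_interval_s)
    return replication_lag_s


# Hourly backups only: up to an hour of transactions can be lost.
backup_based = worst_case_rpo_seconds(0.0, snapshot_interval_s=3600)

# Async volume or transaction replication with ~2.5 s of lag.
async_repl = worst_case_rpo_seconds(2.5)

# Synchronous replication (or quorum-committed active-active):
# committed writes exist on the surviving side, so RPO ~ 0.
sync_repl = worst_case_rpo_seconds(0.0)
```

RTO follows a similar gradient for different reasons: restore time for backups, failover time for replication, and close to zero for the fully active-active setup described next.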
That's the best-performing one, because a disaster for this setup really looks like an HA event, where you don't have to do anything: the system recovers by itself, it rebalances itself, and when you recover the data center that went down, the system rebalances back to the now-available data center. Okay, so when I show this kind of thing to my customers, one question that comes up is: okay, which one should we do, with these new Kubernetes setups, these cloud native setups that we have? For me, one of the things we should reason about is: why not try to support more than one approach? If we're building a platform where developers can come and deploy their applications, maybe we can give them a choice to pick one or more of these approaches, and as platform builders we just provide those capabilities. So if we do that, what are the capabilities that we need? Here you have the list of capabilities, but how can we implement them? I tried to collect here some of the projects that you can use to implement these capabilities. There are many more; for example, for backup and restore there are a lot of players in that space, and for volume replication there are several projects beyond what we have here. Global load balancing is a place where instead there is a little bit of friction. Everybody has a global load balancer, that's not the problem; but can we configure the global load balancer on the fly, based on what the developers are deploying in the Kubernetes clusters? That's a challenge that has not been resolved yet; there is only the k8gb operator trying to do that, and I hope that we will have more operators in the future to solve that problem. For east-west traffic, there are projects like Submariner that can create a network tunnel between your Kubernetes clusters, and there are some CNIs, like Calico and Cilium, that can do the tunnel natively. Then, for middleware that can do primary-secondary, or master-slave, replication: the incumbent databases can all do that, and even some of the database services that you get from the cloud providers. And then there's fully distributed middleware. This is a new generation of databases. They started as eventually consistent, NoSQL databases, which can all work that way. But then, I think starting with Cloud Spanner, a new generation of even SQL databases emerged: fully consistent databases that can work that way. I will point out CockroachDB in that space; YugabyteDB and TiDB are others. Okay, so if you want, you can build these capabilities into your setups. Now, to go back to the white paper and go over its contents: we tried to characterize the difference between traditional DR, as you find it in most enterprises, and what we call cloud native DR, which is the last column in the first slide, with an active-active deployment. We think that now, with cloud native approaches, where you have a higher level of automation, maybe you're using GitOps, and you have more control over the configuration, it's possible to set up these scenarios, which otherwise would be a little bit complicated, and possibly, it depends, but possibly not that expensive. Okay.
There's a narrative that as you go higher with the number of nines, the cost increases exponentially. I don't think that's true anymore in cloud native; I think it's worth taking a look and making a real analysis of that. And then we look at why these active-active stateful workloads can work, so we look at the anatomy of these workloads. They're all grounded in the CAP theorem, which describes the properties of distributed stateful workloads. Essentially, it says that if you want to build a workload that is capable of surviving a network partition, where there is no way of creating a quorum, then you have to choose between consistency and availability: your workload can be either available or consistent. Based on that, like I said, a lot of new middleware has been created, but if you look into them, they all look similar, because they all have the concept of partitions and the concept of replicas. You have replicas to make the workload highly available, and you have partitions to be able to scale almost indefinitely; those are the two properties of these new workloads. Then they have an API layer where they can expose SQL semantics, or some NoSQL semantics, or message-queuing semantics, but at the core they're all very similar. Okay, so we did an analysis of some of these workloads, just to understand what they use as their replica consensus protocol, and a lot of them now use Raft; Raft is the most popular at this point.
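The majority-quorum rule at the heart of Raft-style replica consensus can be sketched in a few lines. This is illustrative only; real Raft also involves leader election, log matching, and terms:

```python
def majority(n: int) -> int:
    """Quorum size for n replicas: more than half."""
    return n // 2 + 1


def can_commit(total_replicas: int, acks: int) -> bool:
    """A write commits once a majority of replicas acknowledge it,
    which guarantees any later quorum overlaps with this one."""
    return acks >= majority(total_replicas)


def survives_failures(total_replicas: int, failed: int) -> bool:
    """The replica group stays available while a majority is alive;
    a minority side of a network partition cannot commit."""
    return total_replicas - failed >= majority(total_replicas)


# 3 replicas tolerate 1 failure; 5 tolerate 2. This is why the
# multi-region deployments discussed next usually use 3+ regions.
```

The same rule is what makes an active-active region loss look like an HA event: the surviving majority keeps committing writes while the lost region is down.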
There was a time when Paxos was the protocol of choice. Then we also took a look at the shard consensus protocol, which you need when a request, a transaction, goes across shards. The shards are essentially independent databases, but they need to coordinate, and you can see here that essentially they all use the two-phase commit protocol to do that. Then we look at how we would build this kind of architecture if we use Kubernetes. The idea here is that we can set up multiple Kubernetes clusters, one per failure domain or one per cloud region, and then we will have one or more instances of these stateful workloads, and they have to be able to communicate with each other so that they can essentially form their own logical cluster. Then you can have applications writing to those instances, and they will see a logical database which is really distributed across multiple regions. When a region goes down, nothing happens: as long as the global load balancer can identify that the region is down, through some sort of health checks, it will start directing the traffic to the available regions, and the workload is able to continue serving the traffic. Okay, so that's it. Thank you for attending this talk. We're almost at time, but I'd like to take the final opportunity to encourage you to join and participate in our TAG, even if you just want to join the TAG calls to listen to presentations from other projects and find out about new projects which are up and coming. It's definitely worth participating. Please feel free to join; all the calls are open, and you don't need an invitation. You can also find all of us on the CNCF Slack and on the mailing lists. Thank you very much.