Hi, my name is Sagi Volkov. I work for Red Hat as a storage performance instigator in the Cloud Storage and Data Services business unit. Welcome to my KubeCon session, "Whatever can go wrong, will go wrong: Rook, Ceph, and storage failures." On the agenda: I'll give a short storage introduction, explain a few resiliency terms that are commonly used in the storage industry, talk about Ceph as a storage provider and Rook as a storage orchestrator, or manager, of storage, and then talk about failures. There was supposed to be a live demo; it's going to be a recorded demo, and I'll leave time for questions at the end.

So, storage: we all need it, we all use it every day. It might be ephemeral or it might be persistent. The discussion today is only about persistent storage in Kubernetes.

Let's do a quick overview of storage. Up until about 10 years ago, what we had was storage arrays: giant bare-metal boxes with many drives in them, connected to bare-metal nodes or virtual machines, with not a lot of dynamism in the allocation of volumes. Then, about 10 years ago, software-defined storage, or SDS, started to become big. We started to see servers run software that provides storage to a cluster or set of servers in a way that is still stable, resilient, and redundant. Software-defined storage is probably the most common storage used in Kubernetes nowadays. You can use it in a converged manner, where the same nodes run both the software-defined storage and our application pods, or in a non-converged manner, where some nodes run the application pods and other nodes run only the software-defined storage pods.

Let's talk about a few terms that are widely used in the storage world. Of course, they are also used in a lot of processes inside and outside the technology world.

Availability is basically your uptime. In the storage domain you hear a lot about the number of nines a storage solution has, 99.99-something percent; the number of nines after the dot determines the availability of your storage. Six nines is equivalent to about 31 seconds of downtime in one year, five nines is about 5 minutes, four nines is about 1 hour, and three nines is about 9 hours of downtime in one year.

Durability is how good your storage is at keeping the data. The data might not be accessible all the time, but it is still intact. For example, you lost a data center; the data still resides, or you hope it still resides, on your storage solution, and when the data center comes back up, the data is still there.

Reliability and resiliency are the two terms we are going to focus on more here: the probability that your storage solution keeps working as designed, which is largely determined by its ability to heal itself. Again, you lost a storage device, or maybe a node with a few storage devices, or a data center, or a switch. It's the ability of the storage solution to heal itself and keep providing the data services it needs to provide.

In this diagram I'm trying to illustrate these terms with a couple of processes. Of course, we want availability and durability all the time, throughout all the processes. We have process Z that starts, and at the end of each run it inserts 10 rows into some sort of a database.
If the process ran 100 times, we're supposed to see 1,000 rows in the database at the end. If we see that, that's 100% availability; if we see 950, that's basically 95% availability. Of course, once we insert the data, we want it to be durable throughout; we always want to be able to query it. In the storage domain there's a term called a retention policy: maybe you only need the data for one night, maybe for one year, maybe just for 10 minutes. But for the duration of the retention policy you define for that piece of data, you need 100% durability.

Then, once process Z ends, process Y starts, reads the output of process Z, and runs some further processing. Throughout all this time, reliability and resiliency are a must. Again: you lost a node, you lost a device, you lost a switch; we want to be able to keep running process Z and process Y and complete them successfully. That's basically how reliable our storage solution is, and if we are able to complete these processes while losing some storage capability, that shows how good our resiliency is.

Let's continue with a couple more terms. MTTR, mean time to repair, is a very commonly used term that is directly connected to your resiliency. You had a problem, as we discussed: you lost a storage device, usually either a spinning drive or a flash device, or a whole storage array or switch, and MTTR is how quickly your storage solution actually repairs itself. In the software-defined storage world, you can sometimes choose between keeping two copies of your data versus three copies. For example, if you keep two copies and for some reason you lose devices so that only one valid copy remains, then until you actually repair and get back to 100%, back to two copies, you are exposing yourself to data unavailability: if you lose more devices, the only remaining copy may also become unavailable. So MTTR is very important for your resiliency.

In the SDS world we talked a little bit about converged infrastructure versus non-converged. Converged can be more of a concern in how you design and spread your data, compared to when you separate compute and storage nodes. Mass, of course, especially in SDS, is a big help: software-defined storage solutions like Ceph thrive on mass, on having a lot of nodes with a lot of devices. The more devices you have, the faster you can recover from the loss of a device.

Another term is MTBF, mean time between failures. Everything ultimately comes down to the quality of the devices you use to store your data, and this measurement is basically the time from the last failure to the next failure of a device or a storage system. For example, an enterprise-grade drive has an MTBF of about 800,000 hours, which is roughly 90 years. That sounds like a lot, but if your storage solution has 90 drives, it means you're going to have one drive failure roughly every year. If your storage solution, or data center, has 900 drives, it means you're going to have a failure in one of those 900 drives roughly every five weeks. So you need very good software above the devices to make sure you still have availability.
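As a quick back-of-the-envelope check of those numbers (my own arithmetic, not from the slides):

\[
\text{downtime per year} \approx (1 - A) \times 31{,}536{,}000\ \text{s},
\]

so five nines (\(A = 0.99999\)) gives about 315 s, roughly 5 minutes per year, and six nines gives about 31.5 s. For the drive example:

\[
\text{expected drive failures per year} \approx \frac{N_{\text{drives}} \times 8760\ \text{h}}{\text{MTBF}}
= \frac{90 \times 8760}{800{,}000} \approx 1,
\qquad
\frac{900 \times 8760}{800{,}000} \approx 9.9,
\]

which is about one failure per year with 90 drives and one every five weeks or so with 900 drives, matching the figures above.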
That's where things like keeping two or three copies of the data, erasure coding, and spreading the data across all the devices come into play.

So, the last couple of terms. RTO, recovery time objective: basically, how long can your process, your company, or your data center survive without access to data? Usually your RTO needs to be as short as possible. In enterprise organizations you usually define this by tiers: some applications cannot lose access for more than, say, five minutes or 30 seconds in a year; other applications are in lower tiers, and it's fine for them to lose access to the data for a day in a year.

RPO is your recovery point objective. Besides counting on your storage solution to maintain all the copies, run all the time, and provide adequate performance, you should always strive to also take backups outside of the storage solution, and this is what determines your RPO. The backups you have, in case you lose everything in your storage solution, determine your recovery point objective: if you're running a backup every hour, your RPO is one hour.

All right, so now let's continue and start to talk about Ceph and Rook. Ceph is a unified, software-defined storage system. It's been around for a long time, and it provides object, block, and file. File comes from CephFS, which in the Kubernetes world is the equivalent of RWX, a shareable file system. RBD is a pure block device, and RGW is the ability to interact with objects using S3, for example.

There are a few components in Ceph. The Ceph MON, or monitor, is basically the coordinator of all data and components in the cluster; the monitors use Paxos to maintain consensus among themselves. There is a minimum of three of these processes in a cluster, and you can go up to seven. The manager mainly concentrates on real-time metrics and other management functions for the Ceph cluster; there's one active, and you can have one active and one standby. The Ceph OSDs are very important processes: these are the processes attached to the devices that provide storage, an HDD, SSD, NVMe, any type of block device. In return, the Ceph cluster aggregates all these devices, creates pools out of them, and provides storage capabilities to its clients, in our case Kubernetes and PVCs, whether object, block, or file. A couple of other processes are not shown here on the presentation: RGW is what allows object storage access; it usually starts with one process, and if you need to scale up to, you know, millions of object accesses, you can add more. MDS is basically what provides access to CephFS; you start with two of these processes, one active and one in standby mode.

And now let's talk about Rook. Rook is basically a storage manager for Kubernetes; in our case we use it with Ceph. I like to call it the orchestrator of all things. As you can see on the slides, anything blue is more on the Rook level. We have the Rook-Ceph operator, a pod that runs on one of the nodes. We have discovery pods that run on any node we would like to provide storage, nodes that have block devices the Ceph cluster is going to use. You can configure these by labels and things like that.
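If you want to poke at these components on a running Rook-Ceph cluster yourself, something along these lines works. This is a minimal sketch: the rook-ceph namespace, the app labels, and the rook-ceph-tools toolbox deployment are Rook's defaults, assumed here rather than shown in the talk.

```bash
# List the Ceph daemon pods that the Rook operator manages
kubectl -n rook-ceph get pods -l app=rook-ceph-mon
kubectl -n rook-ceph get pods -l app=rook-ceph-mgr
kubectl -n rook-ceph get pods -l app=rook-ceph-osd

# Overall cluster health, as reported by the mons, via the toolbox pod
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
```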
At the Ceph CSI level there are basically the CSI attacher and plugin pods that exist on every node, whether for block (RBD) or for CephFS. And then you have the Ceph daemons that Rook is controlling. Each of these is basically a pod. You have mon pods, a minimum of three of them, so on the next node, which doesn't show up on this slide, we don't necessarily have a mon pod. OSD pods exist wherever we have storage that we consume on those nodes. We talked about the RGWs; we talked about the manager, of which there's one; and the MDS. There's also a pod here for mirroring RBD to a different Rook-Ceph cluster on a completely different Kubernetes cluster, and a crash collector in case things go bad and we need to collect all kinds of logs.

So let's talk about Rook and Ceph resilience in Kubernetes. As discussed previously, every Ceph process is basically a pod, so we want to understand what happens when a pod fails. Going back to all the terminology we used, of course this impacts our resiliency. The mons are very important processes; that's why we have three of them, spread over three nodes or more. The MDS, which again provides access to CephFS, has a minimum of two, one in active mode and one in standby mode. Note that neither the mons nor the MGR, the manager processes of Ceph, are actually in the data path when we access storage using RBD, for example.

We talked about RPOs, backups, and durability. If you have two data centers and two Kubernetes clusters, you can use Ceph's built-in replication, such as RBD mirroring, to mirror a PV in a certain pool from the Rook-Ceph cluster in Kubernetes cluster A to the Rook-Ceph cluster in Kubernetes cluster B. Ceph also offers the same capability for object storage. It is GA in the Ceph project; it's still not GA in Rook, but it will be in the next one or two releases.

So, let's talk about the demos I want to show. We're going to look at a few scenarios of OSD pod failures and how Rook and Ceph behave. What we're going to do is constantly create stress on the storage solution, in our case Rook and Ceph. One scenario will be a developer or an admin deleting an OSD pod by mistake: what happens? Then, what happens when you reboot a node that has OSD pods on it? Remember, the OSD is the process that consumes a block device on a node and is part of the whole Ceph cluster that then provides storage to our Kubernetes applications. We're also going to look at what happens when we lose one of the devices that an OSD pod is using.

This demo runs on AWS, on very small clusters: three masters, three nodes that run Rook and Ceph and provide the storage, and three nodes that run the applications that consume and stress the storage. I'm using a project named Sherlock, which I started a couple of months ago. It aims to check, test, and stress performance on all sorts of storage solutions in the Kubernetes domain, so you can run it not only on Rook and Ceph but on any other SDS provider. It also collects statistical information from the nodes that provide the storage and the nodes that consume it, so you can take a look at that as well.

Let's move into the demos and start the first one. I've split my terminal into four sections; hopefully it's clear.
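For reference, the commands behind those panes amount to roughly the following. This is a sketch of what the setup looks like, not the exact commands from the recording; the rook-ceph namespace, the rook-ceph-tools toolbox deployment, the mysql namespace, and the sysbench job name are assumptions.

```bash
# Top right: the Ceph OSD tree, refreshed continuously via the Rook toolbox
watch "kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree"

# Top left: only the OSD pods in the Rook-Ceph namespace
watch "kubectl -n rook-ceph get pods | grep osd"

# Another pane: follow the logs of one sysbench job
kubectl -n mysql logs -f job/<sysbench-job-name>

# And a pane dumping the placement groups so recovery activity is visible
watch "kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph pg dump"
```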
On the top-right section I have a Ceph command, ceph osd tree, constantly running under watch. It shows the Ceph cluster's OSD tree, as the command says. We can see that we have three nodes providing the storage, and on each node there are two OSDs that consume the two SSD devices on it and provide them back to the Ceph cluster. On the top left I have a kubectl command looking at the pods in the Rook-Ceph namespace; I'm only grepping for the OSD pods, since these are the ones we're interested in. As you can see, we have six OSDs on the top right and six OSD pods on the top left.

I have a project called mysql with 12 MySQL pods running on the three application nodes. Each MySQL database is using a 100-gigabyte PVC. I also have 12 sysbench jobs, sysbench pods, running against these 12 MySQL databases; this is all set up using the Sherlock GitHub project I mentioned earlier. I'm going to pick one of these MySQL sysbench jobs and constantly monitor its logs. As you can see, it's typical sysbench output: every 10 seconds it shows the number of transactions per second, latency, and so on.

What I'm going to do is pick one of the OSD pods and delete it; in this case I'm going to pick OSD 3. What's interesting to monitor is, on the top left, what happens to the pod when I kill it and whether a new one comes up, and on the top right, since we're looking at OSD 3 (that's OSD 3 right here), what happens to the status of that OSD: does it stay up, and for how long? There's another pane here with a Ceph command that dumps all the PGs, or placement groups, that the Ceph cluster has. As you can see, we currently have 81 PGs and they're all active and clean. If Ceph sees that something is wrong with one of the OSDs, maybe one of them needs to recover, you're going to see that in this dump command.

Let's delete the pod and see what happens. The pod is deleted; you can see it's terminating. You can see that a new OSD 3 pod has already started and is up and running. OSD 3 was marked here as down for only one or two seconds. As you can see from the PG dump, Ceph is making sure that the new OSD pod, which is using the same device as before, has all the right placement groups, and all the PGs are clean. Everything went back to a normal state in terms of how Ceph behaves. You can also see in the sysbench output that right here there was a small drop in the IOPS this sysbench job was driving, and that is, of course, acceptable: there was some I/O that Ceph was doing internally, and this database might have been reading from placement groups that were on the OSD I deleted, so for a brief second it had to get those PGs from a different OSD. It's the job of the Ceph mons to tell all the clients to fetch those PGs from a different location. And that's the end of the first demo.

All right, so let's do our second demo, same setup. What I'm going to do now is actually reboot one of the nodes running Rook and Ceph, which means, since each node has two OSDs and we have a total of six OSDs, that we are taking down a third of our storage for a brief time.
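For completeness, the two failure injections in these demos boil down to commands along these lines. This is a sketch; the pod and node names are placeholders, the namespace is Rook's default, and the reboot method (SSH) is an assumption, since the talk doesn't show how the node is rebooted.

```bash
# Demo 1: delete one OSD pod "by mistake" and let the Rook operator recreate it
kubectl -n rook-ceph delete pod <rook-ceph-osd-3-pod-name>

# Demo 2: reboot a storage node, for example over SSH, and watch Ceph re-peer
ssh <storage-node> sudo reboot

# Keep an eye on node status while the node goes down and comes back
watch kubectl get nodes
```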
I'm going to pick this node, start monitoring one of the sysbench pods, number eight, reboot the node, and, on the bottom left, watch the node status. Since this is AWS, things reboot so fast that I'm not sure how much of it kubectl is actually going to capture as the node goes down, but as you can see on the top right, Ceph definitely starts to see some PGs that are not available. It marks the OSDs as down and starts working out where it needs to move the PGs that were primary on that node. On the top left you can see two new OSD pods starting, and once they are up and running, on the top right we're going to see OSDs one and four marked as up again; the peering process starts in Ceph, and it moves whatever PGs it needs back onto those new OSDs that just came up. Again, this is a third of the storage, and as you can see in the sysbench logs, there was a brief period with no transactions; transactions were basically paused or had higher latencies, but now everything continues.

So these were the two demos. Replacing devices for OSDs is basically the same as the last demo I showed: Ceph will make the new devices part of the OSDs and then move, or re-spread, the placement groups across those devices as well. I will now answer any questions you might have. Thanks for attending my session.