Good afternoon. Thanks for coming. This session is about availability and storage auto-scaling for stateful workloads running on Kubernetes. My name is Layla. I work at Shopify as an infrastructure engineer, on a team of 10 engineers called the search platform team, which develops and maintains the infrastructure that powers search at Shopify.

This is the agenda for this talk. In the next 20 minutes or so, we'll learn how the search infrastructure team at Shopify hosts a stateful workload, search, on top of Kubernetes. We'll talk about the obstacles to providing high availability and storage auto-scaling (and scaling in general) for stateful workloads that run on Kubernetes, and how my team addressed those challenges. I also have a pre-recorded demo at the end, and we'll conclude the presentation with a Q&A.

A little bit about Shopify for those who don't know us. Shopify is a cloud-based commerce platform that lets you start and manage a business: you can create and customize an online store and manage inventory, payments, and so on. Currently we have more than 3 million merchants selling with us, and they have collectively sold over 700 billion dollars in gross merchandise volume. To give a better sense of the scale: during the Black Friday Cyber Monday weekend of 2022 alone, a major high-volume commerce event that kicks off the holiday shopping season, our merchants sold over 7 billion dollars in GMV.

Search is a fundamental part of any commerce platform. It allows buyers to search and filter for the products they're looking for, and it allows merchants to fulfill the orders they have received. When you go to any online store and search for a certain product, your request goes to a search engine backed by a secondary data store, which is different from traditional databases. We call this secondary data store the search infrastructure. At Shopify, we use Elasticsearch to provide search and filtering services to Shopify Core, which powers all of our merchant shops and their storefronts, as well as to the development teams who need search services. We run Shopify's search on top of Kubernetes on Google Cloud Platform, and we deploy and maintain these Elasticsearch instances using a custom Kubernetes controller that we have built.

So let's catch up on some definitions. Stateful services, unlike stateless ones, rely on persistent data and respond to the same inputs in different ways depending on their history. By this definition, Elasticsearch is a stateful application that provides a stateful service, and therefore we deploy it using Kubernetes StatefulSets. Now let's see what StatefulSets are. A StatefulSet is basically a controller that manages pods based on an identical container spec while maintaining a sticky identity for each of its pods. These pods are created from the same spec but are not interchangeable: each has a persistent identifier and, more importantly, a persistent disk that it claims across any rescheduling. Again, to give a better sense of the scale: an Elasticsearch StatefulSet that stores search data for Shopify Core is composed of 120 pods, each with a four-tebibyte persistent disk. For high availability reasons, we have multiple Elasticsearch instances of this size deployed across the globe.
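To make that concrete, here is a minimal sketch of what such a StatefulSet could look like. The names, image, and sizes are hypothetical and heavily trimmed, not our production spec:

```sh
# Minimal StatefulSet sketch (hypothetical names, image, and sizes).
# Each pod gets a stable identity (es-data-0, es-data-1, ...) and its own
# PersistentVolumeClaim from volumeClaimTemplates, which it re-claims
# across any rescheduling.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: es-data
spec:
  serviceName: es-data
  replicas: 3
  selector:
    matchLabels:
      app: es-data
  template:
    metadata:
      labels:
        app: es-data
    spec:
      containers:
      - name: elasticsearch
        image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
        volumeMounts:
        - name: data
          mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:   # yields one PVC per pod: data-es-data-0, data-es-data-1, ...
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: es-pd-ssd   # hypothetical storage class; more on this later
      resources:
        requests:
          storage: 4Ti
EOF
```

If pod es-data-0 is deleted, the StatefulSet recreates it and re-attaches the same data-es-data-0 claim. That stickiness is exactly what makes both availability and storage scaling non-trivial, as we'll see next.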
Speaking of availability: with search being a critical service, we need to make sure it's always available, meaning that failures of pods, systems, or even natural disasters that lead to regional failures should have minimal impact on the service. We also need to make sure that the Elasticsearch StatefulSets always have enough storage space and are able to adapt to sudden changes in storage requirements.

The first question we want to answer here is whether deploying a stateful service with Kubernetes automatically guarantees fault tolerance and high availability. We know that the main step towards availability and fault tolerance is redundancy. One might think that since Kubernetes StatefulSets manage multiple pods based on an identical container spec, running a StatefulSet will automatically provide high availability by adding redundancy. However, this assumption is not true at all. Consider a StatefulSet with three pods. Fault tolerance means that the application is able to provide the same service even if one of these pods fails. But is that true for a StatefulSet? Depending on the application the StatefulSet is running, each persistent disk holds a different set of data. So if, for example, pod zero fails, pods one and two are still running, but they will not have access to pod zero's data set, and therefore they cannot provide the service that pod zero was providing before its failure. The service will not be available until pod zero is recovered by Kubernetes. In other words, although we have redundancy for the pods, we don't have any redundancy for the data stored on the disks. The lesson here is that Kubernetes does not provide data redundancy out of the box; it's up to the application to replicate the data.

Taking Elasticsearch as an example: the stored data is sharded by Elasticsearch, and each primary shard can have multiple replicas. Elasticsearch has a mechanism to distribute the primary and replica shards across the disks in such a way that if one of the disks fails, there is at least one other disk holding a copy of the data that is able to provide the same service.

Applications replicating the stored data improves availability, but we still need to be ready to survive regional failures. For that reason, we have implemented a pipeline using the Kafka streaming service that replicates an entire Elasticsearch StatefulSet between two geographical regions. We currently have more than 100 Elasticsearch instances living across the globe, collectively storing almost two petabytes of data.

As mentioned earlier, one of the reasons a stateful service becomes unavailable is a lack of storage space. For example, during high-load events such as the BFCM weekend, Black Friday Cyber Monday, the required storage space increases and the disks can fill up rapidly. We could pre-provision extra storage to avoid running out, but we would have to scale back down after the high-traffic event; otherwise we would be paying for resources that we're not actually using. And we will see later that scaling down stateful systems is not a trivial task. It is also possible that as time passes your requirements change, you no longer need the amount of storage you provisioned in the beginning, and you need to scale back down. For both situations, we need to make our storage scalable so it can adapt to changing requirements. This way, we only pay for the amount of storage that we actually need, and we also won't run out of it.
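Before we get to scaling: here is what that application-level replication looks like in practice, as a sketch with a hypothetical index name and endpoint (both settings are standard Elasticsearch APIs):

```sh
# One replica per primary shard: every shard gets a second copy on a
# different node, so losing one disk does not lose the data.
curl -X PUT "http://localhost:9200/products/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 1}}'

# Shard allocation awareness: keep a primary and its replica out of the
# same zone (this assumes each node is started with a node.attr.zone
# value), so even a zonal failure leaves a full copy of the data.
curl -X PUT "http://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.awareness.attributes": "zone"}}'
```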
This brings us to the next topic: storage auto-scaling and its implementation. Let's take a look at how my team implemented it. As I mentioned before, our Elasticsearch StatefulSets are deployed and maintained by a custom Kubernetes controller that we have built. This custom controller monitors Elasticsearch custom resources, their corresponding StatefulSets, and the other objects that together form an Elasticsearch instance. One of the many tasks of this controller is to make API calls to its managed Elasticsearch instances to get the current free storage space. Based on a defined heuristic, the controller decides if the disks need to be expanded or shrunk.

But how can you expand or shrink the disks of a StatefulSet that already has data? To update the disks of a StatefulSet, the first approach that comes to mind is simply running kubectl edit statefulset and changing the storage field to the amount you want. And you will see that Kubernetes does not allow you to do this. Here in this recorded video is my attempt to edit an existing StatefulSet; I'm editing the storage field, in this example from two to three tebibytes. And you see that Kubernetes rejects the change. The error shows that you're only allowed to update certain fields of a StatefulSet spec (replicas, template, and a few others), and storage is not one of them. So how can we make this work? Let's dive deep into the details of handling storage scaling.

I'll start with the case where we no longer need the storage we have and realize we have over-provisioned. Handling this scenario is challenging because Kubernetes does not have a feature to automatically support scaling disks down. The reason this feature does not exist is that shrinking disks by mistake will lead to data loss; it has to be implemented with care and handled on a case-by-case basis. For our use case, the custom controller, as mentioned before, calls the Elasticsearch API and therefore knows whether a disk scale-down is allowed; based on the actual storage usage, it triggers a scale-down only if the numbers allow it.

Remember, we cannot update the StatefulSet. So instead, the custom controller deletes the StatefulSet object with the cascade flag set to false. This is important: setting this flag to false keeps the pods and disks after the StatefulSet object is deleted. After the StatefulSet is deleted, the custom controller recreates it with the new, smaller size. All the other specs remain the same; only storage changes, from eight tebibytes to four, for example, since here we are scaling storage down. But replacing the StatefulSet object with a new one does not automatically shrink the disks. There are other steps the controller needs to take for the storage and the disks to reflect the change. The custom controller should first delete the volume claim object. Deleting it will put the volume claim into a Terminating phase, but it will not actually be deleted, because there is still a pod bound to it. So we want to force this termination.
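The full sequence, including the pod deletion I'll explain next, looks roughly like this. This is a sketch with hypothetical names; our controller drives the Kubernetes API directly rather than shelling out to kubectl:

```sh
# 0. The controller first asks Elasticsearch about actual disk usage
#    (a real endpoint) before deciding whether a scale-down is safe:
curl -s "http://localhost:9200/_cat/allocation?v&h=node,disk.used,disk.avail,disk.percent"

# 1. Delete the StatefulSet object but orphan its pods and PVCs.
#    (--cascade=orphan is the current spelling of the older --cascade=false.)
kubectl delete statefulset es-data --cascade=orphan

# 2. Recreate the StatefulSet from the same spec, with a smaller
#    volumeClaimTemplates request (e.g. 8Ti down to 4Ti).
kubectl apply -f es-data-4Ti.yaml

# 3. Delete the pod's PVC. It hangs in Terminating while a pod is bound to it.
kubectl delete pvc data-es-data-0 --wait=false

# 4. Delete the bound pod; now the claim (and disk) go away for real. The
#    StatefulSet recreates the pod, and a new, smaller PVC is created from
#    the updated template. Repeat for each ordinal, one disk at a time,
#    draining the data off every disk before touching it.
kubectl delete pod es-data-0
```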
To force this, the controller also needs to delete the pod that's bound to the terminating claim, and that will delete the pod and the disk for real. The reason we don't delete the pod first and the volume claim second is that if we do, the StatefulSet is faster than us: it will recreate the pod before we're able to delete the claim, and we won't be able to proceed. Now that the disk and the pod are deleted, the StatefulSet will do its job and recreate the pod and the disk, now with the new, smaller size. The controller has to do this one by one for every disk.

As you can see, disk scale-downs don't come ready out of the box at all, and they should be handled carefully to make sure no data loss happens. In our case, the custom controller drains every single disk before terminating it, to make sure no data will be wiped. You can imagine that doing this for the large clusters that we have is very time-consuming, and also error-prone, because there are many steps the controller needs to take to make this work. For example, one of the controller's calls to delete a pod might fail, the whole scale-down process might get stuck, and an engineer would need to intervene and solve the problem.

In the beginning, we followed the same process for disk scale-ups as well, meaning that we would replace the StatefulSet object, delete the claims and the pods, and go through all those steps in order for Kubernetes to create larger disks for us. Fortunately, we found a better solution based on a feature that Kubernetes provides: volume expansion. To make use of this feature, the disk must belong to a storage class that allows volume expansion. Let's look at a StatefulSet spec and its volume claim template. In the part that's marked red, we see that our volume claims are created based on a storage class called es-pd-ssd, and looking at the spec of that storage class, we see a field called allowVolumeExpansion that is set to true. So any disk created from this storage class has the ability to be expanded. Setting this field to true is a requirement; however, you need to take some additional steps to make expansion actually happen.

Now let's look at this case: we are trying to scale up the disks of a StatefulSet. To do this, the custom controller again needs to delete the StatefulSet object (remember, we cannot update StatefulSets), with the cascade flag of course set to false. Once the StatefulSet object is deleted, the custom controller recreates it with the new size. But there is a difference here: we no longer need to delete the volume claims. The controller just needs to update the volume claim objects, changing the storage from four to eight tebibytes in this case, for example. And the rest is taken care of by Kubernetes: it will automatically expand every disk without downtime. As you can see, this is a lot easier than the case where you had to delete every disk, drain it, and be extra careful.
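Sketched with the same hypothetical names as before:

```sh
# The storage class must permit expansion; this is the relevant field
# (provisioner shown for GKE's persistent disk CSI driver):
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: es-pd-ssd            # hypothetical class name
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
allowVolumeExpansion: true   # required for in-place disk expansion
EOF

# Replace the StatefulSet as before (delete with --cascade=orphan, recreate
# with the larger volumeClaimTemplates size). Then, instead of deleting each
# PVC, patch it to the new size and Kubernetes expands the disk in place:
kubectl patch pvc data-es-data-0 \
  -p '{"spec": {"resources": {"requests": {"storage": "8Ti"}}}}'
```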
Okay, I have a demo here; it's pre-recorded. We have a StatefulSet with three replicas, and we see that each pod has a volume claim of four tebibytes. On the right-hand side, I'm triggering a disk scale-up by editing a custom resource that we have built internally, called the Elasticsearch instance. I'm updating a field called data volume size that represents the size of the disks of an Elasticsearch StatefulSet. Doing this basically triggers a scale-up. And here I'm watching the disks as they get expanded by Kubernetes. We already see that the first one has been expanded to five tebibytes. I'll skip ahead a little bit: here we see that the second one has been expanded, and after a while the third one is expanded as well. You should expect each disk to be expanded in about a minute. For our case of a 120-disk cluster, it takes about five or six minutes to scale up the entire cluster, because we do this in batches and most of the scale-ups happen in parallel.

There are two ways to increase storage. You can expand the disks you already have, which is called scaling up and which we just covered, or you can add more disks, which is called scaling out. Adding more disks is a rather straightforward approach: no need to delete the StatefulSet or update the disks. You can simply update the replicas field of the StatefulSet, which we now know is allowed, and more pods and disks will be added. But this approach is not always cost-efficient: in addition to increasing the number of disks, you also add compute resources, which most of the time are not really needed when we're only interested in adding storage. To make the best decision, you should compare the cost of scaling out and scaling up and go with the approach that is the most cost-effective.
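A sketch of the scale-out path, with the same hypothetical StatefulSet; since replicas is one of the fields Kubernetes does let you change, no delete-and-recreate dance is needed:

```sh
# Scaling out: add a pod, and with it a new PVC from volumeClaimTemplates.
# Note this adds compute as well as storage, which is the cost trade-off
# discussed above.
kubectl scale statefulset es-data --replicas=4
```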
To summarize, we learned today that Kubernetes does not automatically provide availability and fault tolerance for stateful services; it's up to the application to implement data replication and provide availability. Also, storage auto-scaling can be automated using custom controllers. We saw that scaling down storage is not a trivial task, that Kubernetes does not support it, and that it should be implemented with care to avoid data loss. And lastly, we saw that scaling up storage is more straightforward, thanks to the Kubernetes volume expansion feature. Thank you for coming, and if you have any questions or feedback, I'll be here. Thank you.

I see a question, two questions actually. I think there are mics over there if you want to ask yours.

Hi, thank you very much for your talk. I was wondering if the custom controller is available open source, and whether you are using the official custom resource definition for Elasticsearch or you built your own as well.

It's an internal custom controller that we have built; it's not available externally, and the same goes for the custom resource.

Thank you for your presentation. My question is: how do you handle collisions? For instance, you're in the middle of downsizing the volumes, say at node number 20, and then there is another event that says you need to expand the volume. What's the logic behind that?

We have built some locks into the custom controller. And since it's the controller that decides whether the disks should be expanded or shrunk, this collision basically cannot happen, because it's the controller that makes that decision.

So if the controller thinks the disks should be expanded, it won't continue to contract them, right?

Right. The custom controller first looks at the actual usage; say we're talking about the case where it needs more storage. It comes up with a number that would be a good new size for the disks and triggers the scale-up, and we saw the process that follows. The number the controller picks will not leave the disks over-provisioned, so it's not the case that on another loop the same controller decides to scale the disks back down. These two operations will not compete. Thank you.

I also have a question. You said there are replica shards that basically make sure you don't lose the data. Is there any mechanism where you first downscale the replica shards, or first the primaries? Because you have to make sure the data stays intact, so is there any technique or ordering that you apply?

That's a really good question. This is actually handled by Elasticsearch. Elasticsearch allows you to spread or distribute your shards across different availability zones, so the primary and the replica of the same shard don't end up in the same availability zone. And we built our custom controller so that it only touches one availability zone at a time. So it's never the case that you are expanding or shrinking disks that hold both the primary and the replica of the same shard at the same time.

Is there a difference if you first touch the primary shard or the replica shard?

No, there is no difference.

Okay, thank you.

I might also have a question. Thanks for the talk, first of all. We are also using Elasticsearch, and what we have noticed is that if you update an Elasticsearch instance, with a rolling upgrade for example, Elasticsearch needs a moment to redeploy its replica shards, and the cluster is in a yellow state for the time being. How did you take care of that? You delete the StatefulSet, and then you build a new one, but have you ever had problems with how time-consuming it is while the replica shards are being replicated and rebuilt?

If it takes too long: well, as I mentioned in the earlier slides, we have the same Elasticsearch dataset in different regions, so we have a mechanism for failing over traffic. If we see that this is taking too long and the Elasticsearch cluster has been stuck in the yellow phase, we just fail traffic over to a region that is not under this maintenance yet, just to be extra careful. But it has never led to data loss, that I can say for sure. How fast it happens really depends.

If you're in different availability zones, that's fine, because you can simply switch over to another availability zone and be done with it. We can't do that, unfortunately, because of data protection and things like that. But yeah, it's an interesting idea. Thank you very much.

Thank you. All right, so you mentioned for the scale-out that it's relatively easy to just add replicas to the StatefulSet. When you scale back in, do you have to do any additional steps, or do you just rely on Elasticsearch to redistribute everything when you take parts away from it?

I don't think you can rely on Elasticsearch alone.
When you scale down a StatefulSet, Kubernetes has no idea what's going on inside the pods, so it just starts removing pods from the highest index number down, the ones that were created last. So unless you have set your volume claim policy to retain, you will lose data if you just decide to scale down. We allow scaling in through our custom controller, and the way we do that, when a request comes in for some pods to be removed, is deliberately slow: the custom controller again drains every disk, one by one, and makes sure that the disk about to be deleted is actually empty. Sometimes it gets stuck, because these scale-ins are not fully automated; we don't allow them to be triggered automatically. So if a team requests their Elasticsearch StatefulSet to be scaled in by mistake, say they think two pods are extra when actually only one is, the custom controller will not allow it. It does some calculations, looks at how much free storage there is, and checks whether the scale-in is really allowed. We have this protection; otherwise you would lose the data of the disk that is deleted.

Okay, so just to be clear: you still go through the custom controller even when you're scaling out, even though you could just edit the StatefulSet? That part wasn't clear.

Yes, we also go through the custom controller for that.

Thank you.

Hi, thanks for the talk. A question about the PVs: you're deleting them, right? The PVs, not just the claims.

Yeah, they are the actual storage, right.

And so how do you re-sync the data? Do you rely on Elasticsearch, or do you have some other mechanism to achieve this?

Re-sync the data after... can you repeat that?

You delete the PV and then you bring a new one in which has a lower disk size. What do you do then, how do you re-sync the data? Is it something custom, or I suppose you rely on Elasticsearch for this?

Yeah, that's true, we rely on Elasticsearch. Elasticsearch has a mechanism for rebalancing shards so that almost every disk holds about the same amount of data. Obviously it's not always possible to have the disks perfectly even, but once an empty disk comes in, Elasticsearch triggers what we call shard relocation and moves some of the data onto the newly created disk.

Okay, and how long does that take: hours, or about an hour?

It's on the scale of hours.

Okay, thank you.

Yeah, thank you. Okay, I guess that's it. Thank you.