Hi all, I'm Abhinu H. Chen, and today I'm going to cover a topic called choosing the right storage for Kubernetes. Let me go into full-screen mode. Okay, so this is about me: I'm an author, blogger and speaker, I'm an open source contributor, and I work as a distinguished member of technical staff at Wipro 5GT.

Let's see the agenda. I'm going to cover what stateful and stateless workloads are. I will also do a quick recap of Kubernetes storage concepts, followed by the various options that exist for cloud native storage, then the most important part, choosing the right storage, and finally some lessons learned and key takeaways. Okay, so let's get started.

Before we get into stateful and stateless workloads, it's important to understand that Kubernetes was originally created for stateless workloads, but over time it has been realized that stateful workloads also need to be hosted in Kubernetes. What is a stateless workload? It is an application that is not affected by pod restarts. A pod, by its nature, can restart and come back with a different IP address, and stateless workloads are not affected by those restarts. A stateful workload, on the other hand, needs to keep its state across pod restarts. Take a database: we don't want all of our data to be gone when the database pod restarts. That is an example of a stateful workload, and to support it we need persistent storage. Kubernetes also has a concept called StatefulSets. I just want to highlight that a StatefulSet is not the same thing as a stateful application; those are two different things.

Some of you might know these storage concepts, but for those who are not aware of them, here is a quick recap of Kubernetes storage concepts.
I'm not going to go deep here. It will be at a very high level, just enough that whatever we go through next makes sense. We will see what a volume is, what a persistent volume (PV) is, what a persistent volume claim (PVC) is, what a storage class is and what hostPath is.

First, volume. A volume is a directory. Volume is not a new word in Kubernetes; it has been used before. There are two types of volume: one is the ephemeral volume and the other is the persistent volume. An ephemeral volume has the lifetime of a pod, so if you restart the pod, whatever was in it is gone. Persistent volumes I'll cover on the next slide. Kubernetes supports many volume types. The quick list on this slide includes AWS EBS, Azure Disk, Azure File and emptyDir. Some of these, like emptyDir and hostPath, are built in and you can use them right away. For others, let's say Azure Disk, you will need an Azure subscription and access to the Azure cloud. So that is volume.

Let's go to persistent volume. A persistent volume, as the name suggests, is again a piece of storage, and it is persistent. It is like an EBS volume in AWS, where the data you store in the volume does not go away when your EC2 instance restarts. It's a similar concept here: if you attach a persistent volume to a pod and the pod restarts or even terminates, the data does not go away.

Now, how do we create a persistent volume? A quick YAML file is given here. It's very simple: you give a storage class name, how much storage you want and the access mode of the PV. Different access modes are supported: ReadWriteOnce, ReadWriteMany and read-only (ReadOnlyMany). Whether the volume is of hostPath type or a different type is again one of the options.
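The YAML from the slide is not captured in this transcript. A minimal sketch of a PV manifest with the fields just described might look like the following. The name demo-pv, the storage class name manual, the 50Gi capacity and the /mnt/data path are illustrative assumptions, not the values shown on the slide.

```yaml
# Sketch of a statically created PersistentVolume (all values are assumed for illustration).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: demo-pv                 # hypothetical name
spec:
  storageClassName: manual      # hypothetical class name; a PVC must request the same one to bind
  capacity:
    storage: 50Gi               # how much storage this PV offers
  accessModes:
    - ReadWriteOnce             # alternatives include ReadWriteMany and ReadOnlyMany
  hostPath:
    path: /mnt/data             # node-local directory; suitable for test setups only
```

A cloud-backed PV would replace the hostPath section with the volume source for that provider.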
So this is a very simple YAML file for creating a PV in Kubernetes. A persistent volume can be static or dynamic. With a dynamic persistent volume we don't need to create the PV ourselves; the storage provisioner takes care of provisioning it.

Now, coming to the persistent volume claim. A PVC is the way Kubernetes workloads use storage. If I've created a PV of, let's say, 50 GB, I cannot attach that PV to a pod directly; the PV is not visible to the pod. A PV is consumed through a PVC. So when a pod needs storage, it has to reference a PVC, and for that we need to create one. How do we create a PVC? As in the sample given here, we give the PVC name, the storage class name, the access mode and how much storage the PVC requires.

PVC-to-PV binding is one-to-one, and this is very important to understand because it is a major limitation. Let's say I have a PV of 100 GB and I create a PVC of 5 GB. If the access mode and storage class name match, that PV is going to be bound to the PVC. Since we asked for only 5 GB, the other 95 GB is effectively wasted, because no other PVC can use the remaining capacity. That's what one-to-one mapping means. It causes a lot of wasted storage, because in the beginning you might not know your exact storage requirement and you don't want to run out of storage. So you typically create a large enough PV and bind it, and if it is not fully utilized, your storage is not being used efficiently. So that is PVC.

Coming to storage class: a storage class is like a profile. Kubernetes supports different kinds of storage, and applications also require different kinds of storage. Some applications require very high-performing storage, and some have no need for very high-performing storage.
In that case, we don't need to give high-performing storage to every kind of workload. We can create categories or profiles of storage, and those are called storage classes. We can create one storage class for, let's say, high-performing storage, another for low-performing storage and so on. These are like tiers.

How does a storage class help? It is useful for dynamic PV provisioning. One example YAML file is given here. What are we doing in it? We are creating a storage class called CNCF EBS storage class. It is an EBS-specific storage class, and we are giving it a provisioner. Just three parameters are given here, but there are several options, and these options are not consistent: different cloud providers and different vendors expose different parameters, so you need to find the right parameters for your volume type. Here we are saying that the storage class is going to use type io1 and will guarantee 10 IOPS per GB.

What happens when you use this storage class in a PVC is that you don't need to create a PV. Take the same example where my PVC is 5 GB, and I did not create any PV. When the PVC is created, because this storage class is mentioned in it, a 5 GB PV is provisioned automatically by this provisioner. What is the benefit? You are not wasting storage. It is still a one-to-one mapping, but the headache of creating and maintaining PVs is taken away from you. Some volume types and providers also offer regional replication capabilities; GCP, for example, does. And as I've already mentioned, different vendors support different options and features.

Now there is a concept called hostPath. What is hostPath?
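Before moving on, the storage class and dynamic provisioning flow described above can be sketched as a pair of manifests. This is an approximation, not the slide's file: the talk only names the io1 type and 10 IOPS per GB, so the object names, the in-tree kubernetes.io/aws-ebs provisioner and the fsType parameter are assumptions.

```yaml
# Sketch of an EBS-backed StorageClass (names and provisioner are assumed for illustration).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cncf-ebs-storage-class      # hypothetical rendering of the name mentioned in the talk
provisioner: kubernetes.io/aws-ebs  # assumption: in-tree EBS provisioner; CSI drivers use a different name
parameters:
  type: io1                         # volume type named in the talk
  iopsPerGB: "10"                   # IOPS per GB named in the talk; parameter values are strings
  fsType: ext4                      # assumption: the third parameter is not stated in the talk
---
# A 5Gi claim that references the class above; no PV is created by hand.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc                    # hypothetical name
spec:
  storageClassName: cncf-ebs-storage-class
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi                  # the provisioner creates a matching 5Gi PV
```

Once the claim is created, the provisioner creates a PV of the requested size and binds it, so no capacity is left stranded the way it is with an oversized static PV.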
hostPath is a way to use storage that is locally present on the server. In this example YAML file, the CNCF demo PV, I'm specifying hostPath in the persistent volume, with the path /mnt/data. This storage comes from my local node: whichever worker node the volume lands on, that node's storage will be used.

The limitation of hostPath is that the storage is local. It is not replicated from one worker node to another. So if that worker node goes down, even though you are using a persistent volume, you lose access to your data because the underlying storage itself is unavailable. That is one major limitation. The second is that hostPath is not recommended for production, because it carries severe security risks. In a sandbox or non-production environment it's perfectly all right to use hostPath, but in production it's a big no-no; it's not at all recommended.

Now let's move on to the next part, Kubernetes storage options. So far we have seen why we need storage and how we give storage to Kubernetes pods: we give it in terms of PVCs. Whichever pod wants persistent storage has to have a PVC mentioned in its YAML file, and that PVC maps to a PV. There is also the storage class option that we can use for dynamic provisioning. And if you want to use local storage, you can use hostPath.

Let's see the storage options. This screenshot is taken from the CNCF landscape. As of 25th of July, this is the landscape of cloud native storage, and as you can see there are some 56 different options. Most of these logos belong to the companies that provide these solutions, and some projects are owned by the Linux Foundation, like Rook.
Of those 56, these six are CNCF projects; you can see them here. Out of these, OpenEBS is used in production by CNCF itself. That is not the only place it is used, but CNCF does use it in production. And if you're aware of the concept of a graduated project, Rook is the only one of those 56 that is CNCF graduated.

Now, a little detail about these options. First, they are not all the same. It's not like the cloud provider situation, where a virtual machine from Azure is called an Azure virtual machine and one from AWS is called EC2, but they are essentially the same thing: a virtual machine with some operating system, storage and so on. Here it is not like that. One example is MinIO, this one. MinIO is only for object storage. As you know, there are three types of storage: object storage, like S3; block storage; and file system-based storage, which includes NFS. So one difference between these products is that they cater to specific types of storage. MinIO caters only to object storage. Some, like Ceph, cover all three storage types. So that is one difference: which kinds of storage are supported.

The second difference, which I think can be understood from this slide, is that there are three types of cloud native or Kubernetes storage. The first is a traditional solution with a CSI plug-in. Take NetApp. NetApp has been in the storage world for a long time, and to support container workloads and Kubernetes they have created a CSI plug-in.
CSI is the Container Storage Interface, an industry standard that every vendor has to follow, because otherwise there would be no common way to say that things are working. So a CSI plug-in is one way, and out of those 56, some of the cloud native storage options are basically CSI plug-ins.

The second category is software-defined storage with container optimization. These solutions don't rely on a plug-in. The solution as a whole works for containers too, and the vendors say they have optimized it for containers. The third category is cloud native solutions: products that were built from the ground up for Kubernetes. So those are the three categories.

Now, the most important part: as a solution architect for Kubernetes, I need to decide which storage is the right storage for me. How do we decide that? Choosing the right storage option is the next item on our agenda. On this slide I have listed 20 such decision criteria. You are free to add more or to skip any that are not relevant to you, but these more or less cover all the possibilities. Let's go through each one of them, because this is the most important part.

Whenever you're choosing a storage solution for Kubernetes, you might end up in a situation where you have different storage for virtual machines, which is the non-container world, and for the container world. The question is whether you want two different solutions or one. Ideally you will prefer one solution, so that you have less management overhead and many challenges go away.
You don't have to deal with two different vendors or manage two different ways of maintaining storage for different types of workload. So the first question to ask when deciding on a storage solution is whether it supports both virtual machines and containers.

Second, as I mentioned before, there are three types of storage: block, object and file system. So the second question is whether the given solution supports all three, or only one or two of them. I can't give you an exact number, but most of the solutions do not support all three.

Third is whether enterprise support is available. This is a very common question from an open source software perspective. When you are running open source software in production, you want to be sure that enterprise support is available from some vendor, and that they are there to help you 24 by 7 with all kinds of server issues.

Number four is whether it supports the major cloud providers. These storage options can be used on-premises, but if you run your Kubernetes in AWS, say on EKS, you want to know whether the solution will work with EKS and its volumes, and similarly whether it will work with Azure Disk. So whether it supports the major cloud providers, or at least the cloud provider that is in use in your organization, is the fourth criterion.

Fifth is whether it provides any kind of replication feature. Why do you need replication in Kubernetes? You need it for high availability and disaster recovery. Most solutions support async replication. Some also support sync replication, so that there is no data loss.
Based on your requirement, you can decide which solution fits. If you need sync replication, that could be one of your major criteria. So six and seven are the types of replication supported by the various products.

Eighth is whether it provides deduplication and compression. This is a normal software-defined storage feature, and we need it.

Next is ease of deployment. Some of the solutions are very difficult to implement; everything is done through command-line options, and that is quite hard. When you want to manage your storage at scale, ease of deployment matters.

Many companies also look for a workload migration feature. What I mean by workload migration is, first of all, migration across Kubernetes clusters. Your first Kubernetes cluster could be running on-premises or in one cloud, and you want to migrate from cloud one to cloud two, and you want to do this at the storage level. You might create a snapshot, copy that snapshot and restore it; that is one option. Or you might want sync replication, where the application data is always copied, so that when something goes wrong on this side you have exactly the same copy available on the other side and you can fail over to it. So the question is whether that workload migration feature is there or not.

Number 11 is whether it supports high performance. High performance might not be a key criterion for everyone. But let's say you are hosting some telco-grade workload, say VNF hosting. There your storage performance, network throughput, everything matters. In that case you want to be sure that your performance requirements are met.
The next point is a point-in-time snapshot feature. You might want to take backups of your application from time to time, so you want to ensure that a snapshot or some other backup mechanism is available. Next is QoS guarantee, quality of service guarantee: whether something like a guaranteed number of IOPS is offered or not. Encryption is again very important from a security perspective: whether it supports encryption at rest and encryption in transit. And for corruption issues, whether it supports checksum-based error detection.

Then how does scalability work? That's another point. When you build this kind of storage solution, you will be using multiple storage nodes. Let's say you started with three storage nodes of 50 TB each, so 150 TB of storage. When you are running out of it, how do you scale? Does adding one more node take care of rebalancing and everything, and do you get the same performance after that or not? That is scaling with nodes.

Next is whether there is any infra lock-in. Sometimes a storage solution works only with the vendor's own hardware or with certain hardware. That is lock-in. So whether there is infra or hardware lock-in is again a point you should consider. Then, is there any backup and recovery feature? And the last one: whether it supports thin provisioning.

As you can see, some of these points are related to security, some to software-defined storage, some to high availability and disaster recovery, and some to key features of the type of storage or type of workload. So you have to evaluate from several different perspectives. How do you really evaluate that?
You can create a table where you list all these 20 parameters, or maybe more, maybe fewer. Then you choose certain options, let's say five. Say you want to evaluate Rook and four more. You put all five in the same table and start exploring whether solution one meets point number one, point number two, point number three and so on. Based on this entire exercise, you will get to know which one is the right solution. I did not mention the licensing and subscription cost aspect, but that is also important.

Now some of the lessons learned. There are too many options, and if you start exploring all 56 it will be a very long exercise, because you cannot do it purely theoretically by reading the literature that is available in the public domain. You have to do some POCs, and you cannot do POCs of 56 different solutions. So the first recommendation is to decide at a theoretical level which five are most closely aligned to your strategy, and start from there. Don't start with all 56.

Number two, a very important lesson: some of the limitations will not be known beforehand. In much of the publicly available documentation you will not see a heading called known limitations, or a list of things that do not work. You will find out once you start using the product. There could be multiple scenarios. One is that a feature is mentioned as available but is actually not there yet and is a work in progress. Or it is there but does not work because of known issues. Those kinds of things you will get to know only when you do the POC.
So once you have shortlisted, let's say, five of them, do a quick POC in a test environment and try out the various features: whether I'm able to take backups, whether it supports async replication and so on. You can do those small POCs.

As I mentioned before, subscription and licensing of add-ons should also be considered right at the beginning. What do I mean by licensing of add-ons? In most of these products, the core functionality is mostly free. Core functionality means the storage fabric part. So you can implement the solution, and it will take care of creating a storage pool, doing dynamic provisioning and all those basic things, and you can give that storage to your pods. But when it comes to, let's say, sync replication, or backup, or even management, you may have to pay. Sometimes a vendor gives command-line-based management for free, but if you want to manage through the management portal, you have to pay for an add-on. It's not a substantial amount, but it is still something one has to consider. That's what I meant by saying licensing of add-ons should be considered.

And since you're dealing with open source, the Slack channel is the best place to get support. Sometimes the enterprise-supported versions also have a support email and the like, so you can reach out to them and get support. So these are some of the lessons.

What are the common challenges? First, it's a fast-changing area. If you did a POC one month ago, you cannot be sure that everything is the same now, because people are introducing new features and fixing bugs. A lot is happening. So it is sometimes good to find out the roadmap of a given product, because a certain thing might not be there yet. Let's say that out of your 10 criteria, 9 are met and 1 is not.
If it is open source, you can check in the open source community whether there is a roadmap to deliver that tenth feature. Or if the product is backed by a company, you can ask them whether it is part of their roadmap. That way you can stay caught up. You should not throw out a solution just because it meets only 9 of your 10 criteria today.

Number two is lack of support for both VMs and containers. I think I mentioned it in the beginning: very few solutions support both virtual machines and containers. And the practical experience is that very few companies want to go for two different solutions, one for Kubernetes and one for the traditional virtual machine world.

Third, error-prone command-line-based steps. I think this is self-explanatory. You neither want to do the entire management through the command line, nor do you want to build the automation on your own to bring the solution to a shape where it can be used in production. Neither is good. So this is again a common challenge.

Next, lack of application awareness. An application in Kubernetes is not clearly defined; there is no entity called application. An application is the sum of various objects. It will have pods, a deployment, some config maps, some secrets, and all of that together is one application, along with the storage attached to it. Now, when you say you want to back up your application, is the storage solution aware of your application? Can you define your application? Can you say that whatever is in this namespace is my application? Is that kind of option given? Typically it is not there. So lack of application awareness is a common challenge.
Replication and disaster recovery features are also commonly missing. Sometimes async replication is there but sync replication is not, and so on. So these are some of the common challenges.

Okay, coming to key takeaways. Number one: out of those 20 decision criteria I explained, the most important one is enterprise strategy. Many companies are aligned with a certain storage partner or vendor and want to continue that. They don't want to introduce one more vendor, one more complex thing, into the mix. So sometimes that itself becomes the single decision criterion. Unless the vendor's solution is missing too many features, that alignment is good enough to choose it. That is the simplest case of all. Next is enterprise support, which is again very important, and then features. These are the three main criteria, and features can itself cover several points: you map the features you are looking for against what the product offers.

The next one is again a lesson learned: most vendor-supported products do not have exactly the same features in the commercial version and the open source version. That is obvious; that is how it works. If they gave away exactly the same version, what would their value-add be? As I mentioned before, the core features are free and always available, but some add-ons, let's say GUI-based management, are given as part of the subscription. A very few vendors give away their commercial version as is, with several restrictions. They say: I'm giving you exactly the same product that you get when you buy my subscription, but the free one will have certain limitations. It will work only with, say, five terabytes of storage.
Or it will work with only 10 nodes, or whatever. Those kinds of restrictions will be there. And those are fine: for a non-production environment you can live with them. In such cases, feel free to use those products in non-production.

Give special consideration to performance requirements, because not all vendors stress performance. Some vendors are very good at performance, and the performance of their storage subsystem is their key USP. If you're looking for performance, you should consider them.

End-to-end automation and GUI-driven configuration are often not available, and this is a challenge. If you are given a mix of GUI and command line, it becomes a nightmare in real life.

And last but not least, there is no one-size-fits-all solution. I've mentioned 20 criteria, and you might have 10 or 20 more to add to that list. What I mean to say is that you will not find a single product that fits all those 40 criteria. There will be some gaps somewhere: it might not be good in performance, or from a manageability perspective, or it might not cover all the features. So there is no one-size-fits-all solution.

Okay, so that is it. And this is about my company. I work for Wipro. We are a billion company. We have got 11, 20 active clients. This is our employee strength, and this is as of FY21 results. We have a presence in 66 countries. And with that, we have come to the end of this presentation. So thank you very much for your time. Thank you.