Hi, everyone, and welcome to this webinar, How to Protect Your Cloud Native Data 101. During the next hour, we're going to take a look at data in Kubernetes: what the different paradigms and concepts are, what is required to secure your data and guarantee its operation using some of those paradigms, and also what you may need on top of that. So let's get started. My name is Nick Vermandi. I'm a Principal Developer Advocate with ONDAT. I've been working with Kubernetes for approximately the past six years in various areas such as networking and CNI, and now more focused on CSI. So for today, what I have in store for you is this: we're going to take a look at the definition of cloud native data and its characteristics, and what the different options may be for running data in the cloud, because cloud native is not necessarily only about Kubernetes. Then we will switch specifically to Kubernetes, talking about different concepts involving data, and extend that to components that will help you protect your data. And finally, we'll take a look at open-source solutions around data protection and also around more real-time protection, things like synchronous replication, and how you could do this with Kubernetes and a CSI driver.

So let's get started. What do we mean by cloud native data? If we look at the Gartner Magic Quadrant for cloud database management systems, it says that 75% of all databases will be deployed or migrated to a cloud platform, with only 5% ever considered for repatriation to on-premises. So it means that most of the time, databases will be consumed as a service in the cloud, not necessarily in Kubernetes, but directly as platform as a service, such as RDS and the like. It also states that by 2023, cloud preference for data management will reduce the vendor landscape while the growth in multicloud increases the complexity for data governance and integration. And also, cloud DBMS revenue will account for 50% of the total DBMS market revenue. So Gartner essentially looks at the database and data ecosystem as merely services running in the public cloud providers. But what it doesn't say is that we should consider this whole cloud native database concept as an iceberg. The top of the iceberg is more or less the Care Bears world, where the narrative is told by the cloud service providers. Everything is easy, naturally consumable via APIs, and you basically pay for what you consume. But of course, the story is not as clean as what they say. All those services are specific to a particular cloud provider, and as we know today, there's no such thing as sticking with a single cloud provider. First of all, because you don't control the business. You may have mergers and acquisitions, and those companies may use a different cloud than the one you have started with. Another reason may be that you want to use a particular cloud for a particular function. You may look at Google to run Kubernetes, maybe AWS to run your storage and your buckets, et cetera. So you may want to have a specialist cloud for a specific set of functions you want to provide to your business. And then the challenge becomes that as you move between different clouds and migrate or operate databases in those different clouds, even though in the end it may be a relational database or a NoSQL database, the fact is that in terms of operations, they are operated in a different way because you are using different cloud providers.
So it's not exactly the same API. Protecting your data will happen in a different way. And how you combine this data with your overall application architecture will also be different. And typically, that means you need a broader scope in terms of the skill sets of your engineering teams, your DevOps teams, as well as your developers in the end. On top of this, there are also two other considerations. The first one is about cost. When running database services in the cloud, most of the time you will also have to account for the underlying running instances that are powering your database. So mainly, this means you will have to pay for the compute, the running instance, as well as the storage, for example the EBS volume associated with your database. That's one. The second consideration is about availability. As you probably know, most cloud databases are replicated and highly available within a particular availability zone. As soon as you want to recover into a different availability zone, it does incur some downtime. Potentially, you have to restore from snapshots. And that also means you have to schedule and manage the lifecycle of those snapshots by yourself. And chances are that your snapshot may not be restorable at the destination. So on top of that, you would have to test your backups on a regular basis, which obviously causes some overhead on top of your operations. And again, this is for one particular cloud. If you consume this database-as-a-service model from different clouds, you will have to repeat the same sort of automation and extended operations over multiple clouds. And automating snapshots and testing the restore of those snapshots in different clouds will also involve different skill sets, because they use different APIs, different SDKs, different providers if we're talking about infrastructure as code, and different DevOps pipelines, for example. So all of this needs to be taken into account.

And if we compare this to a, let's say, cloud native Kubernetes solution, there will be a couple of differences. So let's first take a look at the cloud native features, what we could expect from a Kubernetes environment, whether running on premises or in the public cloud, managed or unmanaged. First, it's all about scalability. Over the last couple of years, I've seen the rise of autoscaling for pods, but also autoscaling for nodes. This means that as your application requires more power, you can also deploy more nodes in Kubernetes, as you would do with an autoscaling group in public cloud providers. So not only for nodes, but also for the application itself. So you can scale your application to be able to handle some of the peaks during high usage periods, such as promotions or sales if that's a commercial application, or during Black Friday, for example, where potentially you need more power for your application: more web servers, maybe more database nodes to facilitate reads, but also more Kubernetes compute nodes. Elasticity is the ability to scale up, but also down when you don't need this extra power. Self-healing is also a very important concept in Kubernetes, whereby Kubernetes is fundamentally an asynchronous system with eventual consistency, where things happen concurrently and will eventually converge to a finite state.
So even though things may not succeed the first time, maybe the second time a controller tries to do something all the prerequisites will be met, even if it depends on other controllers, and in the end everything will eventually converge and self-heal. Observability is also key in Kubernetes. There is a proliferation of solutions like Prometheus, Grafana, and different dashboards that are available within Kubernetes, I would say at no extra cost. So this is also a very important factor. So that is the basic foundation of Kubernetes and the kind of capabilities it provides.

But how about persistent data and storage? Let's say you want to build your own database in Kubernetes and run it in production. Then first off, of course, it needs to be distributed. You cannot run a single-pod database on a single node. You want to have a distribution of your data across multiple replicas, across different Kubernetes nodes. You want replication to happen as well, because, as we're going to see later with some of the Kubernetes primitives, the data itself is not replicated by the platform by default. That means there are two main solutions: you can replicate your data at the application level, meaning that you rely on your database cluster or your database replicas to provide multiple instances of the same data; or, alternatively, you can rely on your storage provider or CSI driver to provide that particular feature of replicating at the block level. And I'm going to talk a little bit about this later as well. Encryption is also something you have to consider, especially if you're running databases that hold sensitive data. End-to-end encryption is very important in Kubernetes, and you have to find the right solution, which is not necessarily relying only on the cloud provider for encryption. You may also want to encrypt your data directly inside Kubernetes, so that no one can get access to the contents of your Kubernetes volumes if they were to read them outside of Kubernetes itself. Another very important function for your developers is self-service provisioning. The idea here is that as microservices become more and more popular, individual teams are responsible for a set of microservices, and each one of these teams will run their own queuing or messaging system and their own databases. And in the, let's say, cloud-native philosophy, you simply cannot rely on developers waiting to provision and consume their databases. They need to be deployed on demand. You cannot afford waiting two, three, four days or even multiple weeks to get a database up and running in an environment where code updates and new releases are potentially deployed to production multiple times a day. So self-service provisioning is a very important concept when it comes to deploying and managing databases in Kubernetes. And because Kubernetes has all the fundamental prerequisites to enable these kinds of paradigms, it makes it the perfect platform to run databases. And finally, all the functions we have been mentioning so far can be encapsulated into DevOps pipelines. And again, you have two solutions. Either you could use your cloud service provider's native service, such as Azure Pipelines and others, and of course you will be subject to the same drawbacks, I would say, that we've seen before in terms of different clouds having different APIs and different ways of implementing those DevOps pipelines.
Or, alternatively, you can choose to stay within Kubernetes and use a Kubernetes-native DevOps tool such as Tekton, which gives you the ability to develop your DevOps pipelines without leaving Kubernetes. Basically, every task defined in your Tekton pipeline is run as a distinct container. So you can compose your pipeline as you wish, again using YAML, which is the de facto declarative language in Kubernetes. So again, no need to learn anything new. Just using YAML, you can define your different actions, your different tasks. And those tasks can communicate with each other by also using your storage inside Kubernetes, and essentially run the whole pipeline without leaving your cluster, meaning that as you deploy Kubernetes as your de facto cloud operating system, you can stay consistent across all different clouds in terms of building your applications, but also building your DevOps practice.

So now let's take a look at some of the Kubernetes data primitives. The most important primitive when it comes to persistent data in Kubernetes, and non-persistent data actually, is Kubernetes volumes. Kubernetes supports a variety of volumes. You can use kubectl to get access to the whole description as usual: kubectl explain pod.spec.volumes --recursive will give you the full list of supported volumes. Some of them are legacy, I would say, because it also includes the deprecated in-tree drivers. But Kubernetes has moved away from in-tree drivers to a more modular approach where every storage vendor or provider develops its own driver, called a CSI driver, for Container Storage Interface. It's a pluggable architecture where only the required CSI driver is installed by the user when you need it. So for example, if you're using Amazon EKS, you can install the EBS CSI driver, and you will be able to take advantage of a variety of features that come with the CSI: all the Kubernetes primitives and, on top of that, additional capabilities. The main volume providers you're going to be using are displayed here on the screen. The first and most obvious one is the persistent volume claim. A persistent volume claim is a request for a back-end persistent volume that matches specific criteria, such as the size and the storage class of the volume you want to create. Essentially, when you create a PVC, two things may happen. The first one: if your CSI driver can dynamically provision a persistent volume, then when you create a PVC, a persistent volume, or PV, is dynamically created in the back end. If the CSI driver doesn't support dynamic creation of PVs, then you have to statically link your persistent volume claim with a back-end persistent volume that has to be created beforehand. But nowadays, most CSI drivers support dynamic provisioning of persistent volumes. And typically, what you will do is create a particular storage class defining the type of storage you want to use and the CSI driver you want to use, and then reference that particular storage class in the PVC definition along with the size. The back-end persistent volume will then be automatically created. Another important consideration is how pods will access the PVCs. If you have a single pod, you can have a PVC that is locally attached. If the node fails, then, unfortunately, you will also lose the data. Now, if you create a higher-level construct, such as a deployment, then you will have to use a shared file system.
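To make that dynamic provisioning flow concrete before we come back to deployments, here is a minimal sketch: a storage class referencing a CSI driver, and a PVC that requests a volume from it. The class name, parameters, and sizes are placeholders of my own (using the EBS CSI driver mentioned earlier as an example), so adapt them to your environment.

```yaml
# Illustrative only: a storage class backed by the EBS CSI driver,
# and a PVC that dynamically provisions a PV from it.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-encrypted              # hypothetical name
provisioner: ebs.csi.aws.com       # the EBS CSI driver mentioned above
parameters:
  type: gp3
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-database-data           # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce                # locally attached block volume
  storageClassName: gp3-encrypted
  resources:
    requests:
      storage: 20Gi
```

When the PVC is created and a pod consumes it, the CSI driver provisions the backing PV automatically, exactly as described above. Now, coming back to deployments and shared access.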
Because in the definition of your deployment, you will specify a single PVC, meaning that if the PVC is a locally attached file system, only the first pod will be able to consume it. The other pods potentially residing on the same host won't be able to access it, because it has already been claimed by the first one. And pods residing on other nodes won't have access to the locally attached volume at all. So the only solution is to have a shared network file system if you need multiple pods that are part of a deployment to access a particular PVC. And then you have the option to create a directory on a per-pod basis: you would have a shared file system, and every pod that has access to that particular PVC will create or use its own directory. So in terms of access definition, it means that if you want to use a PVC within a deployment and every pod needs to write to that PVC, you will need to use ReadWriteMany access backed by something like an NFS share.

Other volumes that can be used include emptyDir, which is a scratch directory typically mounted from the root file system or from RAM on the node. It starts empty, and of course the pod may write data to the directory that will be mounted into it. But when the pod is restarted, the data located there is wiped as well. Then hostPath, which identifies a particular path on the Kubernetes node that will be mounted as a volume into the pod. It is typically avoided in production as it has some security implications, but also because it's only really valid for naked pods. Again, if you create a higher-level construct managed by a controller, such as a deployment, a stateful set, et cetera, then it doesn't make sense to use a hostPath. So it's typically used for testing or troubleshooting, eventually. Then we have config maps, which are sets of key-value pairs that can be exposed to your pod as environment variables, but also mounted as a volume into the pod. Your application can then get access to this information just by reading the files present in the mount point. Secrets are sort of similar to config maps, except that they are encoded in Base64 but not encrypted by default. This is really important. Then we have the downward API, which can be very useful because it provides contextual information to your application running inside your pod. The downward API allows you to reference, again in YAML inside your pod, particular fields that you want to inject into your running pod. It can be things like your pod IP, the requests for your CPU, the limits for CPU and memory, et cetera. So it essentially gives a lot of contextual information to your application, as opposed to hard-coding that information. And finally, we also have ephemeral volumes, which are a bit more recent than the others. They have been created to meet the requirements of specific use cases where the application doesn't really care whether the attached volumes are persistent or not. For example, it may be a caching application where the data can easily be scrapped when the pod gets restarted and the application doesn't really care about that. It also gives the ability to pre-populate data as input for the application. But essentially, the main difference is that the lifecycle of the volume is the same as that of the pod, meaning that the pod can get restarted on a node where the volume didn't previously reside, as opposed to a PVC.
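To illustrate the last two primitives, here is a hedged sketch of a pod that injects its pod IP as an environment variable, mounts CPU limits through a downward API volume, and requests a generic ephemeral volume whose lifecycle follows the pod. The image, names, and storage class are placeholders of my own.

```yaml
# Illustrative pod: downward API for contextual info, plus a generic
# ephemeral volume provisioned through a CSI storage class.
apiVersion: v1
kind: Pod
metadata:
  name: cache-demo                                # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/cache:latest    # placeholder image
      env:
        - name: POD_IP                            # pod IP via the downward API
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
      volumeMounts:
        - name: podinfo
          mountPath: /etc/podinfo
        - name: scratch
          mountPath: /var/cache/app
      resources:
        limits:
          cpu: "500m"
          memory: 256Mi
  volumes:
    - name: podinfo
      downwardAPI:                                # contextual fields exposed as files
        items:
          - path: labels
            fieldRef:
              fieldPath: metadata.labels
          - path: cpu_limit
            resourceFieldRef:
              containerName: app
              resource: limits.cpu
    - name: scratch
      ephemeral:                                  # lives and dies with the pod
        volumeClaimTemplate:
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: gp3-encrypted       # hypothetical class from earlier
            resources:
              requests:
                storage: 1Gi
```

The ephemeral volume here is created when the pod is scheduled and deleted along with it, which is exactly the lifecycle difference being described.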
For example, a PVC, once it has been claimed, will basically reside forever on a particular node, which means the pod is tied to the specific node where the PVC resides and cannot be restarted on another node. Here the difference is that the pod can be restarted on whatever node. In addition, ephemeral volumes can be supported by CSI providers to deliver some additional capabilities, such as snapshotting, cloning, resizing, and storage capacity tracking, because those are fundamentally CSI capabilities.

OK, so now let's focus a little bit more on the basic CSI capabilities. What does a CSI driver need to deliver to Kubernetes at the bare minimum? CSI is a standard defined for storage plugins in 2018, when Kubernetes moved away from in-tree driver development, where for every modification the whole Kubernetes codebase had to be updated. So it moved away from in-tree drivers to a more pluggable architecture where you don't need to touch Kubernetes itself just to update or upgrade your CSI driver. Essentially, the CSI driver is now delivered as an additional application that you install inside Kubernetes once you deploy the cluster. The CSI driver needs to be compliant with a couple of APIs, or RPCs, that deliver specific functions to Kubernetes, or functions that Kubernetes expects: dynamic provisioning and deprovisioning of a volume, attaching and detaching a volume from a node, mounting and unmounting a volume on nodes, support for the creation and deletion of snapshots, and finally provisioning a new volume from a snapshot. Those are typically the features that any CSI driver will deliver. Now, as I was saying before, the CSI driver itself is installed as an application on your Kubernetes cluster and typically has multiple components: the node plugin, typically delivered as a DaemonSet exposing a gRPC endpoint, and the controller plugin, also a gRPC endpoint. The CSI driver then implements multiple interfaces: the identity service, the controller service, and finally the node service. Now, when it comes to data protection, the CSI driver delivers multiple functions that are represented as extensions of the Kubernetes APIs. Snapshots are effectively represented as CRDs, or custom resource definitions, and are composed of three main objects: first the VolumeSnapshot, then the VolumeSnapshotContent, and finally the VolumeSnapshotClass. The VolumeSnapshot is comparable to a PVC in the sense that it is actually a request for a snapshot, a real snapshot. And the snapshot that is taken is effectively similar to a persistent volume in the sense that it represents the physical snapshot, and the corresponding object is the VolumeSnapshotContent. The snapshot machinery is composed of a snapshot controller as well as a validation webhook, and is effectively delivered alongside the CSI driver. The snapshot controller watches both VolumeSnapshot and VolumeSnapshotContent objects, and it's the component responsible for the creation and deletion of VolumeSnapshotContent objects. On the other side, the CSI snapshotter sidecar is the component that watches VolumeSnapshotContent objects and triggers CreateSnapshot as well as DeleteSnapshot operations against a particular CSI endpoint. And finally, the validation webhook is nothing more than an HTTP callback that is there with the goal of tightening the validation of VolumeSnapshot objects.
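Before getting to the snapshot class in more detail, here is a minimal sketch of what the request side looks like in practice: a VolumeSnapshot pointing at an existing PVC, and a new PVC restored from that snapshot through its dataSource, which is the "provision a volume from a snapshot" capability mentioned above. The resource names and the snapshot class name are placeholders, reusing the hypothetical names from the earlier example.

```yaml
# Illustrative only: request a snapshot of an existing PVC...
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: db-data-snap                       # hypothetical name
spec:
  volumeSnapshotClassName: csi-snapclass   # hypothetical snapshot class
  source:
    persistentVolumeClaimName: my-database-data
---
# ...then provision a brand-new volume from that snapshot.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-database-data-restored
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3-encrypted          # hypothetical class from earlier
  resources:
    requests:
      storage: 20Gi
  dataSource:
    name: db-data-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
```

Behind the scenes, the snapshot controller and the CSI snapshotter sidecar described above create the corresponding VolumeSnapshotContent and call the driver's CreateSnapshot RPC.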
And finally, we also have the VolumeSnapshotClass, which specifies different attributes belonging to a VolumeSnapshot. It is sort of similar to a storage class, if you want to compare VolumeSnapshots to PVCs again. One other thing to notice is that the VolumeSnapshotContent, similarly to a persistent volume, can be provisioned dynamically or can be pre-existing. For snapshots, pre-provisioned or already existing means that you can link a VolumeSnapshot object, which again is the request, to an existing snapshot that has been taken maybe by your storage array. So it effectively represents the external snapshot taken on the storage array as a Kubernetes first-class citizen. But most of the time, of course, if the CSI driver supports it, when you create a VolumeSnapshot, the corresponding real snapshot, the physical snapshot, will be created and will correspond to the VolumeSnapshotContent.

Now, obviously, snapshots are asynchronous. That means they represent the content of the data at a particular point in time. It's not synchronous replication happening continuously over time. And that may be an issue when your RPO needs to be equal to zero. RPO, or recovery point objective, is the amount of data that you can afford to lose in case of failure. So if you have an RPO equal to zero, it means you need something more synchronous than a snapshot. Basically, you need a continuous representation of your data over time. And this is the type of thing that is not available directly in Kubernetes. But particular CSI drivers can provide that feature on top of the functions required by the Kubernetes API, so the CSI driver itself can deliver synchronous replication. This is the case of the Ondat CSI driver that is represented here on the screen, but other open source CSI drivers like OpenEBS can also support replication; it's just an example of how it can be delivered. The idea here is to provide this additional capability by giving the user the ability to configure, in YAML or in a declarative format, the number of replicas on a per-volume basis. So that when a node fails, even though the storage is locally attached, it's effectively aggregated as a pool of available storage, and every volume consumed within that pool is also replicated on a set of other nodes. When the node fails, the pod can be restarted on another node, where one of the volume replicas can in turn be promoted as the new primary volume. And this is what will enable you to seamlessly recover in case of node failure with effectively zero data loss.

Okay, so far we've seen different paradigms: snapshots, which are asynchronous, and synchronous replication for zero RPO. But fundamentally there is also something else, which is creating backups from your snapshots. Your snapshots as such are living within Kubernetes. In case of failure, of course, if you want to restore, you need to restore from a storage repository that is still available. So typically you want to externalize your snapshots and copy the data into an external storage repository like AWS S3 or Google Cloud Storage. And again, you can do these operations without leaving Kubernetes, with the same principle of leveraging controllers, CRDs, and the operator model.
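Before moving on to that backup tooling, here is a hedged sketch of how the per-volume replication idea can be expressed declaratively: a storage class whose parameters ask the CSI driver to keep synchronous replicas of every volume it provisions. The provisioner string and the replicas parameter below follow Ondat's conventions as I understand them, so treat the exact keys as assumptions and check your driver's documentation.

```yaml
# Illustrative only: a storage class asking the CSI driver to keep one
# synchronous replica of each volume, and a PVC consuming it.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: replicated-fast                # hypothetical name
provisioner: csi.storageos.com         # Ondat CSI driver (assumption)
parameters:
  storageos.com/replicas: "1"          # number of synchronous replicas (assumption)
allowVolumeExpansion: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data-replicated             # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: replicated-fast
  resources:
    requests:
      storage: 20Gi
```

If the node holding the primary copy fails, the driver promotes a replica and the pod can be rescheduled without data loss, which is the zero-RPO behaviour described above.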
So in our particular example, we'll take a look at Kanister, which is an open source tool by Kasten and is effectively composed of CRDs, a controller, as well as a command line tool that we're going to explore in a minute. At the center of the Kanister architecture, we have multiple CRDs, including the Blueprint, which is a custom resource defined as a set of instructions that tell the Kanister controller how to perform actions on a specific application. Blueprints are typically a curated set of manifests maintained by the community, so every application will have a corresponding Blueprint that encapsulates actions such as how to quiesce the file system or the database, et cetera. Then, as another set of custom resources, we have ActionSets, which define actions that are triggered by the creation of the corresponding custom resource manifests. So typically, if you want to perform a backup or a restore action, you will do that by creating those manifests. And to help with the lifecycle of those custom resources, you can also use a command line tool called kanctl, which can be used in dry-run mode to generate the different manifests. Those manifests can then be applied to the Kubernetes cluster using kubectl, or you can just use kanctl without the dry-run option and it will directly create those custom resources in your Kubernetes cluster.

So here we have an example for the Elasticsearch application. The first thing you're going to do in the workflow is create a Profile that encapsulates the information required to reach a remote storage location. In this particular example, we are using GCP and Cloud Storage to externalize the data that we are going to back up from Elasticsearch. So it encapsulates the information required to configure the external bucket as well as the corresponding credentials. Then, once the Profile has been created, you're going to create the Blueprint, which is publicly available, is really specific to Elasticsearch, and defines how to perform actions on that particular application. Then we can use kanctl in dry-run mode to generate the manifest for the backup ActionSet and later apply it with kubectl, or, as here in the example, just use kanctl without the dry-run mode, which will directly create and push the manifest into your Kubernetes cluster. So here we specify the action as backup, we specify the Blueprint that we just created, and we specify the StatefulSet that is basically representing the application we want to back up. So default/elasticsearch-master represents the namespace and the name of the StatefulSet. And finally, we also reference the Profile that defines where we want to store our backup content. Once the ActionSet has been created and the manifest pushed to Kubernetes, the controller will react to that and effectively trigger a backup, whose status you can monitor using kubectl as well. Just by monitoring that particular custom resource, you will see its status updated once the backup has completed. Then, in case of disaster, when you want to restore the content from the remote location, you can again just use kanctl as displayed here on the screen: specify the namespace and create the ActionSet. This time the action is restore, and it starts from the backup name, which is basically the name that was returned by the previous command when triggering the backup ActionSet. And again, since it's a custom resource, you can monitor the progress of the restore by using kubectl to check the status of that particular resource.
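To recap that workflow as commands, here is a rough sketch. The flag names, resource names, bucket, and blueprint file are illustrative and follow the spoken description plus typical Kanister usage as I understand it, so double-check them against the Kanister documentation for your version.

```sh
# Illustrative only: the Kanister backup/restore workflow described above.

# 1. Create a Profile pointing at the external object store (GCS here;
#    project, key file, and bucket are placeholders, and the exact flags
#    are an assumption -- check the Kanister docs).
kanctl create profile gcp \
  --namespace kanister \
  --project-id my-gcp-project \
  --service-key ./sa-key.json \
  --bucket my-es-backups

# 2. Apply the public Elasticsearch Blueprint (file path is illustrative).
kubectl --namespace kanister apply -f elasticsearch-blueprint.yaml

# 3. Trigger a backup by creating an ActionSet against the StatefulSet.
kanctl create actionset \
  --namespace kanister \
  --action backup \
  --blueprint elasticsearch-blueprint \
  --statefulset default/elasticsearch-master \
  --profile gcp-profile

# 4. Watch the ActionSet status until the backup completes.
kubectl --namespace kanister get actionsets.cr.kanister.io --watch

# 5. Restore from the backup, referencing the ActionSet name returned in step 3.
kanctl create actionset \
  --namespace kanister \
  --action restore \
  --from "backup-<generated-name>"
```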
And at some point, the initial data will be restored in the right place. So that concludes our presentation for today. Hopefully you learned something and it's been useful. A couple of key takeaways before moving on. Kubernetes is ready for stateful applications with cloud-native data. This is a very important point. It has evolved over time, so now it's not only about cattle; you can also run pets in Kubernetes. But the key is to make sure that you can reach the right level of availability, scale, and performance. We've addressed today some of the challenges for persistent storage, and as you've seen, they are not all addressed by default in Kubernetes, especially when it comes to zero RPO and synchronous replication. And even when you have a CSI driver that can provide snapshots, you also need to back up those snapshots into a remote location. This is where you may want to use, again, Kubernetes-native tools such as Kanister or Velero or others. And this is made possible by extending Kubernetes for data protection and making data protection a first-class citizen in Kubernetes, thanks to the ability of Kubernetes to extend its APIs. So finally, you want to make sure that your CSI driver can protect your stateful workloads in case of node failure, or also cloud outage, and possibly availability zone outage too. Now, a call to action for you. If you want to learn more about data on Kubernetes and how to run your stateful applications and your stateful workloads in Kubernetes, please join the DoK, or Data on Kubernetes, community. You have the link there. I'm personally running the DoK London Meetup, so if you're local to the UK, you can go and subscribe to the Meetup page so that you are always up to date when it comes to the next dates for our Meetup. The next one will be in September, so if you're local, don't hesitate to join us. Also, if you want to learn more about Ondat and our CSI capabilities, you have a set of links there that you can go to for more information. I hope you enjoyed this presentation. I wish you a good day, and take care of yourself. See you next time.