Hi friends, welcome to Open Source Summit Europe 2022, and thanks for inviting us to speak here. We were supposed to come in person, but unfortunately we had some medical emergencies and could not join, so we quickly made this video to deliver the session at least online. Good morning, good evening, wherever you are, welcome to the session. Today we'll be talking about open container data protection: the state and the trends. With me is Sushant Kumar. Hi everyone. Sushant is a lead architect at Huawei Technologies and works in SODA Foundation as an architect and a maintainer. Myself, Sanil: I head the data management team in the same organization, and I am the co-chair of the technical oversight committee of SODA Foundation.

So let's get started. We have partitioned the whole session into three parts. The first is setting the context: what is the industry situation for data management, particularly data protection? Then we'll see what's happening overall, especially in CNCF and Kubernetes, and specifically around CSI. And then we'll cover what we are doing in SODA Foundation.

The first part is setting the context, so most of this information may already be known to you. This slide is self-explanatory: today everything is connected, and we want everything to be connected, including data, devices, and people. In this context, the demand from the industry is a unified, global data management platform that provides data management facilities across edge, core, and cloud. Wherever the data is, we want data management in a common, unified way, irrespective of the data type, whether file, block, or object. We want something unified for seamless data management across all of these.
Adding to the complexity, we have different platforms, different applications, and different use cases. Platforms means Kubernetes, OpenStack, VMware, and the like. Applications can be anything from big data to AI/ML, and the use cases are unlimited. Whatever the variables, the demand is unified, global data management across all of them.

Now let's look at some industry trend data on the container application space: first what is happening with container applications, and then what that means for container storage. On the left-hand side you can see our data and storage trends survey from 2021; we also ran one in 2022, whose results are yet to be announced. Based on the 2021 results, we see that container-based deployments are growing exponentially. Most of the top deployments, especially the new ones, are on Kubernetes, hybrid Kubernetes, and Kubernetes on premises. We also sense that these are no longer notional; real production deployments are happening.

On the right-hand side are some industry reports; the actual links are in the slide notes. The application container market is growing at 26.5% between 2019 and 2025, and container-as-a-service is growing around 35% year on year. The installed base and real production use of containers are growing exponentially. What does this mean? More and more application deployments are container-based, so container-centric storage matters more, and hybrid cloud data management requirements are coming up very rapidly.
What we can derive from these trends is that, because container deployments are growing, container storage and container data management are becoming important and inevitable. Storage as a service was the earlier direction; now it is moving towards container storage as a service.

If you look at the solution trends, storage vendors, whose products have traditionally been more closed in nature, are also coming out of that shell to build logical solutions that provide end-to-end data management: their own storage box features combined with outside-the-box features like data management and O&M. Notably, they also embrace third-party and open-source solutions in their overall end-to-end solution stacks. That's what we observe from our industry analysis. We have put up some examples, like NetApp Astra, VMware Tanzu, Pure Service Orchestrator, et cetera, in no particular order. If you look at these products, each is a logical end-to-end solution stack combining the vendor's own storage products, third-party products, and open-source products to provide Kubernetes-based plus multi-cloud data management. That's what we see in the solution trend.

And what's happening to the container landscape as a whole? This is not specific to storage; it is about container application deployment in general. The landscape, picked up from landscape.cncf.io, is growing and changing all the time because more and more projects are coming in. That indicates the container landscape is becoming important, which is why so many projects are entering this space. Things are really happening on the ground.
Hence container data management and container storage are critical. We came from the application side: application deployments are growing, and so is container storage. Now let's go one step down into the container storage landscape, what's happening there and what it indicates.

Before going into this, some of these terms are often used interchangeably or confused, so let's clarify. Cloud native storage is software-defined storage that is completely API driven, typically through REST or another common API, whether it runs on premises or in the cloud. Similarly, container storage means container-centric storage solutions that are cloud native in nature: declarative APIs, GitOps, auto-healing, and so on, with developer productivity, cost optimization, and truly hybrid operation.

Now, the Container Storage Interface (CSI) is the specification that currently drives all container storage orchestration in Kubernetes and similar container-based platforms. Any container orchestration engine compatible with the CSI specification can work seamlessly with CSI-compliant drivers for the storage boxes. So CSI is the specification that drives the development of storage drivers connecting storage to container orchestration engines such as Kubernetes.

This is a typical stack; we won't explain the whole thing, but it shows where storage sits. You have business applications, the application platform, orchestration, and then the container runtime engine, under which sit storage, compute, and network. Storage is what we are focusing on. Next, we have picked out the storage-specific projects from the container landscape we saw earlier.
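To make the CSI idea concrete, here is a minimal Kubernetes sketch: a StorageClass names a CSI driver, and a PersistentVolumeClaim against that class is dynamically provisioned by the driver. The driver and class names below are illustrative, not from the talk:

```yaml
# Hypothetical StorageClass pointing at a CSI-compliant driver.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-csi-sc
provisioner: csi.example.vendor.com   # any CSI driver registered in the cluster
reclaimPolicy: Delete
---
# A claim that asks the driver above to provision a volume dynamically.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: example-csi-sc
  resources:
    requests:
      storage: 1Gi
```

Because the orchestration engine only speaks CSI, swapping storage vendors is, in principle, a matter of changing the `provisioner` field.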
There are a lot of projects coming up in storage as well. But compared to compute, networking, and other areas, the storage side of Kubernetes still has a large gap and needs more development, research, and solutions before data management for Kubernetes is seamless. That's a fact. Some of these projects, circled here, are part of SODA Foundation; we will talk about SODA Foundation a little later. They joined as eco projects in SODA Foundation to build collaborative solutions, and SODA Foundation itself appears in the cloud native storage space of the CNCF landscape.

Now, what are the overall industry trends at the product level? This is for reference only; there is no comparison between products. We just want to derive the key focus areas in the industry today. The examples include NetApp Astra, Dell's Karavi (which, as far as we know, has since been renamed to Dell's container storage modules, so it is no longer known as Karavi), Velero, Longhorn, Restic, OpenEBS, and several other projects. What we did was take these projects and look at the key use cases and areas each is trying to focus on.

Our understanding, based on the data available, is that most of them focus on data protection first. One likely reason is that we are in a transition period from enterprise deployments towards container deployments, so users are ready to try something non-critical on the data path first. Backup and restore are exactly those kinds of use cases, so users may be ready to try them, and maybe that is the reason.
In most cases, data mobility, data migration, data protection, and monitoring are the key features we see in the new projects as their first real use-case deployments. This slide gives some more detail on those products. One more observation is that most of these projects try to provide cloud-connected data management along with their own storage product. That is a direction worth watching, because on-premise plus cloud, and within the cloud multi-cloud, meaning multiple cloud vendors with seamless data management across them, is becoming more interesting to users.

To summarize: the product direction is inclined towards hybrid cloud data management, with cloud native storage support for data protection and mobility, and with backup and restore as the primary use case under container data protection. That is what we derive from this analysis.

Now, what are we doing at SODA Foundation? Basically, we are trying to build a completely open-source data management solution to meet some of these challenges. We'll first look quickly at what SODA Foundation is, then at some of the projects we are working on, and then move to the specific project we are building for container data protection. SODA Foundation is a sub-foundation under the Linux Foundation, mainly focusing on data management solutions across edge, core, and cloud. It is a completely open-source effort: we do research in the open and collaborate with different organizations. It is run by industry organizations coming together, including vendors, users, standards organizations, and solution providers.
The end-to-end solution we envision is a logical stack, which may comprise different projects, that connects data management for edge, core (on-premise), and cloud, for various kinds of applications and various kinds of storage. On the southbound side there are different storages, on-premise and cloud (the multiple clouds are not shown in this diagram), so that we can do seamless data management across any storage, any platform, any time. That is the kind of vision we have.

The key propositions: it is completely open source; it helps connect the data silos; it is vendor agnostic, whether cloud vendor, storage vendor, or platform; and it is extensible, because in most cases we build on a microservice-based or plugin-based architecture so that custom plugins can be attached to the whole solution. We also have standardization efforts: we work with SNIA and similar standards organizations to support standardization in this direction, we try to work with CSI, and we build an ecosystem of hardware, software solutions, and services. Later on we can also have certifications based on these standardizations.

You can find our source code on GitHub, mainly under github.com/sodafoundation, where all the projects are available. As mentioned earlier, we try to cover hybrid data management demands spanning edge, the data center, and multiple clouds.

Now let's look at some key projects. This is our project landscape. The light-green ones are our new framework projects, Kahu and Como; the blue ones are our existing projects, Terra, Strato, and Delfin (I hope I'm getting the color names right).
The light-blue ones are external projects which joined as our eco projects. Now, our key projects and focus areas. First, the three projects that already exist and that some users are already trying out.

The first is Strato, for multi-cloud data management, which gives you seamless data management across different cloud vendors. For example, lifecycle management, data mobility, and data orchestration across cloud vendors can all be done through our unified multi-cloud interface. It is a cloud-vendor-agnostic solution, and it is S3 compatible, so even if you have on-premise S3 storage, or products like optical data archival systems that are S3 compatible, you can do data management from on-premise to cloud.

The second project is Delfin, a heterogeneous storage monitoring product. It is not specific to container deployments; it is for enterprise environments, and it is storage vendor agnostic, so you can monitor across many different kinds of storage. It supports resource alerts and performance metrics, and you can connect it to third-party tools like Prometheus or Kafka for further analysis.

The third is Terra, which we call an SDS controller, for seamless orchestration of storage on container and non-container platforms. We support Kubernetes, OpenStack, and VMware, again across different storages. The idea is to support heterogeneous storage and different platforms.

Now, the two key projects we are working on right now. The first is Kahu, a Hawaiian word meaning guardian. It is primarily for container data protection.
We want to augment Kubernetes container data management and provide extended features beyond what CSI supports, exploiting the features of the real storage in a Kubernetes-native manner. That's the idea. Como is a multi-cloud data lake project; we are collaborating with SoftBank, starting with object data management, and later we will extend the features into a complete end-to-end data lake product. Some components of the Strato multi-cloud project will be reused in the multi-cloud data lake project as well.

Both Kahu and Como are at a very early stage: Kahu is in the initial phase of development, and Como is in the initial phase of design and architecture. So we welcome more community members, developers, and experts to join and support us, so that we can build these completely open-source solutions that will be useful to many others in the industry.

Now we will focus on Kahu, since container data protection is our topic today. Before going into the Kahu project, let's glance through what Kubernetes currently supports for data protection, again to emphasize its importance. There is a data protection working group under the CNCF Storage TAG working to identify the gaps. Application workloads such as StatefulSets, Deployments, and DaemonSets are supported, but things like volume backups and backup repositories are not supported in CSI; right now they are handled by external products like Kahu or Velero, though the working group is looking at them too. We will sync up with them and see what we can converge on and what we need to develop newly.

So let's get into Kahu. Sushant, you can take it from here. Yeah, thanks Sanil. Hi everyone, myself Sushant.
Sanil has explained the importance of data protection, and of container data protection in particular. The Kahu project is mainly designed to solve some of the typical data protection problems, backup and restore, for containerized applications.

If you think of a solution for this, you have a set of applications or workloads running on certain platforms. Narrowing down to container applications, the first thing that comes to mind is applications deployed on a well-known container orchestrator like Kubernetes, with storage provided by different providers below. In the middle of these two, we can think of a solution that plugs the gap between protecting the data and linking it to the storage backends. Sanil, next slide please.

So how exactly does Kahu do it? Before going there, what comes to mind when we think of backup for containerized applications? A typical containerized application has its own configuration, through which it is deployed or which it uses during the course of its operation, plus the metadata for running it. From a persistence perspective, it might be using some kind of volume to which the application or business data is persisted. So the typical area of interest is to back up these two kinds of information: the metadata, and the persistent volumes, which are to be saved and protected for future use.

As Sanil mentioned, SODA Foundation tries to solve multiple problems related to data management; narrowed down one level, this is container data protection.
The first part of it is backup and restore. That is in fact the first project under SODA CDM (container data management): Kahu. At a very high level, the first-phase requirement is backup and restore in Kubernetes, along with support for different providers and control over the scope of the backup, that is, up to what level of scope we must support. That is our main goal.

If you break down what can be expected from this project, there are three categories. The first is the feature itself: what comes to mind with respect to backup and restore. The second is a framework to support multiple storage providers. The third is mainly the usability perspective.

Coming back to the first part, the backup/restore feature. The key thing is backing up and restoring metadata. When we say metadata, the user's interest may depend on the scope or magnitude of the application: the user may want to back up the entire cluster, an entire namespace, an application selected by labels, or even, at a granular level, a particular resource, like only one pod, or a specific set such as persistent-volume-1 and persistent-volume-2. The user's need can be catered to down to that level.

The second is snapshot support, mainly for the volume backup part, along with support for storage-side features, for example where the storage is capable of incremental, differential, or full backups.
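As an illustration of those scope levels, here is a hedged sketch of what such a backup spec could look like; the apiVersion and field names are our own illustration, not the final Kahu schema:

```yaml
# Backing up one namespace, narrowed to one application by labels
# (illustrative field names, not the actual Kahu CRD schema).
apiVersion: kahu.io/v1beta1
kind: Backup
metadata:
  name: app-scope-backup
spec:
  includeNamespaces: ["prod"]      # namespace-level scope
  labelSelector:
    matchLabels:
      app: billing                 # application-level scope via labels
# Omitting includeNamespaces and labelSelector would widen the scope
# toward the whole cluster; naming individual resources narrows it further.
```

The point is that one CRD can express everything from cluster-wide down to single-resource scope.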
So we think the platform side should also be able to orchestrate those capabilities.

During backup and restore of business applications there is always a need to maintain consistency. That means there are use cases where the user wants to perform some operation before taking the backup and another set of operations after taking the backup, for business continuity. Backups across storage providers is another typical need, because the feature sets of storage vendors vary: if the volumes are provided by one provider but another provider has the capability to back them up, there should be a provision for that. Cross-cluster backup means backing up in one cluster and restoring in another. As for volume types, volumes can be CSI-provisioned or non-CSI-provisioned.

There is also backup based on provider capability. Different storage providers have different capabilities: some have their own snapshotting capability through which they want to support backup, and some have very good backup software, not tied to any storage system, specialized just for taking backups of persistent volumes. Based on the capability of the storage or of the provider, the orchestrator should be able to cater to the requirement. That's the feature side.

One key motive of Kahu is the storage provider framework, through which any backup provider, whether for volume backup or metadata backup, can be plugged in dynamically. At any point in time we should be able to bring in a new backup provider. The second part is coexistence of multiple backup providers.
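The consistency requirement mentioned above, operations before and after the backup, is what backup hooks are for. A hedged sketch of how that could be expressed (apiVersion, field names, and the database commands are illustrative assumptions, not the actual Kahu API):

```yaml
# Hypothetical hook spec: quiesce a database before the backup
# and release it afterwards (all names here are illustrative).
apiVersion: kahu.io/v1beta1
kind: Backup
metadata:
  name: db-backup
spec:
  includeNamespaces: ["prod-db"]
  hook:
    resources:
      - name: freeze-db
        pre:                      # runs in the pod before the backup starts
          - exec:
              command: ["/bin/sh", "-c", "mysql -e 'FLUSH TABLES WITH READ LOCK'"]
        post:                     # runs after the backup completes
          - exec:
              command: ["/bin/sh", "-c", "mysql -e 'UNLOCK TABLES'"]
```

The pre-hook makes the on-disk state consistent before it is captured, and the post-hook restores normal operation, giving business continuity around the backup window.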
That means at any point in time, if the user wants multiple backup providers for multiple applications, or based on need, the system should let them coexist. The third part is usability on the user side: some control over taking the backup, whether scheduled, event-driven, or policy-driven. Those are the high-level goals Kahu has for the course of its execution.

So where are we currently? As Sanil mentioned, it's a new project started under SODA Foundation for this specific requirement. We started with basic backup and restore support with respect to scope, volume backup through snapshots, and the implementation of hooks; currently the support is for CSI-provisioned volumes. That is what we are working on.

On the storage provider framework, we think this is the first thing to get right, so that we can add providers for either metadata backup or volume backup. The framework is built in line with the CSI provider model, where you can just plug in drivers along with their capabilities and it works. Currently any provider can be plugged in: there is a set of interfaces to be implemented, and through those it gets plugged in. For the first phase we have implemented an NFS provider for metadata backup, and next it can be extended to S3.

Now I think we will have a quick look at what Kahu supports currently; we have a short demo to get a feel of it. Sushant will share; I will stop sharing. Okay, is it visible? It is visible, you can go ahead. Okay, it may be a little small for those watching from the back, sorry for that.
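To make the pluggable-provider idea concrete, here is a minimal sketch in Go. The interface, type names, and capability model below are our own illustration of the pattern ("implement a set of interfaces and get plugged in"), not the actual Kahu provider API:

```go
package main

import "fmt"

// BackupCapability says what a registered provider can do
// (hypothetical, mirroring the metadata/volume split from the talk).
type BackupCapability int

const (
	MetadataBackup BackupCapability = iota
	VolumeBackup
)

// Provider is the illustrative contract a backup provider implements
// to be plugged into the framework dynamically.
type Provider interface {
	Name() string
	Capabilities() []BackupCapability
	Backup(resource string) error  // push one serialized resource to the target
	Restore(resource string) error // pull it back
}

// nfsProvider is a toy stand-in for the demo's NFS metadata provider.
type nfsProvider struct{ mount string }

func (p *nfsProvider) Name() string                     { return "nfs-provider" }
func (p *nfsProvider) Capabilities() []BackupCapability { return []BackupCapability{MetadataBackup} }
func (p *nfsProvider) Backup(resource string) error {
	fmt.Printf("writing %s to %s\n", resource, p.mount)
	return nil
}
func (p *nfsProvider) Restore(resource string) error { return nil }

// registry lets several providers coexist at runtime, keyed by name.
type registry map[string]Provider

func (r registry) register(p Provider) { r[p.Name()] = p }

func main() {
	r := registry{}
	r.register(&nfsProvider{mount: "/backups"})
	r["nfs-provider"].Backup("pod/busybox")
}
```

An S3 provider would be another struct satisfying the same interface, registered alongside NFS, which is how multiple providers can coexist without the controller knowing about either concretely.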
In this demo, I have used a kind cluster for the environment, and we have deployed our application, Kahu. You can see it is currently deployed in the test-ns namespace, which can be any namespace of your choice. The backup controller deployed here is mainly the controller for taking backups and restores of applications. The other part deployed here is the NFS provider; as I mentioned, our current support is for NFS. The NFS provider has two parts, just like CSI: the provider itself and a sidecar from Kahu, so it is deployed with two containers. If you had S3, there would be a similar pair that could be dynamically deployed here.

The NFS provider is expected to connect to an external NFS server. There are different NFS servers we could use; for this demo we have used a container application that serves the NFS server purpose for us, and it is deployed here. I just wanted to show how we can verify our backup: this is the NFS server, this is the mounted path, and this is where our backups will be put. Once we take the backup, we can come back and verify what is saved here.

We can see the providers registered in our system, in this case the NFS provider. Providers are registered along with their capabilities; you can see this one is registered for metadata backup, and its current status is available. Based on the provider, we create the backup location. This is like a storage class, where we say which provider we want to use and with what set of configuration.
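A backup location along those lines might look like this; the apiVersion, field names, and server address are illustrative assumptions, not taken from the demo:

```yaml
# Hypothetical BackupLocation (illustrative schema): binds a registered
# provider to a concrete target, here an NFS export path.
apiVersion: kahu.io/v1beta1
kind: BackupLocation
metadata:
  name: nfs-location
spec:
  providerName: nfs-provider          # must match a registered provider
  config:                             # provider-specific settings
    server: nfs-server.test-ns.svc    # assumed in-cluster NFS service
    exportPath: /backups
```

Like a StorageClass, it separates "which driver" from "which target", so backups reference a location by name rather than carrying provider details themselves.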
You can have different backup locations for the same provider with different sets of configuration, so that your driver or provider can realize the usage.

Now I want to show how we create a backup. Before that, there is a set of CRDs deployed for this project: CRDs for backup location management, provider management, backups, and restores. The VolumeBackupContent and VolumeRestoreContent CRDs are for the volume backup and restore part; think of them as intermediate information created and used during the volume backup process. Now let's move on. Sanil, am I audible? Yes, yes, go ahead.

Okay, let's see how a typical backup file looks. This is a backup spec where we give which namespace to back up and which kind of resource to back up. Under resources, you can mention the kind and even the name of the resource, to specifically back up a particular resource. There is a regular-expression option, so you can select all instances of a particular kind, or one specific resource. Here we are backing up the default namespace using the NFS location, for pods. Currently there is one pod and no deployment in the default namespace, so we expect this pod to be backed up.

Let's apply this. The backup "demo" is created, and we can see the CRD object here with its stage and state: the stage tells which phase it is in, and the state tells whether it completed, failed, and so on. You can see a volume backup location as well; in this demo we are showing only the metadata, because no external storage provider is integrated here, which is why the volume backup location is nil.
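A backup spec in the spirit of the demo could look roughly like this; the apiVersion and exact field names are our illustration (check the Kahu repository for the real schema):

```yaml
# Illustrative backup spec: all pods in the default namespace,
# metadata written to the NFS backup location from the demo.
apiVersion: kahu.io/v1beta1
kind: Backup
metadata:
  name: backup-demo
spec:
  metadataLocation: nfs-location   # which BackupLocation to write metadata to
  includeNamespaces: ["default"]
  includeResources:
    - kind: Pod
      name: ".*"                   # regex: every pod in scope
      isRegex: true
```

Setting `isRegex` to false and giving an exact name would narrow this to a single pod, matching the granular-scope case shown later in the demo.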
We can go through the events and check exactly what happened to this backup and how the backup flow ran; that's the backup description. Now we can go to the NFS server and check whether the backup has arrived. We can see that our tar file was created on the NFS server side and that it has backed up the pod. That's the backup part.

Now let's check the restore part. Right now there are no restores in the system, so we'll create one. During restore we can specify a set of parameters: which backup to restore from, and other customizations. Here we have given a namespace mapping, and we can also give a prefix to identify the restored resources, which helps if you restore into the same namespace. In this example we just mapped the namespace. We created the restore resource, and it shows finished and completed, so we expect a new namespace to be created as part of the restore process. The restore namespace is created here, and we expect a new pod to come up in it; we can see a BusyBox pod coming up in the restore namespace.

Like I mentioned, that is one backup use case. If you want to back up a specific resource, you can give a specification like this, where you name which pod to back up and set the regular-expression flag to false to pick up that particular pod. You can also give hooks if you want: a set of pre-hooks and post-hooks, which will be honored during backup and restore.

So that's a short demo of Kahu. Over to you, Sanil; I can stop sharing. Yeah. So this is what we are currently working on: we are developing and verifying with some providers, and then slowly adding more providers to our framework.
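The restore from the demo could be sketched as follows; again, the apiVersion and field names are our illustration rather than the exact Kahu schema:

```yaml
# Illustrative restore spec: restore from backup-demo, remapping the
# default namespace into a fresh one and tagging restored resources.
apiVersion: kahu.io/v1beta1
kind: Restore
metadata:
  name: restore-demo
spec:
  backupName: backup-demo          # which backup to restore from
  namespaceMapping:
    default: restored-ns           # resources from "default" land here
  resourcePrefix: "restored-"      # optional: identifies restored resources
```

The namespace mapping is what lets the restored BusyBox pod come up in a new namespace instead of colliding with the original, and the prefix serves the same purpose when restoring into the same namespace.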
So what's next for Kahu? On the basic-support side, we plan to take up cross-cluster backup and restore. On the provider side, we currently support the pluggable aspect of the storage provider framework, and supporting multiple backup providers at runtime is something we will take up immediately, along with adding more backup providers. As I mentioned, since we have not yet integrated any volume providers, we were not able to show the volume backup part; we will integrate CSI drivers or some open-source storage, whichever is feasible, so that anybody can use it directly and get a feel of it. On the user side, the next key thing is usability: giving the user more flexibility in controlling backups.

Thanks, Sushant, for the detailed demo and the description of the future plans. Let's quickly wind up. As we discussed, these projects are in their initial phases and active development is happening, so we welcome developers of all skill levels. Even if you're a student, please join us: join the Slack, say hi, and show your interest to contribute, and we'll take it forward from there. Now is the right time to collaborate, because you can join the container data management initiatives and also the SODA data lake initiatives if you're interested.

Going forward, what we see in data management is mutually complementary solutions, with in-box and out-of-box features coming together. We also see storage as a service changing to container storage as a service, because most deployments are moving towards containers, and cross-cluster, cross-cloud, and cross-domain will just be normal requirements today and in the coming days.
Most solutions we design or develop need to consider these aspects, whether hybrid cloud, multi-cloud, or container/cloud native. Storage vendors are coming up with solutions for hybrid scenarios, which is where they take open-source and third-party solutions to work out how to provide end-to-end solutions to their users. And incidentally, cloud vendors are moving towards the enterprise just as storage vendors are moving towards the cloud, because the boundaries between on-premise, cloud, and edge are narrowing; no type of vendor can restrict itself to one domain anymore. That is the scenario we see in the industry, and at SODA Foundation we will continue our efforts to build completely open-source data management solutions for hybrid data management challenges.

Thank you so much for listening, and if you have any questions, please get in touch with us on any of these channels. Thank you so much; have a nice day, and enjoy the rest of the conference.