All right, welcome everybody. I think we're going to go ahead and get started. I'd like to introduce myself: Brian Davis. I work at Onyx Enterprise Solutions as Vice President for Cloud Services and Cloud Delivery. I've been here about seven months, and I come from a large financial institution. I want to talk today about OpenStack and backup. It's probably the first time you've ever had those two words in the same sentence. So the questions are: backup, data assurance, data protection, do we care? This is the cloud, after all. If you think about the things we've been doing in OpenStack or cloud delivery, backup was never a consideration. The community and corporate IT believe workloads in the cloud should be elastic, they should be ephemeral. If you've been to any previous summits, you didn't hear anything about backup. Cloud workloads are ephemeral; we do not need backups, right?

Here's a conversation I was actually a part of when we were first introduced to cloud. The idea was, hey, we're going to redo all of our applications, they're all going to be elastic, they're all going to be cloud aware. There were a couple of SAs sitting around the table, and one said, do we have any cloud-aware applications? The other guy goes, I don't think so. I don't think they even knew what that was. But they asked, hey, with the move to the cloud, how do we back up the data if there are no backups out there? And the answer was, don't be silly, this is the cloud, we don't need backups, okay? So IT organizations face this question: do we need a backup solution? No, of course not. All the apps that land in our cloud are cloud aware, they're elastic, they are resilient. How many people agree with that? How many people here in this room have large enterprise environments? Show of hands. A few of us. Mid-size? Small? Okay, so all of you that have environments: every app that you move into the cloud is cloud aware, cloud resilient, elastic. They know how to grow and shrink, right? If that's the case, you're doing great; you're probably the 1%. I know in my experience, when we brought our cloud up, when we brought up OpenStack, we moved in all of our legacy stuff from our legacy environments, and we were still living by, hey, we don't need backups. Well, one day we had a business impact. A business partner, the customer, comes to us and says, hey Brian, we need to restore this data. And all of us in the room just looked at each other like, oops, okay?

Cloud and virtualization in the enterprise require data protection, period. And you notice I said data protection and not backup, okay? Most private cloud environments still host legacy applications and legacy workloads. They have no ability to take advantage of cloud elasticity, and cloud and OpenStack lack some of the virtualization features our legacy hosting environments had. So the question is backups versus data assurance. The legacy thought is that backups were modeled around how to capture and store point-in-time data. You install an agent and back up some data. If you have a failure, you go back an hour or two and restore the data or the VM or whatever. Well, an hour or two of lost data, if you think about it, may cost some environments millions of dollars, okay? So the current thought is that we need to be looking at data protection. Data protection is modeled around availability and recovery.
Some of the things we're going to talk about today: image snapshots, flexibility, fault tolerance, scalability, and instant recovery of a VM or an application or your data. So the question is, how do we solve the enterprise need for data protection in our OpenStack clouds? That was a question I had in my previous roles, and it's a question I work on with my customers today. I want to introduce Trilio Data; we're going to hear from Trilio Data about exactly how we do that.

My name is Murali Balcha. I'm CTO of Trilio Data. Thanks, everyone, for showing up. This is one of the first sessions of the summit, and I'm very excited about that. So how is it going so far? Good? Thank you. I've been coming to these summits for a while now, and I see this growing leaps and bounds. I think this time it probably touches 8,000 or 9,000 people. Very good.

We discuss backup and recovery on and off, and some of you raised your hands saying that you have production-ready workloads running in OpenStack. How many people have dabbled with the challenges of backup and recovery, show of hands? So how is your experience so far? Good? All done? So, you know, there are a few APIs available in Nova, right, or in Cinder. Depending on how you lay out your workloads, you may be able to put together some solution, or you may put together some solution purely based on storage-based snapshots. But at the end of the day, when it comes to real enterprise backup and recovery, putting together a solution based just on the Nova APIs, the Cinder APIs, or volume snapshots themselves is going to be challenging, right? So this is the problem that we are trying to solve here.

So when you look at OpenStack (I'm still getting used to this clicker, so please excuse me), when you are considering backup and recovery for your cloud, what is the thing that comes to your mind? What do you think backup should look like that is different from what you've been doing for bare metal, or different from what you've been doing with VMware? I pose this question to various people when we go to a customer site and they say they need backup. What does that mean? They say, okay, I need to back up the controller, right? Obviously, that's the first answer I get. And then when we start digging deeper, the discussion immediately comes back to: they really need to back up tenants, right? That is where the business value is. Most of the controller database, obviously, you can recreate with your DevOps scripts. And any snapshot you take at the database level in the controller may not reflect what's happening in the tenant space, because at the moment you take the snapshot, some tenant may have created a new resource or deleted a resource. So the snapshot you have may not reflect what is out there, right? Even though you have a record of what is in the tenant space, you don't have all the information. So the other way is, well, try some well-known solutions out there, right? Backup and recovery is not something new, unlike containers. But this is something that every IT organization has to implement based on the kind of workload they are running.
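To make the do-it-yourself path concrete, here is a minimal sketch (with an assumed Keystone endpoint and credentials) of the kind of per-resource scripting that approach leads to: every VM image and every volume has to be snapshotted with a separate, uncoordinated call, and nothing ties the pieces together into one consistent point in time.

```python
# A minimal DIY sketch, not Trilio's implementation: snapshot each piece of an
# application one resource at a time using the stock OpenStack clients.
from keystoneauth1.identity import v3
from keystoneauth1 import session
from novaclient import client as nova_client
from cinderclient import client as cinder_client

auth = v3.Password(auth_url="http://keystone:5000/v3",   # assumption: your endpoint
                   username="demo", password="secret", project_name="demo",
                   user_domain_id="default", project_domain_id="default")
sess = session.Session(auth=auth)
nova = nova_client.Client("2", session=sess)
cinder = cinder_client.Client("3", session=sess)

# Each server image and each volume snapshot is a separate API call.
for server in nova.servers.list():
    nova.servers.create_image(server, "%s-backup" % server.name)

for volume in cinder.volumes.list():
    cinder.volume_snapshots.create(volume.id, force=True,
                                   name="%s-snap" % volume.name)

# Nothing here captures flavors, networks, or security groups, and the
# snapshots are not taken at a single consistent instant across the app.
```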
So you start with what you have right now and then do a lot of scripting to put together a solution that works for your production workload. But something changed, right? Something changed; otherwise, we wouldn't be embracing OpenStack. Something radically changed compared to what we've been doing for the last 20 years in IT, and it is forcing us to take a completely new look at how data is protected. What are those changes? There are two things.

One, and Brian talked a little bit about it, is the highly distributed nature of the workloads. It's not like you want a high-performance database, so you buy the biggest, beefiest server out there, load up your database, and start backing up. These are scale-out applications. That means you provision not one VM or one volume; you provision multiple VMs, multiple volumes, and your data is spread across those VMs. So what good is backing up individual files within a VM when you don't have the complete picture?

And the second one is the operating model, the cloud operating model. You have multi-tenancy, you have agility, and you have elasticity. I was talking to one of my IT friends whose day job is to take a backup of SAP and the Oracle database running under it. For compliance, every six months they have to go and test the backups that run at night. So I asked him, how do you test it? The first thing they do is identify a standby server with an identical configuration, take a six-month-old copy, load it up, and make sure the application comes online. Well, you do need to test the backups, you do need to make sure you're still compliant, right? But there is a better way to do this, because you don't need to keep a standby server. It is cloud; it is agile and elastic. You have all the resources out there to do a better job of meeting your compliance needs. You don't need to do it the old way. So these are the two things that are predominantly forcing us to take a completely new look at how you do data protection in the cloud.

So let's dig a little deeper into those two aspects. One is scale-out distributed workloads. This is a typical flow: you have a Heat template, or some orchestration tool with an application template, essentially a blueprint. It defines what VMs to provision, what flavor of VMs, what networks to configure, what storage volumes to add. And once you have those things, you have some Puppet scripts running in each of those VMs to do the configuration management, to tweak the application configuration, to install some packages. So you start with the basic provisioning, and over a period of time you fine-tune your deployment to make sure the application meets all your operational parameters. Now your applications are up and running, and after some time, obviously, you need to scale out. It's not going to stand still. You probably scale out in terms of users or in terms of database capacity. Now, how do you define a point in time in these distributed workloads? It's not a list of files. A point in time is the whole state of the application as it was at that moment, if you want to go back to it, right?
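As an illustration (using python-heatclient and a hypothetical stack name), simply enumerating everything a Heat stack provisioned shows how many resources make up one "point in time" for a scale-out application:

```python
# Sketch: walk a Heat stack and list every resource that would have to be
# captured together for a consistent point in time of the application.
from keystoneauth1.identity import v3
from keystoneauth1 import session
from heatclient import client as heat_client

auth = v3.Password(auth_url="http://keystone:5000/v3",   # assumption: your endpoint
                   username="demo", password="secret", project_name="demo",
                   user_domain_id="default", project_domain_id="default")
sess = session.Session(auth=auth)
heat = heat_client.Client("1", session=sess)

stack_name = "my-scaleout-app"   # hypothetical stack
for res in heat.resources.list(stack_name):
    # Servers, volumes, ports, security groups: all of it is application state.
    print(res.resource_type, res.resource_name, res.physical_resource_id)
```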
And then you want to quickly test whether you've got the right backup, for various reasons. So the data is there: with traditional backup, you can back up your files to your heart's content. But what about the other things that make up your application, the things that define your application context? The CPU power, the network configuration. If you had opened some ports back then, do you remember the security groups you applied to those VMs at the time? Going back to a point in time includes standing up the whole thing so that you can verify your backup.

So if you want to use your traditional methods to do that testing and make sure your backups work, what do you do? You need to consolidate a lot of files that you backed up over a period of time. Depending on how the backups are set up, whether they're incremental or full and how frequently you take the full backup, you need to do a fair bit of consolidation of all those files. Then you need to remember what images you booted from; some hints you may get from the Heat template you used. Next, you need to provision all the volumes with the right capacity, format them with the right file systems, and apply the necessary security groups. You need to do all of these things to recover that point in time. So what is the guarantee that you get everything right? What if you miss even a simple detail, like not having the right patch in a VM or not applying the right security groups? All those things matter a lot, especially when you are under pressure to recover your application to a point in time; every little thing affects how much time the recovery takes. And how confident are you that this recovery is going to work? Is this a proven, repeatable process? Some of this you might do manually, some of it you might automate with scripts that copy the data around.

So for this kind of workload, what would be the right solution? An ideal solution would be: obviously, I don't want backup policies on a file basis or on a volume basis. I want my backup policies on the entire application. And the entire application could be one VM with multiple volumes, or multiple VMs, multiple volumes, multiple networks, whatever your application requires. Your data protection policy has to be applied there, and your backup solution needs to keep track of what is being changed in the environment, so you always have the right point in time. And your backup API should not involve backing up a lot of individual things; it should provide one API, no matter how complex your application is. And the other thing, obviously: the restore is the most important part. You can back up to your heart's content, but if the restore doesn't work, what good is it? So your restore has to be just as simple. You should be able to restore the entire workload you backed up, a multi-VM, multi-network, multi-volume application, to that single point in time with ease, whether through one API or with one click in the GUI.
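The "one API for the whole application" idea might look something like the following sketch. The endpoint, payload fields, and IDs here are hypothetical illustrations, not a documented interface; the point is that a single call covers the whole multi-VM workload.

```python
# Hypothetical sketch of a workload-level backup API (endpoint and fields are
# illustrative, not a published interface).
import requests

BASE = "http://backup-service:8780/v1/tenant-id"   # assumed endpoint
HEADERS = {"X-Auth-Token": "gAAAA..."}             # Keystone token

# One policy for the whole application: every VM, volume, and network in it.
workload = {
    "name": "billing-app",
    "instances": ["vm-1", "vm-4", "vm-9"],         # illustrative VM IDs
    "schedule": {"interval": "24h", "retention_days": 30},
}
resp = requests.post(BASE + "/workloads", json=workload, headers=HEADERS)
workload_id = resp.json()["workload"]["id"]

# One call to snapshot the entire point in time...
requests.post(BASE + "/workloads/%s/snapshots" % workload_id, headers=HEADERS)
# ...and one call to restore all of it, rather than reassembling files,
# volumes, images, flavors, and security groups by hand.
```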
And obviously, when you simplify how you back up and how you recover, you have high confidence that you can repeat the process. So this is one of the dominant things affecting how you protect your workload. And what about the cloud infrastructure? Traditional applications are centrally administered: a central administrator sets up the solution, and you run agents in the VMs or on your hosts that back up your files. Now, with the multi-tenancy model in the cloud, it is all about empowering your tenants. As a cloud administrator, how much can you keep track of which applications each tenant is running? It's not possible. It's not even the cloud model. So you need to empower your tenants with the right service in the cloud so that they own the responsibility for their data protection, because they understand what applications they're running and where the application boundaries are: which VMs correspond to a particular application, and therefore which VMs to include in their data protection policy.

Next is agent-based file backups. Yes, they work, but they don't really understand the infrastructure underneath the application. Unless your backup solution is at the infrastructure level, unless it understands the VMs and the flavors, Glance, the Cinder volume types, and the networking, whether there is a private network between the VMs and a public interface to the application, you either have to maintain all of that information somehow, or your backup solution needs to understand and record it so it can orchestrate the whole recovery in a single step.

What about scale? Cloud is all about horizontal scale. You may start with 100 VMs and grow to 10,000 VMs, or you may start with 10 nodes and go up to 100 or 200 nodes. How do you make sure your backup solution scales with your cloud? Most of the solutions out there are fixed capacity; they come as, say, a 100-terabyte appliance. What if you grow beyond that? You can buy one more appliance, and now you have two appliances to manage. So it doesn't really scale. Your solution needs to scale with your tenants, whether you grow from 10 tenants to 20 tenants, or with your compute, or with anything else that grows as part of the cloud.

And then DevOps, a very, very critical part. I put that joke there, like "you must be kidding," but it is true. DevOps is a relatively new phenomenon, but most of the applications out there have been developed over the last 20 years, and the DevOps paradigm doesn't really fit the traditional solutions. Your backup solution needs to be part of your cloud. Just like you are automating your compute, your volumes, your storage, and your network, you should be able to deploy, upgrade, and automate your backup solution with the same DevOps tools. You shouldn't have to think twice or maintain a different management plane for backup and recovery.

So with those goals in mind, back in 2013 we proposed a specification in OpenStack called Raksha. We did that and then realized, oops, we were too early for OpenStack.
When we talked to a lot of people, the reality was that it was very, very early for OpenStack; not many people had really deployed it, and backup and recovery was not at the top of their minds. So we pulled back a little bit and commercialized our product based on the specification we put together in OpenStack. The company was founded about two, two and a half years back. We have a pretty good background in virtualization, backup and disaster recovery, and cloud infrastructure, and my CEO has been involved with numerous IPOs and acquisitions. So that's a brief note about the company; let's talk about the solution.

Just like any other service you use in OpenStack, the compute service or network or image service, ours is backup and recovery as a service. It has the same look and feel as any other service. It has Python-based wrapper APIs and a RESTful API to define your backup jobs and then manage them. And the form factor for us is a QCOW2 image that we ship; it's an Ubuntu-based image, and all our IP is in there. It includes the scalable backup engine. It is scalable in the sense that each VM obviously comes with a finite backup capacity, but if you are growing your cloud, you can instantiate multiple VMs from the same image and scale the solution. It is truly multi-tenant: we register in Keystone as backup as a service, we authorize the tokens, and we perform the work on behalf of each tenant. And deploying our solution is pretty much non-disruptive. Whether you already have a running cloud or you are starting from the ground up, it's a drop-in solution. I'll briefly talk about our architecture and its components; none of the components are disruptive.

So this is what I've been preaching for the last few slides: backup has to go beyond a file backup or a volume backup. It has to be at the environment level. Better yet, if the solution understands your application, discovers the resources your application is using, and pulls those into the backup policy, that's an ideal solution. We can do that for well-known applications like Cassandra or MongoDB, where we can discover which VMs those applications are running on. But even otherwise, you can group all the related VMs together as one backup job, and we discover the resources mapped to each VM, for example, which networks are mapped and which Cinder volumes are attached. And we take a backup of that entire environment; we call it an environmental snapshot. It includes everything about your application.

We support incremental and full backups. Initially, it's a full backup: we back up all the VM images, the Cinder volumes, and all the network configurations. In subsequent backups, we calculate what changed in the environment. If a new Cinder volume was added to that logical environment, we back it up as a full, and the rest of the existing resources as incrementals. So we keep track of what changed in the environment and back up those changes. Our vision is to leverage what is out there in the cloud, the cloud capabilities like elasticity and agility.
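For a sense of what "registering in Keystone like any other service," as mentioned above, involves, here is a minimal sketch using python-keystoneclient; the service name, type, and URL are illustrative assumptions, not Trilio's actual values.

```python
# Sketch: how a backup-and-recovery service might register its endpoint in
# Keystone, the same way Nova or Cinder do (names and URLs are illustrative).
from keystoneauth1.identity import v3
from keystoneauth1 import session
from keystoneclient.v3 import client as ks_client

auth = v3.Password(auth_url="http://keystone:5000/v3",
                   username="admin", password="secret", project_name="admin",
                   user_domain_id="default", project_domain_id="default")
keystone = ks_client.Client(session=session.Session(auth=auth))

# Create the service entry and a public endpoint for it.
svc = keystone.services.create(name="workloadmgr",        # assumed name
                               type="workloads",          # assumed type
                               description="Backup and recovery as a service")
keystone.endpoints.create(service=svc, interface="public",
                          url="http://backup-vm:8780/v1/%(tenant_id)s",
                          region="RegionOne")
```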
And we want to redefine how data protection is done in the cloud. Obviously, we need to support the bare-minimum options, like backing up a VM and recovering a VM, or backing up and recovering single or multiple files within a backup job. But since we sit at the infrastructure level, we understand how your zones and regions are laid out, so we support additional use cases, for example, restoring to a new availability zone. You may have production running in one availability zone and a test setup in a different availability zone, and now you want to test against production by taking a copy of production and restoring it to the new availability zone for test purposes. We take care of whether there's a different kind of network or a different volume setup there: we translate the backup image's metadata into what is available in the new availability zone and restore the entire application onto it.

What about disaster recovery? What if you have two different clouds, and you want to keep the backups but replicate them offsite, so that when needed, you can leverage the remote cloud and restore your applications there? We support that use case too.

The fourth use case is instant restore. How many people have used the guestfish tool? I think many of you are familiar with it. It's a nice tool: no matter how your VM and its volumes were created, you can use guestfish to explore the composition of the VM, whether it has LVM volumes, what file systems it has. You can explore all of that, and you can also tweak or fix any configuration changes using guestfish. It's a very popular tool. What if your data protection solution provided that kind of capability? Say you have a 10-terabyte backup out there, and you want to quickly re-spin it as a workload, log in, and explore a few things. With instant restore, which we are working on, you don't need to copy the data back to production or restore the entire thing; you can quickly re-spin all the VMs out of that backup image and explore them. For example, if you have data encrypted inside the VM, regular tools can't just open it up, but this way you can spin up that VM and explore your point-in-time copy, and it also enables migration use cases. So it goes beyond just doing a file backup; it is about taking a logical snapshot and then playing with it.

On our architecture, as I briefly mentioned, we ship a QCOW2 image which can be deployed on a standalone KVM box, or uploaded into Glance and instantiated under one particular tenant. The most popular deployment is a standalone KVM host with a VM created from this image. You can spin up as many VMs as you want based on the size of your cluster, and just like the Nova and Cinder schedulers, we have a scheduler, round-robin-based right now, to load-balance backup jobs among these multiple VMs. And since our VMs are completely stateless, one can crash and burn, but you can spin up a new VM and your backup service continues.
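Since guestfish came up a moment ago, here is a small sketch of that same kind of exploration through the libguestfs Python bindings (the image path is illustrative). This shows the generic tool he references, not Trilio's instant-restore feature itself.

```python
# Sketch: inspect a backup image read-only with libguestfs (the Python API
# behind guestfish). The image path is illustrative.
import guestfs

g = guestfs.GuestFS(python_return_dict=True)
g.add_drive_opts("/backups/vm1/latest.qcow2", readonly=1, format="qcow2")
g.launch()

# Inspect for installed operating systems and mount the root filesystem.
for root in g.inspect_os():
    print("OS root:", root, g.inspect_get_product_name(root))
    g.mount_ro(root, "/")
    # Poke around the point-in-time copy without restoring anything.
    print(g.ls("/etc"))
    g.umount_all()

g.shutdown()
g.close()
```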
Now, if you have played with some of the APIs available in OpenStack, you know that depending on the configuration of the VM, whether you are booting from local disk, from Ceph, or from a Cinder volume, the APIs and snapshot operations work a little differently. Putting together a complete solution with those APIs is very, very difficult; there are gaps in the way the OpenStack APIs are defined when it comes to backup and recovery. In order to fix those gaps, we defined a Nova extension. It is a small Python module that needs to be installed on your compute nodes. The basic function of the Nova extension is to take a backup of a VM running on that compute node, create a backup image, and copy it to the backup media, in this case NFS or an object store. During a restore operation, it does the same thing in reverse: it takes the copy from the backup media and restores the contents to the compute node. So each Nova extension is responsible for managing the backups of the instances running on its own compute node.

Scale is important when you are deploying a solution in OpenStack. Obviously, you don't want to introduce any bottlenecks, and your architecture should also enable you to scale. In our case, since each extension backs up the instances running on its own compute node, you can essentially scale the solution without introducing bottlenecks: the backup engine doesn't do much I/O beyond writing the metadata, and most of the data transfer happens at the extension level. So you can scale this relatively easily compared to other solutions out there.

Now, we call our backup job a workload. A workload is nothing but a collection of VMs. We have a backup engine that identifies all the resources of those VMs and invokes various API calls to back them up. For example, backup job one has VM one, VM four, and VM nine. When it is orchestrating, it invokes the right hooks on each of those compute nodes, and that's how the entire backup job for that group of VMs is orchestrated.

Okay, so the other important thing is that we don't use any proprietary format for storing our backups. We standardize everything on QCOW2 images. That is a very popular and very standard format, not only for VM images but also for disk images, and there are thousands of tools out there that help you manage QCOW2 images. Our base image is a QCOW2, and all the incrementals are QCOW2 images; on the backup media, they are well-formed QCOW2 images. For example, the latest QCOW2 image has a backing reference to the previous incremental image, so you can always run qemu-img info on the latest backup and walk through the whole chain. And you can get fancy: you can also mount a QCOW2 image to look at the contents. The other thing is, once you take a full backup, you never have to take a full backup again, because we synthesize the full backups at the back end. It's as simple as using the qemu-img commit (block commit) command to merge two backup images into one. That's how we move your retention window.
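Both of those qemu-img mechanics, walking the backing chain and synthesizing fulls, can be sketched in a few lines. File names are illustrative, and this is the generic qemu-img mechanism as described, not Trilio's exact code; it assumes a chain with at least two incrementals on top of a full base.

```python
# Sketch of the generic qemu-img mechanics (paths illustrative): walk an
# incremental chain, then merge the oldest incremental into the base.
import json
import subprocess

def backing_chain(path):
    """Yield the chain newest-first by following backing-file references."""
    while path:
        info = json.loads(subprocess.check_output(
            ["qemu-img", "info", "--output=json", path]))
        yield path
        path = info.get("full-backing-filename") or info.get("backing-filename")

chain = list(backing_chain("/backups/vm1/day30.qcow2"))
print("chain, newest to oldest:", chain)

# Retention: commit the oldest incremental down into the full base image, so
# the base advances one day and a new full backup is never needed.
oldest_incremental, base_full = chain[-2], chain[-1]
subprocess.check_call(["qemu-img", "commit", oldest_incremental])

# Point the next incremental at the base instead of the removed image
# (an unsafe rebase only rewrites the reference; no data is copied).
subprocess.check_call(["qemu-img", "rebase", "-u", "-b", base_full,
                       "-F", "qcow2", chain[-3]])
```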
Say you have a 30-day retention window for which you want to keep all the backup copies. On the 31st day, we combine the two oldest backup images into one. That's how we implement the retention policy. And our restores are also pretty easy, because we don't have to aggregate all the incrementals and the full backup into one in a staging area before restoring. We can take any point in time within that chain and copy it into a volume or into a VM image, so our restore operation is also very efficient. And lastly, we have something called instant snapshot mount: since these are QCOW2 images, we can mount them as a device, discover the file systems in them, and expose the files at that particular point in time.

We support NFS and Swift as backup media, and on the source side, iSCSI and Ceph-based Cinder volumes; we are actually working on integrating with other third-party storage arrays. So it's a single backup and recovery service, and we have a Horizon plugin that helps tenants manage their own backup jobs. We have the RESTful API and a CLI wrapper, and we also have Ansible playbooks to deploy and manage the backup solution. It is a drop-in solution, so it's non-disruptive; you can onboard it on a running OpenStack. The only thing is, since we register an extension with Nova, we need to restart the Nova API service, but usually people run two or three instances of the Nova API service, and restarting them one after another doesn't cause much disruption.

Quickly, some screenshots, just to give a flavor of what our solution looks like. (How am I doing with the time? Is it okay? Okay, excellent.) Once you boot the QCOW2 image and give it the admin credentials, we register in Keystone as a backup and recovery endpoint. Once it finishes configuring the solution and you install the Horizon plugin, we introduce a tab called Backups with workload tabs, and as a tenant you can create a new backup job: create a new workload, add the VMs, set some policies for how frequently you want to take a backup, and it does the whole thing. Once you have a backup, you have details of what the backup job contains, the list of backups it has taken, and how frequently the backup is being performed. When you dig down into a particular snapshot, it gives you more information about what was backed up as part of the backup job. In this case, it's a four-VM backup. Just to keep it interesting, I created various flavors of VMs: some boot from Ceph, some boot from local disk, some have volumes mounted, some have multiple network interfaces attached. We capture everything.

And say you have tens of backup images and you want to retrieve only a few files from a particular point in time; you don't have to restore the entire backup job. We support a mount snapshot: once you choose the mount snapshot, we provide an explorer view into that particular point in time, and you can download a file or a bunch of files for file-level recovery.
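The generic mechanism behind mounting a QCOW2 backup for file-level browsing (not necessarily Trilio's implementation) can be sketched with qemu-nbd; device names and paths are illustrative, and it needs root privileges.

```python
# Sketch: expose a QCOW2 backup as a block device with qemu-nbd and mount it
# read-only for file-level browsing (device and paths illustrative).
import subprocess

image = "/backups/vm1/day15.qcow2"
subprocess.check_call(["modprobe", "nbd", "max_part=8"])
subprocess.check_call(["qemu-nbd", "--connect=/dev/nbd0", "--read-only", image])
subprocess.check_call(["mount", "-o", "ro", "/dev/nbd0p1", "/mnt/restore"])

# ...browse or copy individual files out of /mnt/restore...

subprocess.check_call(["umount", "/mnt/restore"])
subprocess.check_call(["qemu-nbd", "--disconnect", "/dev/nbd0"])
```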
And if you want to restore to a new availability zone, we have something called selective restore, where you have more control over the backup job and how you want to restore it: whether you restore to the same availability zone or a different one; whether you map it to a new network so it doesn't interfere with production; whether you choose a different volume type than the one we backed up from. If you want to exclude some of the VMs from the restore, you can, or if you want to change the flavor of a VM for the restore, you can change it. So we give you a lot of control over how you restore from a backup job.

In summary, we went through the data protection challenge in OpenStack, and we are introducing TrilioVault as backup, or rather data protection, as a service, more than just backup. We take environmental snapshots, not just a file or a volume. We have one-click backup and one-click restore. It's completely tenant-driven, and it is forever scalable. Just like any other service, it is one more service that you plan and deploy. There is one more demo presentation happening on Wednesday; my colleague Giri is going to drive that one. And we have a booth, 8.22, so stop by if you need more information. Okay, any questions? If you have a question, can you please go to the mic? There.

You specifically mentioned iSCSI-based Cinder volumes. Any reason why? Because typically, whatever Cinder volume backend you have deployed, you're going to go through the Cinder driver, the driver plugin of that particular block storage backend. And if they have FC, for example, as the preferred or supported data plane, how does that change what you have to offer on top? It should be pretty transparent whether you're coming through iSCSI or FC. Your slide specifically mentioned iSCSI, so I was wondering why.

Right, right. So I think you touched on a couple of things there. I mentioned iSCSI, but there are a few things we do better by tightly integrating with, for example, Ceph. Ceph has a very good API for calculating the diff between two snapshots, and we create a QCOW2 from that. The default implementation is to compare two snapshots block by block and calculate the diff. We really want to integrate tightly with storage vendors and calculate the diff through their APIs. As for your question: we leverage the same mechanism that Nova uses for accessing the volume, whether it's a Ceph volume, a NetApp NFS-based volume, an iSCSI-based volume, or an FC-based volume. We use the same connection string that libvirt uses to read the contents. So the protocol doesn't matter to us; it is about accessing those two snapshots and calculating the difference.
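As a rough illustration of the fallback he describes (the generic block-by-block path, not the Ceph-API integration), a diff between two point-in-time copies might look like this sketch, with paths and block size as illustrative assumptions:

```python
# Sketch of the generic fallback: compare two point-in-time copies block by
# block and record the changed extents (paths and block size illustrative).
BLOCK = 64 * 1024  # 64 KiB comparison granularity

def changed_extents(prev_path, curr_path):
    """Yield (offset, length) extents that differ between two snapshots."""
    with open(prev_path, "rb") as prev, open(curr_path, "rb") as curr:
        offset = 0
        while True:
            a = prev.read(BLOCK)
            b = curr.read(BLOCK)
            if not a and not b:
                break
            if a != b:
                yield offset, max(len(a), len(b))
            offset += BLOCK

for off, length in changed_extents("/dev/mapper/snap-old", "/dev/mapper/snap-new"):
    print("dirty extent at %d (%d bytes)" % (off, length))
```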
Can you go into a bit of detail about how you provide disaster recovery? In other words, replicating the backup images site to site and then recovering at the secondary site.

Absolutely. It all goes back to taking the logical environmental snapshot. We capture the context: what the VM flavor is, what Glance image it was booted from, what the network types are, all those things. Every time we take a backup, we capture all that information. So if you replicate that entire backup image to the remote site, you get not only the makeup of the application but also all the individual snapshots. Our CLI supports something called import functionality: essentially, it imports the backup, all the metadata. Once you import it, you can run the selective restore through the wizard, and that is when we discover what is out there on your new cloud, whether it's the networks or a different kind of storage type. Once you choose how you want to map your backed-up resources to the new resource types, we go and restore the data. We don't do the replication itself; we expect the underlying storage to do the replication.

Any other questions? Okay. Thank you very much. Thank you.