Good afternoon. I'm Venkat Jugana, an IBM Distinguished Engineer. Today I'm going to talk about on-demand disaster recovery service enablement through software-defined environments in hybrid clouds. A couple of my colleagues couldn't join me, so I'll take you through this presentation. As a quick recap, this is about adopting software-defined environments with OpenStack to establish DR environments in the cloud for a traditional IT setup: how we can automate some of that, how the core software-defined components (compute, storage, and network) can be leveraged, and what experience and learnings we gained from this experimentation. In terms of the agenda, I've split it into four parts. First I'll lay out the underlying problem scope, then why software-defined environments and how they can be adopted to address this DR problem, then how OpenStack together with them can solve that problem, and finally I'll take you through the proof of concept we have done, with a demo. If you look at disaster recovery today, IT has become integral to businesses in general; any disruption to IT has serious business consequences, and when a disaster happens the impact is much greater. To address that, we need recovery mechanisms that keep the business applications running even under those situations. There are two key parameters from a disaster recovery perspective: RTO and RPO. The recovery time objective (RTO) is what we address in this experimentation through software-defined environments.
As you can see in the diagram, a lower RTO is always critical, but it is not easy to achieve: it always takes time to bring back the infrastructure at a secondary site, to bring up the workload and the application, and to replicate the data from wherever it is backed up. A lower RTO gets the application up and running sooner, so that is where our focus is going to be. In terms of the events that can happen, we typically think of major disasters like flooding impacting IT, but in addition there are a hundred other events that can impact the IT infrastructure or data center environment; we don't usually hear about them, but they can happen too. For the problem statement, I'd like to take a scenario where the customers do not have dedicated DR infrastructure. Many enterprise customers, small and medium and even some large ones, may not have dedicated infrastructure and instead rely on third-party service providers to help them establish DR. The service provider provides the DR setup, performs the setup, and gets the application environment running for them. In this case the data needs to be replicated and restored from the backup media, whether that is tapes or a virtual tape library, and the environment needs to be stood up. The bottom line is how fast you can bring up this environment and get the application running so the business can continue, which is very, very important.
Further, from the DR service provider's standpoint, there are a few challenges to address. The typical challenge is that the process of standing up an environment that replicates the customer environment for the workloads the customer chooses is largely manual. Even with some level of automation, it still takes quite a bit of time, and that really needs to be addressed. Customers using a DR service want assurance that the provider has the capability to replicate the environment when a real disaster event happens, so they run simulations. During a simulation you stand up the same environment and go through the same process as when a real disaster event occurs. The DR service provider typically takes a lot of time to do that, sometimes more than a day or two, to get the workload up and running. That is the real challenge: the RTO, as mentioned, really needs to be improved. So what are the typical use cases where this happens? We identified a few scenarios. The first is where the customer does not have any dedicated DR infrastructure, so when a disaster event happens the service provider really needs to stand up the environment and get things running. To do that, the service provider needs the images of the workloads the customer is running, and the data itself; all of that needs to be brought to the DR side.
In use case 1A, the customer data can be restored from backup media, but the custom images the workload runs on may not be available at the DR site. Those need to be rebuilt: take a standard set of images, apply the customer-specific workload configuration, build the custom image, and get the workload up and running. Use case 1B is the same customer without dedicated DR infrastructure, but both the custom images and the data are available on backup media, so the provider can replicate the same environment directly. The second use case, which is common where better RTOs are needed, is where the data is replicated independently from the customer's primary site to the DR site where the service provider is running, and the images are available at the DR site as in scenario one: either standard images with the customer workload configuration applied, or custom images shipped to the DR provider. The third use case is active-active, where data replication is ongoing and the workload application is running at the DR site at the same time as at the primary site. These are the three use cases we looked at, and we built a proof of concept for use case scenario one to demonstrate how a software-defined environment can be leveraged. So what exactly is a software-defined environment? Today, when businesses request an IT service from the IT shop, it takes a lot of time.
The response to business demand is very slow because infrastructure and workload provisioning take a long time; depending on the complexity of the workload and the requested environment, it may take several days to several weeks. So we really need a different paradigm. The paradigm shift is that the infrastructure needs to be programmable: if the infrastructure can be programmed, it becomes much easier and faster to provision resources and deploy workloads. We really need software-defined environments, where through software you can program the infrastructure, whether compute, storage, or network, and on top of those infrastructure resources you also deploy the software components of whatever workload you pick. Within a software-defined environment we look at the entire stack, from the infrastructure all the way up to the workload components. From a top-down perspective, you take any workload, a web workload or analytics, anything, peel out the different workload components, define the policy and the resources required for each component, and from there orchestrate the provisioning of the underlying resources as well as the components. From the compute, storage, and network perspective, the software-defined control elements then optimize the resources as the workload changes over time during steady state.
Expanding this further: from the overall workload perspective, we define not only the overall workload but each of its components. A web workload, for example, has an HTTP component, a web application component, and a database component. Each component needs a certain set of underlying compute, storage, and network resources, so we define those for each component and hand them to the underlying modules, software-defined compute, storage, and networking, which then manage the respective resources. The traditional viewpoint is always bottom-up: you start from resource virtualization, and each of these tiers or silos is handled completely independently. With the software-defined viewpoint, we look at it top-down and at how all these components can be supported in an integrated fashion, provisioning the resources from an integrated perspective. What we found in this analysis is that for workload definition, orchestration, and provisioning, OpenStack is the key component that gives us the flexibility to operate and program the underlying infrastructure resources. So we used OpenStack as the core foundation of this control and management stack, which establishes whatever resources are required for each workload component.
Looking at this from a DR perspective: you have a traditional IT shop with the workload running at its primary site, and to bring that into a cloud-based environment in a hybrid fashion, we need to drive it through some kind of control and management plane. On the left side you see that control and management plane, with the orchestration and management layers and the underlying software-defined compute elements needed to drive the workload onto the target infrastructure. This common management and control plane can support multiple workloads for a single customer, or many different customers in a multi-tenant environment. As a service provider, you need to operate and support these multiple customers in a completely isolated manner so they don't conflict with each other. To simplify and automate this, because we want to improve the RTO, we need to move away from a manual, labor-driven process to an automated one. We used a Heat template to drive the entire workload deployment, covering not only the underlying resource provisioning but the software component deployment as well. The process is: first identify the set of workloads the customer wants DR-enabled, because not all workloads running at the primary site need DR, only the business-critical ones. Then, at the primary site where these workloads are running, extract the configuration, the metadata, of each workload, because you need to be able to replicate it at the DR site; that is very critical.
Once you have that metadata, you export it to the DR site; there are many different ways to do that. Then you use that metadata to create the Heat template, and we used certain tools to help create it. The way we do this is to look at each component of the workload; a web workload is the example I have here. For each tier in that multi-tier enterprise workload, we look at the OS images it runs on and what configuration it is running, extract that configuration, and use it to build the Heat template. That configuration then needs to be applied to the target environment as well, not only to the OS images but also to the software components. In use case 1A, which I talked about, the customer data is on backup media but the custom images are not available at the DR site, so we need to build those custom images. We take a standard set of images, whether RHEL or Windows, or AIX in the case of Power, apply the configuration discovered at the source (primary) site, and create the custom image. The same applies to the software components. You provision the underlying resources, the compute VMs, the storage, and the network, apply the specific configuration that was discovered, and then deploy the software components. This can be done in two different ways.
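To make the idea concrete, here is a minimal sketch of what a generated Heat template for one tier might look like in use case 1A; every name in it (the base image, flavor, and the configuration step in user_data) is a hypothetical placeholder, not the actual template produced by our tooling:

```yaml
heat_template_version: 2013-05-23

description: >
  Sketch: one tier of a customer workload rebuilt from a standard
  image plus the configuration discovered at the primary site.

parameters:
  base_image:
    type: string
    default: rhel-base            # hypothetical standard image name
  tier_flavor:
    type: string
    default: m1.medium

resources:
  app_server:
    type: OS::Nova::Server
    properties:
      image: { get_param: base_image }
      flavor: { get_param: tier_flavor }
      user_data: |
        #!/bin/bash
        # Apply the customer-specific configuration captured as
        # metadata at the primary site (placeholder command).
        /opt/dr/apply-discovered-config.sh

outputs:
  app_server_ip:
    value: { get_attr: [app_server, first_address] }
```

The point of the pattern is that the discovered metadata is baked into the template as parameters and user_data, so standing the tier up again is one stack-create operation rather than a manual build.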
In the simulation case, as I mentioned for scenario 1A, the chosen workload needs to be simulated at the DR site. Typically there is an out-of-band handshake: the customer informs the provider that they want to test a specific workload and simulate that environment, and the provider stands up the environment from the Heat template. The resources are allocated on a temporary basis; the customer tests for however long they want, then the resources and the infrastructure are automatically torn down and the test is done. In the case of an actual DR event, the customer informs the provider that they have had a DR event, and pretty much the same process is followed. The only difference in this case is that the customer data, backed up on some kind of backup media, whether tapes or a virtual tape library, has to be shipped to the provider, because the data has to be restored. We restore that data into the software-defined storage environment, and whatever volumes the data is restored to are brought into the software-defined environment we are going to provision. We do that through the Heat template: we identify the volumes where the actual data is located and put them into the template. That is the one difference from the initial template created for the customer workload; after we have that template, we apply this additional volume information into it, because that is very important to get the workload running.
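A hedged sketch of how a restored volume might be wired into the template: because the Cinder volume holding the restored backup data already exists, it is referenced by ID rather than created by the stack. The UUID and resource names here are placeholders:

```yaml
resources:
  db_server:
    type: OS::Nova::Server
    properties:
      image: rhel-db2-custom       # hypothetical custom image name
      flavor: m1.large

  restored_data_attach:
    type: OS::Cinder::VolumeAttachment
    properties:
      # UUID of the pre-existing Cinder volume that the backup
      # data was restored into (placeholder value).
      volume_id: 11111111-2222-3333-4444-555555555555
      instance_uuid: { get_resource: db_server }
      mountpoint: /dev/vdb
```

This is the "additional information" applied to the initial template: the attachment resource binds the customer's restored data to the newly provisioned VM at stack-create time.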
That way the customer workload has its data. In steady-state operation, based on workload demands, we may also have to adjust and optimize the underlying infrastructure resources. For that, some level of monitoring is needed, and then you can do horizontal or vertical scaling in each tier the workload runs in, dynamically scheduling the underlying resources, adding or deleting them as appropriate. Use case scenario 1B is where, in a real DR event, we have both the actual customer data and the images restored from the backup media; we take those specific images and use them to bring the virtual machines back up in the DR environment. For the proof of concept, we leveraged IBM Cloud Orchestrator and set up multiple tiers for the web workload: an HTTP tier, an application tier, and a database tier, using IBM WebSphere Application Server Liberty profile and DB2. In the Heat template, we set up these multiple tiers on separate network overlays. We used a network virtualization engine, NSX in this case, because the underlying virtualization environment is VMware; we set up the network overlays and, based on the overlay IDs, placed each tier on its own specific network segment. As for the data, as I mentioned, whatever data is restored into the software-defined storage through Cinder volumes, we reference and attach those volumes in the Heat template.
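A minimal sketch of placing one tier on its own overlay segment; in a Neutron setup the NSX-backed overlay shows up as an ordinary network that the server's port attaches to. The network UUID and names are hypothetical placeholders:

```yaml
resources:
  # Port on the application-tier overlay segment
  # (network UUID is a placeholder for the NSX-backed overlay).
  app_tier_port:
    type: OS::Neutron::Port
    properties:
      network_id: aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee

  wlp_server:
    type: OS::Nova::Server
    properties:
      image: rhel-base             # hypothetical image name
      flavor: m1.medium
      networks:
        - port: { get_resource: app_tier_port }
```

Repeating this pattern per tier, with a different overlay network ID each time, is what keeps the HTTP, application, and database tiers on isolated segments.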
In this particular case we used iSCSI-backed Cinder volumes, and we also used network function virtualization, primarily routing for east-west as well as north-south traffic, because the tiers need to communicate with each other and externally. All these capabilities are leveraged, the Heat template is updated appropriately, and by loading that Heat template into the Cloud Orchestrator and executing it, you drive the entire resource provisioning as well as the workload component deployment; it includes both. This gives you an idea of how each tier is set up, with each tier on a separate network segment; on the right side you see the Cloud Orchestrator, where the entire control and management plane runs and from where we execute the Heat template that drives the setup and the workload deployment. Here is the sample template we used. Here we attach the volumes, the Cinder volumes the data was restored to; here we set up the actual DB2 server, and we used the user_data construct to get the environment set up and the workload component executed. This is all the environment setup; then we set up the respective repos we need in the Heat template, and we also created Chef recipes, because we need to deploy the software component of each tier, the DB2 component and the WebSphere Application Server Liberty profile. For that we used Chef recipes; to automate this, we really need them.
On each VM we provision we need the Chef client running, so we fetch the Chef client, set up access to the Chef server, which we set up locally, along with the key to access it, and then the file systems we replicate based on the workload being run at the primary site. We set up the volume groups and all the file systems required for that data, and then install the DB2 component; that installation happens through the Chef recipe for DB2. Once that is done, the WebSphere Liberty profile is executed, and it goes through pretty much the same process: the Chef recipe for WLP is set up in the same fashion as for DB2, and that is deployed as well. Once the WLP server is up and running, we can get the application up and running. Let me take you through the demo. Here we have IBM Cloud Orchestrator, and we use cloud deployment with the stacks we have here. The stack is the overall web-tier workload, which has the components for the WLP profile as well as the DB2 component. We take the Heat template we generated for this workload and execute it; you need to specify the stack name here. Once you do, it provisions the resources for each of the underlying components. If you look at the history, it starts with the DB2 server, which is what we launched to begin with; that is in process right now.
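The bootstrap sequence described above might look roughly like this inside the template's user_data; everything here (the Chef server URL, recipe name, and volume layout) is a hypothetical placeholder illustrating the pattern, not our actual script:

```yaml
resources:
  db_server:
    type: OS::Nova::Server
    properties:
      image: rhel-base              # hypothetical image
      flavor: m1.large
      user_data: |
        #!/bin/bash
        # 1. Install the Chef client and point it at the locally
        #    hosted Chef server (placeholder URL; the access key
        #    would be delivered alongside).
        curl -L https://chef.example.internal/install.sh | bash
        mkdir -p /etc/chef
        echo "chef_server_url 'https://chef.example.internal'" > /etc/chef/client.rb

        # 2. Recreate the volume group and file systems discovered
        #    at the primary site (placeholder device and layout).
        pvcreate /dev/vdb
        vgcreate db2vg /dev/vdb
        lvcreate -L 20G -n db2data db2vg
        mkfs.ext4 /dev/db2vg/db2data
        mkdir -p /db2/data && mount /dev/db2vg/db2data /db2/data

        # 3. Run the DB2 install via the Chef recipe
        #    (hypothetical recipe name).
        chef-client -r 'recipe[db2::install]'
```

The WLP tier would follow the same shape with its own recipe, which is why the two deployments go through essentially the same process.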
Here, as you can see, the 192.168.102 overlay segment is where the DB2 server is being provisioned, with its IP address being assigned. Once the DB2 server is up and running, we are able to ping that VM. Because the WLP server has a dependency on DB2, the WLP server now comes up on the 101 subnet; right now that is being built. The 102 segment is up and running, and we deploy the software component for each tier, WLP and DB2. Here we are checking the config to make sure it is fully deployed; you can see the deployment of the actual software components in process, and now both DB2 and the Liberty profile are up and running. The entire process is now pretty much complete. Once each tier is up and running, you have the application to run on the application server; the next step deploys the actual application on the application server, and then you can log into the application and do whatever needs to be done. Effectively, this whole process of provisioning the underlying resources and deploying the software components for the entire workload took 12 minutes for us, which is really impressive: provisioning compute, storage, and network and deploying the software components in just 12 minutes. This is where the Heat templates have helped tremendously.
The same concept can be applied to any other workload through the same process: split out the components and create workload definitions for each software component. Here we show that the overall process is complete for the entire set of workload components. In the interest of time, I'll go to the next slide. In terms of the learnings from this exercise: we were able to achieve significant improvements in RTO by fully automating the resource provisioning plus the software deployment. Based on the outcome of this experimentation, we calculated that for about 25 x86 VMs, the infrastructure setup and workload deployment takes about 8 hours the traditional way versus about 1 hour leveraging SDE. That is a significant gain from an RTO perspective, and as you scale up the number of servers the efficiency improves further; we saw the same for the Power platforms. Beyond the time improvement, we also noticed a significant reduction in labor cost. The traditional approach takes several people over those 8 hours because, as I mentioned, it is done in silos for compute, storage, and network, and once the infrastructure is set up you still have to deploy the software components, which takes more time and more people. As next steps, we have several enhancements we could pursue; these are for future summits that we would like to come back with.
If you have any questions, please let me know. Thank you for coming and listening.