Today's talk is about reliable OpenStack. I am Chaitanya, and Ajay is supposed to join any time now. Sorry about that. So, basically: reliable OpenStack, designing OpenStack for high availability. How many of you are actually interested in the high availability aspect of OpenStack? Wow, that's nice. And what kind of models are you using? We'll go over it; it's hard to answer. So the outline of the talk is as follows. I'll talk about high availability: what it is and what kinds of high availability options there are for OpenStack right now. We'll divide high availability into three sections. The first is platform high availability, which means your cloud is always available. The second one we are going to talk about is VM-level high availability: is your virtual machine always up and running? The virtual machine is basically your workload. And the third section is about application-level high availability. Hello, Ajay. When I say application-level high availability: your cloud is up, your VM is up, but is the application running inside your VM, your web server or whatever web app, running and healthy? That is the third level of high availability. So we'll go over approaches to all three of these high availability problems. Ajay will start with the standard approaches and I will take over later. Okay, sounds good. Before going into high availability, let me do some basic definitions that we are going to use for the talk, and let me draw a distinction between high availability and failover. What we mean by high availability is a system which is always on, can tolerate failures, and can heal itself. So when a failure happens, the system heals itself so that it can tolerate another failure if one were to happen.
And ideally it should take the system only a few seconds after the failure to auto-heal. What I mean by failover is that you can tolerate the fault only temporarily. For example, if you run a service on two nodes, you can tolerate the failure of one node, but if the second node were to fail, the service would go down completely. In this case, somebody has to come in and heal the system again to make it run on two nodes. So the distinction we draw is: if you are not doing auto-healing, that's more like failover rather than continuous, dynamic high availability. Also, for the purposes of the talk, let me state a high-level goal of what we are really trying to achieve here, if we step back from availability for a second. At the end of the day, enterprises really want a web-scale private cloud that they can run which is always available, and they should be able to start small and then grow based on demand. When you think of a cloud, you should not always have to start with a very large cloud. You can start somewhere small and then add more nodes and infrastructure as your needs actually grow. So let me first start with some of the standard approaches before we go into what we are doing at ZeroStack and how we are improving reliability. The standard approach is something most of the people here are probably familiar with, and we are starting with stateless services. In the case of a stateless service, you can start with multiple controller nodes. In this example there are two controller nodes, and we are running NOVA as an example of a service. In order to provide higher availability, you would run HAProxy, which would hold the virtual IP for the service, and requests would go to either of the NOVA instances. And you can have separate compute and storage nodes, which may or may not be converged depending on which solution you are using.
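For reference, the stateless setup just described might look like the following HAProxy fragment. This is a hypothetical sketch, not from the talk; the IP addresses and server names are made up (8774 is the standard nova-api port):

```
# Hypothetical haproxy.cfg fragment: a VIP fronting two nova-api instances.
frontend nova_api_vip
    bind 10.0.0.100:8774              # virtual IP clients talk to
    default_backend nova_api

backend nova_api
    balance roundrobin
    server controller1 10.0.0.11:8774 check   # health-checked NOVA instance
    server controller2 10.0.0.12:8774 check   # second NOVA instance
```

If controller1 fails its health check, HAProxy simply stops routing to it; note that nothing here heals the failed instance, which is exactly the limitation discussed next.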
But if you look at this standard architecture, there are a couple of problems you have to deal with. The first problem is that in the beginning you are starting out with a fixed set of controller nodes. You can start with one controller node, or three, or five, whatever number you choose, but as the rest of the infrastructure scales, the controllers do not scale automatically. You have to worry about scaling the controllers as the infrastructure scales if you really want to run more instances of a service. The second problem is that HAProxy itself is a single point of failure, so now you have to worry about high availability of HAProxy itself and deploy some techniques for that. A related issue is that if you do this for every service, for NOVA, for Cinder, for Glance, you are adding an extra hop for each one. So if a request comes in that requires multiple services to talk to each other to complete, you are adding an extra hop for each of those services. The third problem is what happens when one of the nodes in this system actually fails. For example, one of the NOVA nodes fails. Now you are back to the failover scenario. The other node is still running, which is fine for the time being, but if that second node were to go down, you would really have no service running, and at this point you need some sort of intervention to set up another node, make sure it is running an instance of NOVA, and stitch it back into HAProxy. The fact that it requires manual intervention becomes another problem in a scale-out system, where you want the system to run on hundreds to thousands of servers but you don't want people to come in when something fails, because the larger the infrastructure, the higher the probability that something will fail at some point in time.
So far we looked at stateless services. If you look at stateful service reliability in the standard approach, specific to OpenStack here, we are looking at either MySQL or AMQP, which is RabbitMQ or any other AMQP service. Again, you would set up some sort of replication, because here you care not only about service-level availability but also data-level reliability. One of the standard approaches is to set up replication of the data between two different nodes, for example with a DRBD-based solution, or you can use shared storage, which could be an external SAN- or NAS-based shared storage. And again, the service can be active-active or active-passive depending on which solution you are deploying. Now, if you go with this solution, there are problems again. First, these special nodes are needed. It's like giving different nodes personalities: you have nodes which are your compute nodes, nodes which are your storage nodes, nodes which are running your controller or doing MySQL replication. These special nodes are fine if you have a small-scale environment, or if you are doing this for test and dev only. But if you are really running something in production, you can't be dealing with these special nodes in the system. And if something were to fail, manual intervention is again needed. The service would keep running on the second node in case of a failure, but now you are back to the failover scenario where you have failed over and have to fix the system yourself. Again some admin has to come in, set up replication with another node, and make sure the system can tolerate another fault. A similar problem exists if you go with the shared storage model, with one extra complexity: you have to worry about the external shared storage itself, which in some cases can be expensive and is another silo to maintain.
And if there is a problem, again you have to worry about some sort of manual intervention. So, to summarize the standard approach: this is something many people have used in the past, and for some use cases it may be okay, but for a good scale-out solution it doesn't really work that well. In most cases, some sort of human intervention is needed when an error occurs, and there are well-known results that say almost 40% of errors happen when human beings go into a data center trying to fix one problem and create another, or mistype something, and then you have to deal with those errors too. The more special nodes you have in a large-scale system, the more you have to worry about this, because one node failing has different semantics from another node failing. So we looked at all this and thought about what a better, scalable approach to high availability could be, where you don't have to deal with all these manual steps. I'll let Chaitanya go through some of the solutions we are building for higher availability. So, everybody is trying to build a private cloud, and OpenStack is the solution for private cloud. Now compare private cloud with public cloud. In a public cloud, I don't know, Amazon might be using MySQL to manage their platform, and if it goes down, I don't really care; my workload is still running. That's what we want to get to with OpenStack. What we built for high availability is basically a distributed control plane. What is a distributed control plane? It's a small distributed service that runs on all nodes in your cluster, all nodes in your private cloud, and it manages the OpenStack services. It takes care of health-monitoring the OpenStack services, migrating OpenStack services, bringing up OpenStack services, and so on. And this distributed service is itself fault tolerant.
We'll see how it is done. One more thing we try to achieve at ZeroStack is that this distributed control plane can use any node in your private cloud to run a service. There is no concept of MySQL nodes, no concept of NOVA nodes; any node can be used for any purpose. What this achieves is: your active node died, so you moved to the passive node; that node also died, so you moved to some other node, and so on. If you have a 100-node cluster, we want to tolerate 97 failures. That means you have a 100-node private cloud, 97 machines go down, and your private cloud should still be functional. That's our goal at ZeroStack, and we'll see how it is done. This distributed control plane not only uses any node to run OpenStack services, it also heals the problems caused by any node going down. What is this healing? Basically, you have some data sitting across your 100 nodes in the cluster, and one node goes down. Suddenly the replication factor for that data has gone down, so somebody needs to take care of fixing this under-replication; only then can you actually tolerate one more node failure. The distributed control plane takes care of initiating the healing: it fixes the under-replication. Once the under-replication is fixed, you basically have a brand new cluster with 99 nodes, and it can again tolerate one more node failure. And once that failure is also healed, you have a 98-node cluster. The procedure is the same each time: after a node failure there is a period where you heal the cluster, and once the healing is done, you can take one more failure. That's how we tolerate 97 node failures out of a 100-node cluster.
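The healing step described above can be sketched in a few lines. This is a minimal illustration under assumed data structures (a shard-to-replicas map), not ZeroStack's actual implementation: after a node failure, any shard whose replica count dropped is re-replicated onto surviving nodes.

```python
REPLICATION_FACTOR = 3

def heal(placement, live_nodes):
    """Restore the replication factor for every shard after node failures.
    `placement` maps shard name -> set of nodes holding a replica."""
    for shard, replicas in placement.items():
        replicas &= live_nodes                        # drop replicas on dead nodes
        candidates = sorted(live_nodes - replicas)    # live nodes not yet holding this shard
        while len(replicas) < REPLICATION_FACTOR and candidates:
            replicas.add(candidates.pop(0))           # copy the shard to a new node
        placement[shard] = replicas
    return placement

# Node n0 dies, leaving shard "a" under-replicated; healing fixes it,
# and the cluster can then tolerate the next failure.
nodes = {"n0", "n1", "n2", "n3", "n4"}
placement = {"a": {"n0", "n1", "n2"}, "b": {"n2", "n3", "n4"}}
nodes.discard("n0")
placement = heal(placement, nodes)
```

After `heal` returns, every shard is back at three replicas on live nodes, which is exactly the "brand new 99-node cluster" property from the talk.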
So how is such a distributed control plane implemented? That's the core of this talk. How can you do such a thing? Okay, yeah. The first concept we use for implementing such a distributed control plane is leader election. In your 100-node cluster, one node becomes the leader. It uses a fault-tolerant leader election algorithm; Paxos is one, and there are several other fault-tolerant algorithms you can build on top of fault-tolerant key-value stores. Once this algorithm picks a node as the leader, that node takes responsibility for the entire cluster, all 100 nodes, and it brings up all OpenStack services in the cluster one by one. At the beginning, every node is basically a fresh Ubuntu installation, nothing else. The leader then configures and starts OpenStack services on all nodes in your private cloud, and you have a cloud running. Now this leader keeps on monitoring the health of all services. He goes and checks: on the node where MySQL is running, he actually makes a query; on the node where Keystone is running, he actually performs some Keystone operations. Basically, he knows everything that is going on in the cluster, and he makes sure all services are running and healthy. He also makes sure that services are distributed across the cluster. For example, if you have different racks in your data center, he makes sure Keystone instances are running across racks, those kinds of things. Okay, so a typical configuration with a four-node cluster looks something like this, where one node is running NOVA and Cinder, another node is running Neutron, Glance, etc. This is how a four-node cluster looks.
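The leader election idea can be sketched with an atomic create-if-absent on a key-value store. The in-memory `KVStore` below is a toy stand-in I'm assuming for illustration; a real deployment would use something like ZooKeeper ephemeral nodes or etcd leases, as mentioned later in the talk.

```python
import threading

class KVStore:
    """Toy stand-in for a fault-tolerant key-value store (ZooKeeper, etcd)."""
    def __init__(self):
        self._data, self._lock = {}, threading.Lock()

    def create_if_absent(self, key, value):
        """Atomic: succeed only if the key does not exist yet."""
        with self._lock:
            if key not in self._data:
                self._data[key] = value
                return True
            return False

    def delete(self, key):
        with self._lock:
            self._data.pop(key, None)

    def get(self, key):
        return self._data.get(key)

def try_become_leader(kv, node_id):
    # Whichever node creates the key first wins the election.
    return kv.create_if_absent("/cluster/leader", node_id)

kv = KVStore()
winners = [n for n in ("n1", "n2", "n3") if try_become_leader(kv, n)]
# Exactly one node wins; the others keep watching the leader key.
kv.delete("/cluster/leader")       # leader dies -> key removed (or its lease expires)
reelected = try_become_leader(kv, "n2")   # a surviving node wins the re-election
```

The important property is that the store serializes the create operation, so even if all 100 nodes race, exactly one becomes leader per round.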
A hundred-node cluster will have one service running per node. And everybody is participating in the leader election algorithm, so one of them becomes the leader; the leader is that green guy. Now, how does he handle failures? Okay, I already told you that he monitors: he knows where services are running and he keeps checking that Keystone is healthy, MySQL is healthy, etc. Now suppose he finds that he cannot reach some node, or cannot perform some operation, like a MySQL query. He couldn't do the MySQL query; that means the leader somehow got disconnected from MySQL. Something happened in the cluster and the leader is unable to talk to MySQL. MySQL could be down, or the node could be down; we will figure it out. So what he does is compute: these are the nodes I can talk to; what is the best plan for running services in this configuration? Say the node which is running MySQL goes down. That means he can talk to 99 nodes in the cluster, but the MySQL node is down. So he creates a plan: where should I run MySQL in this 99-node cluster? He picks one node, some random guy, and decides to run MySQL on him. That's what we call the service mapping: which nodes are supposed to run which services. Next, he tries to migrate services so that the current, active layout of the cluster matches his plan. He does that by stopping some services on some nodes and starting services on others. And if it is a node failure, he also initiates healing. Healing is: what do I need to do to fix up a service so that it can take one more fault? If it is a distributed storage kind of service, he needs to fix the under-replication.
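The replanning step, computing a new service mapping from the set of reachable nodes, can be sketched like this. The load-balancing heuristic (least-loaded live node) is my assumption for illustration; the talk only says the leader "picks one node".

```python
def plan_service_map(services, live_nodes, old_map):
    """Recompute which node should run each service. Placements on live
    nodes are kept; services whose node died are re-homed onto the
    least-loaded live node."""
    load = {n: 0 for n in live_nodes}
    for node in old_map.values():
        if node in load:
            load[node] += 1
    new_map = {}
    for svc in services:
        node = old_map.get(svc)
        if node not in live_nodes:                    # its node died
            node = min(sorted(live_nodes), key=lambda n: load[n])
            load[node] += 1
        new_map[svc] = node
    return new_map

# n3 (running MySQL) fails; MySQL is re-homed, everything else stays put.
old = {"mysql": "n3", "nova": "n1", "keystone": "n2"}
live = {"n1", "n2", "n4"}
new = plan_service_map(["mysql", "nova", "keystone"], live, old)
```

The leader then reconciles the cluster toward `new` by stopping and starting services, as described above.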
If it is a distributed key-value store, he needs to migrate data out of the failed node. Somehow he needs to fix the system so that it can take one more failure. That's what we call healing. I'll show you a demo of how it happens. So let's assume that in a four-node cluster, the node running MySQL and Heat, say, failed. He recognizes the failure and migrates MySQL and Heat onto other nodes of the cluster. This is simple; that's how service- and node-level failures are handled. A service failure is basically a simpler version of a node-level failure: maybe only one service is crashing. Okay, now let's look one step further. We saw how service failures and node failures are handled. But how about leader failures? The node which became the leader, which started everything in your private cloud, itself went down. How do we handle this scenario? If you remember, I told you that all 100 nodes in your cluster are participating in the leader election algorithm. The algorithm picks one guy as the leader, and the remaining 99 are not leaders. What do they do? They keep monitoring the leader's health, and if they figure out that the leader is faulty or unavailable, they start a new leader election round, a re-election, and they bring up a new leader. Now one more guy becomes the leader. This new guy somehow needs to figure out what the old leader was doing, and the old leader is already down, so he cannot even talk to him. So how does he figure out what the old leader was doing, what his service map was, how he was running services in the cluster? The way we do it is that the leader always stores his state in replicated storage, in a distributed WAL. How many of you know what a WAL is? Write-ahead log. You've heard of it, right?
So any time the leader is about to do some operation, like migrating MySQL from node A to node B, he writes an entry in the WAL saying, hey, I am going to migrate MySQL from node A to node B, before he does it. Once the write to the distributed WAL is successful, he then issues the commands to migrate the service: shut down MySQL, unmap its disks or whatever, then issue an RPC to node B saying, mount the MySQL volume here and start MySQL, and so on. This is how the distributed WAL mechanism works, and the distributed WAL is basically a replicated log. So if the leader goes down and a new leader is elected, the new leader still has access to the distributed WAL and can read its contents. He reads the WAL content and restores the state of the previous leader; basically, he learns what the previous leader was trying to do. The previous leader's state is the service mapping, which nodes are supposed to run which services, plus whatever he was in the middle of doing. So let's look at it in action in a four-node cluster. Say the leader died; then what happens? Immediately after the leader dies, all the non-leader nodes, the other three guys, figure out that the leader is dead and they try to become the leader of the cluster. Only one of them wins, because of the leader election algorithm. That guy takes over the leader's responsibility, restores his state, and figures out that the node running the previous leader was also running the NOVA service and the Cinder service, so he migrates them out. And this is how we handle leader failures. Yeah, sure. Yeah. I see. Okay. Your question is? We can do it after the talk. Okay. Just hold on to that. That's a great question.
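The intent-before-action pattern described above can be sketched as follows. This is a toy, in-memory stand-in for illustration; the real WAL is replicated across nodes so a new leader can read it even after the old leader's node is gone.

```python
import json

class DistributedWAL:
    """Toy write-ahead log. The real one is replicated; an append must be
    durably acknowledged before the leader acts on it."""
    def __init__(self):
        self.entries = []

    def append(self, record):
        self.entries.append(json.dumps(record))

    def replay(self):
        """A newly elected leader replays this to reconstruct the old
        leader's state and in-flight operations."""
        return [json.loads(e) for e in self.entries]

def migrate_service(wal, svc, src, dst):
    # 1. Log the intent first...
    wal.append({"op": "migrate", "svc": svc, "from": src, "to": dst})
    # 2. ...only then act: stop svc on src, unmap disks, RPC dst to
    #    mount the volume and start svc (actual RPCs omitted here).

wal = DistributedWAL()
migrate_service(wal, "mysql", "nodeA", "nodeB")
last = wal.replay()[-1]   # what a successor leader would learn
```

Because the intent is logged before any command is issued, a crash at any point leaves enough information in the WAL for the next leader to finish or roll back the migration.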
We'll handle it after the talk. Yes. Okay. So far I've said: I detected a node failure, I detected a service failure, I detected a leader failure. But how do you detect a failure in a distributed system? You cannot differentiate between a network getting disconnected and a node going down. Maybe the process did not crash, but its network cable got disconnected. If you try to programmatically figure out whether a node is dead or just disconnected, you cannot; that's the nature of networked systems. So, given that you cannot differentiate between a node going down and a network cable being pulled, how do you safely make the decision that MySQL should be started on some other node? The guarantee you need to provide is that there are never two instances of MySQL running at the same time. If the node that got disconnected is actually still running MySQL, and you now decide to start MySQL on some other node using the same shared volume, then you have two MySQL instances trying to run on the same disk, and that is going to just break the system. How do you ensure this? The way we do it at ZeroStack is with leases. When the leader picks some node to run MySQL, he issues a lease saying: you can run MySQL only for five minutes. Let's assume five minutes is the number. After five minutes, the node itself will stop the MySQL service; nobody is asking it to stop, it just kills the MySQL service after five minutes. That's how leases work. Now this sounds nonsensical: if you start MySQL on some node and it shuts itself down after five minutes, your system keeps going down.
The way we keep it alive is that the leader periodically refreshes the leases. He knows the service mapping, so he refreshes the leases on all nodes, say every minute. If the leader keeps refreshing, there is always a refresh within five minutes of starting MySQL, so the node gets the refreshed lease and extends it to six minutes, then seven minutes, and so on. As long as the leader can keep refreshing that node, the service will not go down. If the leader gets disconnected from that node, all the leader needs to do is wait five minutes; by then that node will have stopped MySQL itself, and the leader can safely start MySQL somewhere else. That's how we do it, but five minutes is a big wait time, so we actually do it in under ten seconds. The lease time is service-specific: MySQL has a ten-second lease, some other service has a five-second lease. It's all configurable. That's how we detect node failures. Now, some implementation details about how ZeroStack does it. ZeroStack uses a distributed key-value store, and we built our leader election algorithm and our distributed WAL on top of that key-value store. There are lots of options for a distributed key-value store: ZooKeeper is one, etcd is one. Raft is a good consensus algorithm you can use, and Paxos is another. There are lots of options. Now, one more thing, I don't know if you noticed: if you are moving your MySQL service around across the nodes, it doesn't have a fixed address. How do we handle that? We use virtual IPs for every service. So when MySQL moves, its IP address moves with it. On the new node where MySQL is started, we first assign the virtual IP to that node and then start MySQL; that's all part of MySQL startup.
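The lease mechanism above can be sketched in a few lines. The durations here are shrunk to fractions of a second purely so the example runs quickly; the talk's actual numbers are ten seconds for MySQL and it is configurable per service.

```python
import time

class Lease:
    """A node runs a service only while its lease is unexpired. The leader
    must keep refreshing it; a disconnected node self-terminates the service."""
    def __init__(self, duration):
        self.duration = duration
        self.expires_at = time.monotonic() + duration

    def refresh(self):
        """Called when the leader's periodic refresh arrives."""
        self.expires_at = time.monotonic() + self.duration

    def expired(self):
        return time.monotonic() >= self.expires_at

lease = Lease(duration=0.05)       # tiny lease so the example runs fast
alive_before = not lease.expired() # freshly granted: service keeps running
lease.refresh()                    # leader's refresh extends the deadline
time.sleep(0.06)                   # leader disconnected: no refresh arrives
dead_after = lease.expired()       # node now stops MySQL on its own
```

The safety argument is that the leader only starts MySQL elsewhere after waiting out a full lease duration, by which point the old node has provably killed its own instance, so two copies never share the disk.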
And if you are using virtual IPs, you have to make sure that the ARP entries expire after a timeout. All our timeouts are configurable, so you can decide on the responsiveness of each service; that way you can build a better system. So given this entire design, what are the benefits? These are the benefits: you don't have any single point of failure, because there is no such thing as a controller node; anybody can become the leader. High fault tolerance: once the healing process is done, you can take one more fault, as if your 100-node cluster were simply a 99-node cluster. As I said, there are no special controller nodes in our system. And we do automatic healing: if a node drops out, we fix the under-replication, and the distributed key-value store itself stores its data across nodes, so under-replication has to be fixed there too. And the best part: if there is a failure, you don't need admins to go and look at it. Of course, your mileage will vary. Okay. So, to summarize this part of the talk: the main goal we started with is to get high availability in the platform. There is one set of approaches, which we would call failover approaches with some manual healing, and that covers most of the standard recipes and tools that a lot of people use. The second set of approaches, which we think is better for a scale-out, web-scale private cloud system, is to use some sort of leader-based self-healing mechanism, with a control plane that takes care of fixing the system if anything goes wrong.
And essentially, the left-hand side is easier to do manually if you are running a small-scale environment, where you know exactly what is running where; but if you are running at larger scale, it's actually better to have an automated system, which in reality would be harder to debug manually, but the fact that software takes care of it means you don't have to worry about it. And that's what we have built at ZeroStack. So with this, let me give you a demo. I'll go through a video demo, and let me first set up the overall scenario. Here I want to talk a little bit about the overall ZeroStack solution, because that's what we are going to use for the demo, so it becomes easier to explain. The ZeroStack solution has two parts. One is a converged hardware node that you rack and stack in your data center or colo or wherever you want; you can even do it across multiple sites. Once you rack and stack that node and assign it an IP address so that it can talk to the outside, there is a ZeroStack control plane, a ZeroStack cloud platform, running as a SaaS layer, and everything else happens through that SaaS layer. The high availability part we talked about is the control plane running on the ZeroStack nodes, which are on-prem. The SaaS layer is there to help you consume the cloud, monitor it, and do operational intelligence on it: things like capacity planning and monitoring the cloud are all done through the SaaS layer. The idea is that the overall cloud should look and feel like a public cloud, but everything is running on-prem. So let's just go through the demo, and I'll explain the rest as part of the demo itself. In this scenario, somebody has gotten the ZeroStack converged system, racked and stacked it, and put it in their data center.
Once you do that, you go to the ZeroStack cloud portal and register the company. In this case we register a company called HA Inc. You get a validation code, a customer-specific code that you enter on the machines you have put in your data center; this is how those machines get associated with your customer account. Once the nodes are attached to the customer account, when you log into the cloud portal again, you actually see those nodes in your environment. In this case we entered the code on four machines, and you can see all four of them. Now you go through some very basic setup; it literally takes about two to three minutes to enter some network information. We provide different pools of storage, SSD-based and disk-based pools. And you go through a little bit of setup to create a cloud admin account, the person who knows everything about the cloud. Overall, you are making very few decisions, just around storage and networking; everything else is taken care of by the control plane itself. Then you get a summary of the cloud it is going to build, showing how much capacity there is overall. Once you review that, you start building the cloud. This is a step that takes about ten to fifteen minutes, and I'm not going to make everybody wait fifteen minutes to watch it, so we have shortened this step and it will build the cloud in about thirty seconds or a minute. At this point it is literally creating a cluster across these machines, initializing the control plane across them, bringing up OpenStack services, and stitching everything together. In about fifteen minutes you would have a fully functional private cloud running on those machines. If you want to add another node, you just rack and stack it and assign it IP addresses, and it will show up in your account.
So it did the configuration of the cluster, now it's deploying OpenStack services, and it says your private cloud is ready to launch. I'm building all this up before I go into the HA demo, just to show a little bit of the setup, so that when something fails I can show what failed and what was running in the system. Here you can see the infrastructure view: one region, one availability zone, and four hosts, because we created the cloud on four hosts. You could create it on 32 nodes or 100 nodes; it wouldn't make a difference. For consumption, you first create a business unit, which is like a BU within your organization, and you can associate an admin with it; user one here is the admin for that business unit. Once you do that, you can create projects within the unit. A project is like a mini virtual data center, which is where the users come in; users consume the projects they are part of. Since we just created the cloud, we are going to create a quick project so that we can start creating workloads in it, and then we'll see high availability in action once we have some workloads in the cloud. As part of creating the project, it's obviously customizable: you can set the dimensions of the project, how many vCPUs, how many VMs, and you can associate different users with the project. So the project is pretty much the unit of consumption here, and you can create an external network within it, so that when you launch VMs, they can get IPs from that external network. All this is just to get to the main thing, which is high availability. Now we have created the project, and within the project you can see there were no VMs yet. Then we go ahead and create some workloads.
Now we have created some VMs in the background, and when you log in, the project has five VMs running; you can look at networking, images, volumes, and all of these things are live. It's a fully functional private cloud running now, and the VMs have external IP addresses; you can ping outside. This is just to show that everything is working here: it's a cloud with real VMs running. Now we are going to go through a failure scenario, and to stress the point, it's literally a four-node system, a single unit with services running across four nodes. If you deployed a hundred-node system, you would see one service on some of the nodes and nodes with no services on them. But since it's a four-node system, you can see there are four nodes, and when you look at the service map, different services are running across different nodes. In this case you can see there is Glance and RabbitMQ running on Z host one, and there are other hosts in the system running Heat and Neutron. Now we are going to go and shut down one of the machines. We actually chose to shut down a critical machine, Z host three, which is running MySQL and NOVA. We just log into the machine and shut it down. You could even pull the power out of the machine; it would be the same thing. It's just that we have redundant power supplies, so pulling one power cord doesn't shut anything down, and we have to do the shutdown command manually. Now, if you wait about ten to fifteen seconds and run the same command again, show me the service map, you can see the services are now running across Z hosts zero, one, and two. Node three is still part of the cluster, but there is nothing running on it. The machine is gone, and the services have automatically migrated. Now we'll go and try to create some workload again.
So now we just log into the UI, create another VM, and show that everything works even though one of the nodes is down. We create a new VM, give it a floating IP, and check network connectivity to make sure everything else is working. You are essentially running a private cloud with somewhat less capacity, because one of the nodes is gone, but everything else is completely functional. Now the VM we just created is up and running. We go into its console and check the overall connectivity of the VM itself, which basically exercises all the services in the system and shows that they are running fine. We also look at one of the existing VMs, and that is working fine too. That pretty much ends the demo part. One thing you'll notice here is that below every entity there is a timeline showing what is happening to that entity. When you bring up a VM, it shows all the VM-level operations at the bottom; for a project, it shows all the project-level operations. With every object, we store what is happening to it as a time series, and you can do a lot of analysis on top of that. Now, one thing many of you may be wondering: we shut down one of the nodes and the services migrated, but what happens to the VMs? That node was running some VMs, and unfortunately those VMs are obviously down. A lot of people come back and say that some solutions, like VMware, have a feature called high availability, which is VM-level HA. What it essentially says is: if a node running some VMs goes down, you should be able to restart those VMs on some other node, so that a VM that goes down automatically comes back up, with the same disks, on another node.
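The per-entity timeline described above is, at its core, an append-only event log keyed by object. A minimal sketch, assuming a simple in-memory store (the class name `Timeline` and its methods are hypothetical, chosen for the example):

```python
import time
from collections import defaultdict


class Timeline:
    """Append-only, per-object event log: every entity keeps a time
    series of the operations applied to it, which can later be queried
    for display or analysis."""

    def __init__(self):
        self._events = defaultdict(list)  # object id -> [(timestamp, event)]

    def record(self, obj_id, event, ts=None):
        """Append one event; timestamp defaults to 'now'."""
        self._events[obj_id].append(
            (ts if ts is not None else time.time(), event)
        )

    def history(self, obj_id, since=0.0):
        """Events for one object, oldest first, optionally from a cutoff."""
        return sorted(e for e in self._events[obj_id] if e[0] >= since)


# Usage: a VM's lifecycle shows up as its timeline.
tl = Timeline()
tl.record("vm-1", "created", ts=1.0)
tl.record("vm-1", "booted", ts=2.0)
```

A production version would persist these events in a time-series or log-structured store, but the interface (record per object, query per object) is the same shape.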
The problem with doing VM-level HA is again the same problem as with networking: how do you detect that a node is actually down? It is hard to distinguish a network disconnect from a node being down, and, as Chaitanya mentioned before, if you think a node is down just because you cannot talk to it, the VM could actually still be running there. If you now start the same VM somewhere else using the same backend storage, you are going to have corruption in literally two minutes. The way to solve the problem is to again have an agent on the host which makes sure the host is connected to the network. You can define connectivity and availability however you want: it could mean the host can talk to the remaining nodes in the cluster, or that it can talk to the internet, whatever you define. Once the agent realizes that the node cannot connect outside, it stops the VMs. In this case, the agent detects that it cannot talk outside, and as part of that it shuts down the storage connectivity, making sure that any storage this machine was talking to can no longer be reached, and it also shuts down the VMs on the host. If the host is literally dead, there is no agent to take care of things, so you make sure it is dead by issuing an IPMI command. The control plane detects this, and after a timeout, which covers the window within which the agent would have acted, it brings up the VMs on some other node. This gives you VM-level high availability where you don't have to worry about restarting the VMs yourself. Now, moving on to the ultimate high availability, which I think at the end of the day everybody cares about: application-level high availability. Everything else is just to achieve this goal.
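The self-fencing agent described above can be sketched as a small loop: probe connectivity, and after several consecutive failures, cut storage first (so a half-isolated VM can no longer write) and then stop the VMs. This is a hedged illustration, not the actual agent; the thresholds, the ping-based probe, and the callback names are assumptions for the example.

```python
import subprocess
import time

CHECK_INTERVAL = 5   # seconds between connectivity probes (assumed value)
MAX_FAILURES = 3     # consecutive failures before self-fencing (assumed value)


def can_reach_peers(peers):
    """One possible connectivity policy: can we ping any other cluster
    node? (The talk notes you could equally test internet reachability.)"""
    for peer in peers:
        if subprocess.call(["ping", "-c", "1", "-W", "1", peer],
                           stdout=subprocess.DEVNULL) == 0:
            return True
    return False


def fence_self(stop_vms, detach_storage):
    """Order matters: detach storage first so no isolated VM can keep
    writing to the shared backend, then stop the VMs themselves."""
    detach_storage()
    stop_vms()


def agent_loop(peers, stop_vms, detach_storage, probe=can_reach_peers,
               interval=CHECK_INTERVAL, max_failures=MAX_FAILURES):
    """Run forever; self-fence and exit after too many failed probes."""
    failures = 0
    while True:
        if probe(peers):
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                fence_self(stop_vms, detach_storage)
                return  # fenced; the control plane restarts VMs elsewhere
        time.sleep(interval)
```

The control plane's restart timeout must be longer than `MAX_FAILURES * CHECK_INTERVAL`, so the agent is guaranteed to have fenced (or the IPMI power-off has confirmed the host dead) before the same VM is started anywhere else.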
Infrastructure level, VM level: you ultimately want app level, and that is something we are looking at for the future, but I just want to point out a few things that can make your life easier if you are thinking about application-level availability. There are two main kinds of failures here. One is infrastructure-level failures, which we are going to focus on here. The second is app-level failures, which we are not targeting; for those you have to have your own application plug-in, liveness checks, and so on. For the infrastructure level, the standard way people do most app-level availability is to run the app across multiple AZs, design it either active-active or active-passive, and run a load balancer in front. What you can do better on top of this design is address its locality problem: within any AZ where you deploy the app, you really have no locality if you just go and deploy a bunch of VMs. This is especially noticeable in something like a private cloud: you deploy a bunch of VMs, they land somewhere in a large data center, and if a request comes in to your first tier of VMs, goes to the second tier, goes to the third tier, by the time it gets done it may have traversed maybe 25 different network hops and switches. That is where one of the problems comes in: you don't have much control over placement, so you suffer a performance penalty. On the availability side, the problem is that without control over host placement, a single host failure can actually take down a full tier. In this example, in rack one, both of your tier-one VMs are running on the same host; if that host goes down, the app is not available, and it doesn't matter that everything else is up. Instead, you can place the VMs in different groups, and you can have
affinity within a group across tiers and anti-affinity across groups, which gives you much better performance and much better failure tolerance if something goes wrong. We're almost out of time, so just to conclude: the current techniques are good for some small use cases, but we believe they don't scale well to a web-scale architecture. The key ideas for scalable HA are: don't have special nodes in the system, since a symmetric design is much easier to deal with; you want automatic healing; and you want some sort of consensus-based approach to take decisions. VM-level HA gives you better failure tolerance, and you need a lot of good detection and isolation for that; app-level HA can give you better performance and reliability. I would be happy to take questions. I think we're almost out of time, so we can always talk outside, and feel free to drop by our booth, T43; we'd be happy to give a lot more detail there. Thank you so much.
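As a concrete sketch of the placement scheme mentioned at the end (affinity within a group across tiers, anti-affinity across groups): each group holds one VM from every tier and is pinned to one host, and no two groups share a host, so requests stay local while a single host failure loses at most one replica per tier, never a whole tier. This is an illustration with made-up host and VM names; in stock OpenStack the equivalent knob is Nova server groups with `affinity` / `anti-affinity` policies.

```python
def place(groups, hosts):
    """Assign each group (one VM per tier) to its own host.

    - Affinity within a group: all of a group's tiers are co-located,
      so a tier-1 -> tier-2 -> tier-3 request never leaves the host.
    - Anti-affinity across groups: groups never share a host, so one
      host failure takes out at most one replica of each tier.
    """
    if len(groups) > len(hosts):
        raise ValueError("need at least one host per group for anti-affinity")
    placement = {}
    for group, host in zip(groups, hosts):
        for vm in group:
            placement[vm] = host
    return placement


# Usage: two replica groups of a three-tier app across two hosts.
groups = [["web-1", "app-1", "db-1"], ["web-2", "app-2", "db-2"]]
placement = place(groups, ["host-a", "host-b"])
```

If `host-a` dies here, `web-2`/`app-2`/`db-2` on `host-b` still form a complete stack, which is exactly the failure-tolerance property the talk contrasts with both tier-1 VMs landing on one host.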