Hello, everyone. This is Dhruv. I'm here with my colleague Nirendra. We are from Veritas, and today we are here to talk about a very interesting topic, one that is very close to our hearts: resiliency of the Cinder node.

So we all love Cinder, right? We have to love it; it is the service that provides the backend storage on which our applications actually run. So let's take a very quick look at what Cinder has underneath. We have Cinder volume services, which talk to a backend exposed over LVM, Ceph, or similar drivers. These Cinder volume services run on a node which we very lovingly call our Cinder node. The Cinder scheduler and Cinder API services run elsewhere, typically on the controller node, though they can run on other nodes as well. All these services talk over a messaging bus, such as AMQP. And this entire infrastructure is exposed to the user via REST APIs.

So what are the failures that can happen in this infrastructure? Well, a network can go down, meaning either the management network or the data network: the management network is where the control flow happens, and if you have network-exposed storage and that network goes down, we call it a data network failure. The storage itself can go down, which means either a disk failure, a RAID controller or array failure, or something similar. There can be a server failure as well: the entire Cinder node can panic or be rebooted, or there can be other software failures on the server. All of these are really frustrating, right? So what are the resolutions?

For this, let's take a very quick look at a very basic setup. There is a compute node talking to a Cinder node, with the storage exposed through LVM over iSCSI. There are two Cinder nodes, each with direct-attached storage, and the data is continuously replicated between them. The Cinder node the compute node is talking to we call the primary Cinder; the other one we call the standby Cinder. So this is a very basic OpenStack architecture, with, of course, a Cinder scheduler running elsewhere.

So what if the primary Cinder node goes down, or there's a failure here? There are the three types of failures we just discussed. If it is a network failure, then we recommend the admin wait for some time and figure out whether it is really a network failure or just a network glitch; if he determines it is a real failure, he can go ahead and change the Cinder ownership. If it is a media failure, the admin can use tools like smartd, or the various other tools that are available, to determine whether the media has actually failed; if it has, he can simply go ahead and change the Cinder ownership. If it is a node failure or a software failure, the admin can monitor the Cinder service status and, based on that, change the Cinder ownership. In this particular example, let's say the node has failed, the admin learns that his Cinder is now down, and now he has to go ahead and change the Cinder ownership. So what steps would he take?
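Before walking through those steps, here is a rough, illustrative version of the checks just described. The hostnames and device names are made up for this sketch; the talk itself does not prescribe exact commands.

    # Network failure or just a glitch? Probe the Cinder node over the
    # management network for a while before concluding anything.
    ping -c 5 cinder-primary

    # Media failure? Query SMART health on the suspect disk
    # (smartctl is the CLI companion of the smartd daemon).
    smartctl -H /dev/sdb

    # Node or software failure? Check what Cinder reports for its services;
    # a cinder-volume entry stuck in state "down" points at the node.
    cinder service-list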
Since here the storage is exposed through LVM over iSCSI, and the standby Cinder node already has all the data, the admin can go ahead and import the volume groups on the standby Cinder node; the commands he would use are sketched below. Once the LVM volume group is imported on the standby Cinder, the admin has to go ahead and change the Cinder ownership via the Cinder scheduler. Once that is done, the standby node has effectively become the primary Cinder node, and the compute node is now talking to the standby Cinder, which now serves its data. Now, all of this was very crude, right? I mean, it was a pain for the admin.
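A minimal sketch of that manual failover, assuming a volume group named cinder-volumes and backend hosts named cinder-primary@lvm and cinder-standby@lvm (all of these names are illustrative, not from the talk):

    # On the standby Cinder node: make the replicated volume group visible
    # and activate its logical volumes. vgimport assumes the group was
    # exported; otherwise activating with vgchange alone may suffice.
    pvscan
    vgimport cinder-volumes
    vgchange -a y cinder-volumes

    # Re-point the volumes at the standby node so the scheduler routes
    # requests there, then bring up the volume service.
    cinder-manage volume update_host \
        --currenthost cinder-primary@lvm --newhost cinder-standby@lvm
    systemctl start openstack-cinder-volume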
So now I would like to call upon Nirendra to talk about how Veritas HyperScale for OpenStack actually deals with these kinds of failures.

Thank you, Dhruv. All right, before we talk about failures in HyperScale, let's talk about HyperScale a bit. HyperScale is an integrated enterprise storage solution that runs in an OpenStack environment. It's fully integrated with OpenStack, with Cinder and Nova being the components its storage layer plugs into. HyperScale consists of a set of compute, or hypervisor, nodes; these are the nodes that host the virtual machines, and those virtual machines, of course, do their IOs through the HyperScale storage layer. When a VM does IO, the IOs are synchronously replicated to a peer compute node, which means a peer compute node always has the recent delta; where this delta is needed, we'll see in a moment. In addition, over a network switch, the data is periodically synced to another set of nodes called data management nodes. These data management nodes are interesting because they host a lot of storage; they are the nodes that are heavy on storage. With this arrangement, a data node is always behind the compute node by some amount of time, let's say 15 minutes: every 15 minutes we sync the data.

So now, what happens if there is a compute failure, say the leftmost compute has failed? If it failed in the 15th minute, then the data that has just been synced to the data plane can be hydrated onto another compute node, so there is no data loss. What if it fails in the 22nd minute? In that case, the 15 minutes of data from the data plane can be hydrated, and the seven minutes of delta that was already on the peer compute node can be applied on top, so the data can be fully recovered. So there is pretty much no data loss in this architecture. And the deltas that are synced to the peer compute nodes can be periodically purged, so there is only a 15-minute delta at most.

Now let's talk about the failures here. What can fail? There can be a media failure. Suppose there is a media failure on compute one, the leftmost compute. If that happens, the IOs are immediately redirected to a peer compute node and served from the peer node and its storage. Now, of course, the peer node may only have the delta, or it may not have the required data at all because that data is with the data plane; in that case, data is synced up from the data plane to the compute plane to serve the IOs. And this being remote IO over the network switch, it could be slow, so the storage is gradually migrated from the data management plane to the compute plane. With that, the IOs can be served from local storage, and the VMs are auto-migrated. So with this arrangement, there is no IO loss and no data loss in case of a media failure.

Now let's see what happens if there is a software failure. Software does fail; I mean, no software is free of bugs. In the case of a software failure, the IOs, as I previously mentioned, immediately fail over to a peer compute node, and auto-healing of the software happens on the node where it failed. As soon as the software has healed, the IOs fail back to the original node. So there is no IO loss, and the node that was serving the VM continues to serve the virtual machine.

So let's talk about network failures. Now, these are interesting because the network can fail anywhere: either between the compute nodes, since data is being synced between them, or between a compute node and a data node. Suppose, in this example, the link between the compute node and the data node has failed, or the network switch has failed. The data nodes are a set of nodes, and here we do a lot of intelligent monitoring at the network layer to figure out whether it is just a network glitch or a real network failure; among other things, we deliberately delay declaring a network failure, because it could be only a glitch. A rough sketch of this kind of debouncing follows at the end. Once that determination is made, suppose we figure out it is a NIC failure on the data node the compute was talking to; then we immediately fail over the network path from that data node to another data node.

So, just to summarize: within HyperScale, we do auto-migration in case of a media failure, auto-healing in case of a software failure, and intelligent handling of failures in case of a network failure. So at this Boston Summit we are actually unveiling HyperScale, and HyperScale is, of course, fully resilient in a direct-attached storage environment. Thank you.
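To illustrate the glitch-versus-failure distinction Nirendra described, here is a minimal debounce sketch. HyperScale's actual monitoring is internal to the product, so the window, interval, and hostname below are purely assumptions.

    # Declare a real failure only if the peer stays unreachable for the
    # whole observation window; a transient glitch passes at least one probe.
    WINDOW=6; FAILED=0
    for i in $(seq "$WINDOW"); do
        ping -c 1 -W 2 data-node-1 > /dev/null 2>&1 || FAILED=$((FAILED + 1))
        sleep 5
    done
    if [ "$FAILED" -eq "$WINDOW" ]; then
        echo "real network failure: fail over to another data node"
    else
        echo "transient glitch: keep serving as before"
    fi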