Please welcome, from Red Hat, Adam Litke, who will be talking about certainty in shared storage environments. A quick housekeeping reminder: if you have questions, please hold them until the end of the session. These sessions are brief, so we want to fit in as much speaker content as possible. Without further ado, Adam Litke.

All right, thanks. Good afternoon; I hope you can hear me all right. As Brian mentioned, I'm Adam Litke. I work for Red Hat on storage and the oVirt project, and today we're going to talk about a little dose of reality.

I'm going to start with a really brief overview of the oVirt project and its storage architecture, just to give you some context for the rest of the presentation. Then we'll get into the mayhem that happens all the time in everyday oVirt deployments, and the kinds of things that go wrong. Next we'll talk about what we can do to remedy that and restore order to the system from a storage perspective. Finally, I'll try to bring it all together with some step-by-step examples that walk through the algorithms we employ and how they solve the problem.

Here's a quick, high-level overview of oVirt. Basically, oVirt is enterprise virtualization management. It is orchestrated by a management application that we call the oVirt Engine, which connects out to hundreds of hosts, on top of which we can run thousands of virtual machines. Engine is the brain behind the entire operation: it dispatches commands out to the hosts to achieve the results we need in the system, and it provides interfaces to end users via a REST API and a web interface. All of the hosts in oVirt connect out to shared storage, and we support lots of different types there, from NFS to Gluster, iSCSI, Fibre Channel, Ceph, and others.

That's the brief introduction, so let's go down one level in the chain. Each storage endpoint's main function is to store virtual machine disks, which in oVirt we call images. We use the QEMU qcow2 image format, so in our case a virtual machine disk can actually be a sequence of underlying volumes organized into a chain. For the purposes of today's presentation, we're going to be working at the volume level, so it's important to understand what oVirt thinks a volume is.

First we have the obvious component, the data area. This is the block device or file out on the storage that we initialize as the data area, whether as a raw volume or a qcow2 volume. That one's pretty easy. Next is the metadata area, which is stored roughly next to the volume out on the storage domain. It's a very small area holding key-value pairs that store properties of the volume: does it have a parent? Does the system regard the contents of the data area as legal? And one that I'll talk about today, the volume generation. The final area is the lease area. I'll get into this one in more detail later, but it's a shared storage lease that a host can use to gain exclusive access to the volume.
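Just to make that structure concrete, here is a minimal Python sketch of those three areas. The class and field names are illustrative only, not the actual oVirt/VDSM code:

```python
# A minimal illustrative sketch of the pieces that make up an oVirt
# volume as described above. These names are my own, not the real code.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VolumeMetadata:
    """Key-value pairs kept in the small metadata area next to the volume."""
    parent: Optional[str] = None  # UUID of the parent volume in the chain
    legality: str = "LEGAL"       # is the data area's content regarded as legal?
    generation: int = 0           # monotonically increasing volume generation


@dataclass
class Volume:
    """One volume in a virtual machine disk's chain."""
    uuid: str
    fmt: str                      # "raw" or "qcow2" data area
    metadata: VolumeMetadata = field(default_factory=VolumeMetadata)
    lease_offset: int = 0         # where the lease area lives on storage
```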
In oVirt, storage operations can be classified into two main groups. The first are metadata operations, which change the shape of the storage: think of adding and removing volumes from the system. If you were here last year in this room, I gave a presentation on how we can manage these operations in a resilient manner using garbage collection and storage transactions. Today I'm going to focus on the other type, data path operations. These are the main operations occurring in a steady-state oVirt environment, the obvious one being a virtual machine accessing its disk while it runs on a host. Other examples include the hypervisor copying data from a source volume to a destination volume. These operations require a different kind of protection, since garbage collection really can't handle the job.

We can take all of our data path operations and encapsulate them into something called a storage job. Say we're performing a high-level operation such as cloning a virtual machine disk from one storage type to another. This may comprise multiple sub-operations: if we have a volume chain, we have to copy each volume in the chain over to the destination. A storage job is one copy from one volume to another, the most granular form of operation we can provide. Engine packages each of these into a storage job and schedules it for execution on a host of its choosing. At some point, when that host has available resources, it begins to run the job, and it runs it in an asynchronous fashion, allowing Engine to query its progress during the operation. When the operation completes, the job is ephemeral: it goes away, and we don't track it in a persistent manner, so it roughly maps to the process of actually copying the data. So that's a little bit of background on storage jobs.
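As a rough illustration of that lifecycle, here is a hedged Python sketch of how a clone might decompose into per-volume jobs that are scheduled and then polled. The `host` client and its methods are hypothetical stand-ins, not the real Engine/VDSM interfaces:

```python
# Sketch: decompose a disk clone into granular per-volume storage jobs,
# then poll their asynchronous execution. All API names are hypothetical.
import time
import uuid


def clone_disk(chain, dst_domain, host, poll_interval=5):
    """Schedule one copy job per volume in the chain, then poll to completion."""
    pending = []
    for volume in chain:
        job_id = str(uuid.uuid4())
        host.schedule_job({
            "id": job_id,
            "operation": "copy_data",
            "source_volume": volume.uuid,
            "dest_domain": dst_domain,
            # Engine supplies the generation it expects the volume to have.
            "generation": volume.metadata.generation,
        })
        pending.append(job_id)

    # Jobs are ephemeral: once a job reports done or failed it goes away,
    # so Engine polls while the jobs still exist.
    while pending:
        for job_id in list(pending):
            status = host.get_job_status(job_id)
            if status in ("done", "failed"):
                pending.remove(job_id)
        time.sleep(poll_interval)
```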
This all works great, and it was designed this way for a reason: the chaos that happens in the real world. We can have power outages, network disruptions and corruption, and hardware failures of all different stripes, which affect the operation of the system in unexpected ways. We can even have software bugs, whether in oVirt, the kernel, libvirt, QEMU, and of course in our own code too. All of these things we can mitigate: redundant power supplies, redundant networks, better hardware that's better maintained, better software development processes. But as hard as we try, we can't eliminate them. So if you want reliable storage, you have to design it from the ground up with algorithms that provide it.

This graphic is a visual representation of what I just said. The interesting thing is that failures can happen in many different ways with different effects. The host can lose its connection to storage, Engine can lose its connection to the host, and there can be internal problems with the host itself. The key point is that many of these look exactly the same to Engine, so we can't say "if it was a power outage, do this." When a failure happens, we have no idea what it was. So how do we restore order when we don't even know what went wrong?

What this distills down to is: figure out all of the outstanding jobs that were running across the affected hosts in the system, and figure out what happened to them. And you have to use the storage to do this. Storage is the point of record; as I like to say, if it's not on storage, it never happened, regardless of what you tell me. So storage is our endpoint, and we'll use it to tell us what happened. First, we need to determine if there are any running jobs and decide whether we want to wait for them or kill them. Then, for any jobs that have ended, we need to decide if they succeeded or failed, so that we can report back to Engine and correlate all of the outstanding jobs to resolve the end user's command, whatever it was they requested.

To do these things with a storage-based approach, we have a couple of tools. The first is volume leases, which I touched on earlier. These are implemented with a technology called sanlock, which we also discussed last year. Sanlock runs a daemon on every host that's connected to oVirt storage, and it allows the host to join a lockspace on each storage domain; the lock state is maintained out on storage. Once a host is a member of that lockspace, it can request volume leases, which give it exclusive access to volumes. A host holding leases has to abide by the paxos lease algorithm, which means it must periodically update and renew its leases on storage and keep passing the liveness checks. If a host fails to operate correctly, sanlock can fence it. This is a two-phase process: first, sanlock tries to kill any outstanding processes associated with volume leases; if that fails, it can use the kernel watchdog device to force a hard reset of the system. With this in place, we have a guarantee of exclusive access to volumes, but just as importantly, a guarantee that leases won't become stale or stuck when hosts fail. They will always be released, so we can use another host to satisfy the request.

The second tool we're going to use today is volume generations. A volume generation is simply a monotonically increasing value within the volume metadata that can only be changed while holding the volume lease. It allows us to sequence jobs so that we can guarantee only one storage job per volume generation is able to run.

We take all of these pieces and put them together into a rough storage job structure. No matter what the job is, we always acquire the volume lease first. We compare a requested generation, supplied by Engine when scheduling the job, with the actual generation as it appears on storage. If they match, we do whatever work the job prescribes. When we're done, we increment the volume generation to the next integer, and then we release the volume lease.
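That skeleton fits in a few lines of Python. This is a minimal sketch assuming hypothetical helper objects (`volume.lease`, `volume.metadata`) that wrap sanlock and the on-storage metadata area, not the actual VDSM implementation:

```python
# A minimal sketch of the storage job skeleton just described.
class GenerationMismatch(Exception):
    """The generation on storage differs from what Engine requested."""


def run_storage_job(volume, requested_generation, work):
    volume.lease.acquire()                     # 1. take exclusive access
    try:
        # 2. Compare Engine's requested generation with storage's value.
        if volume.metadata.generation != requested_generation:
            raise GenerationMismatch(
                f"expected {requested_generation}, "
                f"found {volume.metadata.generation}")
        work(volume)                           # 3. do whatever the job prescribes
        # 4. The last thing a successful job does: bump the generation.
        volume.metadata.generation += 1
        volume.metadata.save()                 # hypothetical persist-to-storage
    finally:
        volume.lease.release()                 # 5. always release the lease
```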
Let's take a look at how this works in a normal flow. First, Engine chooses a host and schedules a storage job, whatever it may be, onto it, and gets a response from the host saying that scheduling was successful, so we can begin monitoring the job. At some point the host runs the job, and the first step is to acquire the volume lease. Next, we validate the generation against storage and find that it matches. We perform whatever operation we're doing as part of the job and send periodic progress updates back to Engine. When we're done, we increment the generation and release the volume lease. The last step is a done event: the job goes away, and we've resolved the situation.

That's all fine and good, but let's see what happens when there's a problem. We begin again with a job scheduled on the host, noting that we're using generation one for the volume; this is the expected generation. Then, during monitoring, something bad happens: we're no longer able to talk to the host. At this point we have no idea if it's powered off. Is it still talking to the storage and continuing to write to the volume? We have no idea whatsoever, so we have to figure out what's going on.

We're going to use something I'm calling volume reconnaissance. This is just a storage job like any other, except its purpose is to resolve what's happening with the affected volume. We schedule it onto a host that we are able to talk to, and it checks whether the old job is running by trying to acquire the volume lease; that gives us the first piece of information. If it is able to acquire the lease, it can then determine success or failure based on the value of the volume generation.

Let's look at how this works in practice. We select host B, which is still responding to us, and schedule a reconnaissance job with the last known generation of the volume, which is still one at this point. Now it's time to check on that lease, and there are two potential situations we can find ourselves in. In the first case, we fail to acquire the volume lease. It's worth noting here that volume leases have try-lock semantics: you try to take the lease, and if you can't get it, you get an error immediately. So if we fail to acquire the lease, we know the job is running, or has been running very recently, on the old host. We can then choose whether to wait for a while, or to use a feature from sanlock that says "I'm repealing the lock and requesting it for myself," which forces the job to end and fences it. In the alternate scenario, we acquire the lease successfully; now we know nothing is running, and we move on to the next step of resolving the situation.

Here again there are two possible paths. In the first case, the volume generation matches, and that tells us one of two things happened: either the job was scheduled but never ran, or the job ran and failed, because the last thing a successful job does is increment the volume generation. The other option is that the generation comparison fails, and this is counterintuitive, but it actually tells us that the previous job succeeded. It's important to note that after we acquired the volume lease, in both of these cases the generation ends up incremented to two. That serves a really useful function: if, for example, host A comes back to life, reconnects to storage, and tries to execute the old job, we don't want it stomping on anything we may have done since our reconnaissance. The old job will acquire the volume lease, compare the generation, find that it doesn't match, and be unable to proceed any further. So this is a way of preempting old jobs that we've already resolved and no longer care about.
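Here is the reconnaissance decision tree in the same sketch style, again with hypothetical lease and metadata helpers rather than the real code:

```python
# Sketch of volume reconnaissance. Volume leases have try-lock
# semantics: acquire() errors out immediately instead of blocking.
class LeaseHeldError(Exception):
    """Hypothetical error raised when the lease is held elsewhere."""


def reconnoiter(volume, last_known_generation):
    try:
        volume.lease.acquire()
    except LeaseHeldError:
        # The old job is running (or was very recently): wait it out,
        # or repeal the lease via sanlock to fence the job.
        return "still-running"
    try:
        if volume.metadata.generation == last_known_generation:
            # The job never ran, or ran and failed: a successful job
            # always increments the generation before releasing the lease.
            outcome = "failed-or-never-ran"
        else:
            # Counterintuitively, a mismatch proves the old job succeeded.
            outcome = "succeeded"
        # Either way, move the generation past the old job's expected
        # value, so a resurrected host replaying that job is preempted
        # by a generation mismatch.
        volume.metadata.generation = last_known_generation + 1
        volume.metadata.save()
        return outcome
    finally:
        volume.lease.release()
```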
A couple of things we want to do in the future. When we talk about copying volumes from a source to a destination, I always talked about the destination lease, which is arguably the most important one because it protects write access. But we want to acquire a shared lease on the source volume as well. Today, Engine protects against performing an operation on a volume that's in use through orchestration-level locking, but we'd like to be even safer and push that protection down into the storage subsystem itself.

The next one is parallel job execution. The design is set up so that if we have four volumes to copy at once, we can do them all on different hosts at the same time. There's a bit of scheduling logic that goes into dispatching all the operations, monitoring their status, and combining everything into the final result, and we're working on that.

Finally, we have something called VM leases, which is a way of using a storage lease to represent a VM, preventing, at the storage level, the same VM from running on multiple hosts, which would obviously be a bad idea. On the storage side, we also want to prevent storage operations from conflicting with running VMs. Again, that's handled today by orchestration-level locking, but we want to enforce it at the storage level as well for extra safety.

I hope you found this walkthrough interesting. Here are some links to the oVirt project, where we'd welcome more users and more developers; I'd love to see everyone over there. And at this time, if there are any questions about anything, just yell out.

Audience: Is it supported in hosted engine?

Yeah, it should be working everywhere; it should be agnostic of that. In terms of the VM lease integration, we'll have to be extra careful about that, of course, but for typical storage operations it'll work seamlessly. All right, thanks everyone.