Thank you. Good morning. My name is Alistair Coles. I'm with SWIFTAC, and I've been working on Swift for about five years; for I think four of those years I've been on the core developer team. This morning I'd like to tell you a little about how Swift can be used to build geographically distributed clusters. I'm going to begin with a brief overview of Swift. I apologise to those of you who were here earlier this morning and heard Tiago's talk, but I'll try to get through it reasonably quickly before starting on geographically distributed clusters, and hopefully answering the questions: what are they? Why might you want to build one? And how does Swift enable that? Then I'm going to look at another really nice feature of Swift, which is erasure coding, and talk about how erasure coding also works with distributed clusters. Distributed clusters have been a feature of Swift for many years; erasure coding has been available for maybe two or three years. But it's only in the last year that we've managed to get the two of them working well together, so I'm really excited to tell you about what we've done there.

OK, so what is Swift? Swift is an object storage service, so it's great for storing blobs of unstructured data: pictures, media files, virtual machine images, whatever it might be. It was one of the founding projects of the OpenStack cloud software suite, and it's been around in production for seven or eight years now. Swift offers a REST API accessed over HTTP, and this API offers a standard set of operations to create objects, read them back, update them, and delete them. Now, it's important to understand that Swift is not a file system, and it's definitely not a block storage system. It does have a very simple naming hierarchy: objects belong to containers, and containers belong to accounts.

OK, so what are some of the properties of Swift? Well, first of all, data stored in Swift is extremely durable. Typically Swift will store three replicas of every object written into the storage service. Here, just to explain, is a very simplified architectural diagram of Swift. We have an HTTP PUT request putting an object to a URL, /A/C/O: that's the account/container/object name structure. That request is handled by a proxy service, which writes three replicas of the object onto three disks in a pool of storage servers. And at the heart of Swift is a component we call the ring. Unfortunately I don't have time to go into a lot of detail about the ring, but it's a form of consistent hashing; there's a data structure there, and the ring is always trying to disperse replicas of our objects across both disk devices and servers in the storage pool. So that's how Swift achieves durability: we always have more than one copy of an object, and the copies are written to different disks on different servers.

Swift is also very scalable, and there are two factors that contribute to this. Again, the ring has a role to play: as well as dispersing replicas of our objects, the ring is always trying to balance the load of objects across the storage pool. So here we have two different objects being written to two different names, and the ring has chosen a different set of devices to store each of them. I said it's a consistent hashing algorithm; the hashing naturally causes objects to be dispersed somewhat randomly and uniformly across the storage pool.
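To make the ring idea concrete, here is a minimal sketch of consistent-hashing-style placement. It's illustrative only, not Swift's actual implementation: the real ring uses partitions, device weights and zones, and a much more careful algorithm, and all the device names here are invented.

```python
import hashlib

# Invented device list; in real Swift each device also carries a weight,
# a zone and a region.
DEVICES = [
    {"id": 0, "server": "node1", "device": "sdb"},
    {"id": 1, "server": "node1", "device": "sdc"},
    {"id": 2, "server": "node2", "device": "sdb"},
    {"id": 3, "server": "node2", "device": "sdc"},
    {"id": 4, "server": "node3", "device": "sdb"},
    {"id": 5, "server": "node3", "device": "sdc"},
]

PARTITION_POWER = 8  # 2**8 = 256 partitions

def get_partition(account, container, obj):
    """Hash the object path to one of 2**PARTITION_POWER partitions."""
    key = f"/{account}/{container}/{obj}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % (2 ** PARTITION_POWER)

def get_nodes(partition, replicas=3):
    """Pick `replicas` devices for a partition, preferring distinct servers."""
    chosen, used_servers = [], set()
    for i in range(len(DEVICES)):
        dev = DEVICES[(partition + i) % len(DEVICES)]
        if dev["server"] not in used_servers:
            chosen.append(dev)
            used_servers.add(dev["server"])
        if len(chosen) == replicas:
            break
    return chosen

part = get_partition("AUTH_test", "photos", "cat.jpg")
print(part, [f"{d['server']}/{d['device']}" for d in get_nodes(part)])
```

The same object name always hashes to the same partition, so every proxy computes the same placement without any coordination, which is the property the next point relies on.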
So this helps with scalability. But we also have no centralised services. For example, we can have multiple instances of these proxy servers and multiple instances of the ring, and there's no need for any communication between them as objects are written; no coherency protocol. They're essentially independent, separately operating services. The only time the ring does need to be updated is when changes are made to the devices in the storage pool.

Finally, Swift is highly available: it will continue to accept writes and to serve reads even when one or more of the devices in the storage pool has failed. In the example I'm using here, a three-replica storage policy, Swift considers a write to be successful if two of the three replicas (a quorum) have been written to disk. Here you can see that for some reason the third replica has failed to be written; maybe there's a network failure, a failed disk, congestion, whatever. But we have two replicas written, Swift considers that a quorum, and so the write has been successful. Then we have background asynchronous processes that are continually working to replace missing replicas by copying data from the existing replicas, so not long after this write request, those processes will ensure that the third replica is in place. Now, I haven't actually told you the whole truth there: Swift is actually doing a little bit more, but I'll get to that later in my talk. It's a little better than that, but those are the basic principles behind the high availability.

Now, a consequence of this is that we can end up with stale data in Swift. Here's a slightly more complicated example, where the object written at time T1 has been overwritten at time T2, only the overwrite was only partially successful. So I still have one of the blue replicas on disk from time T1, and only two of the three replicas from time T2 were successfully written. That means there's some stale data in the cluster, and it's possible to read that stale data, because the ring will be choosing one of those replicas at random to serve reads.

This is a very different property of Swift, and of other object stores, compared to a file system or a block storage service. With a file system we expect that when we've written data, the next read will return the data we last wrote; that's a contract we're very used to. I emphasise this point because Swift has a different contract with regard to data consistency. We refer to it as eventual consistency: eventual because those background asynchronous processes are always working to update all of the replicas and ensure that eventually your data becomes consistent. If that sounds odd to you, rest assured there are plenty of applications where this consistency contract is totally acceptable, and it's what enables Swift to be very scalable and highly available.

Okay, that was my very brief overview of Swift. We're going to touch on some of those concepts again as I talk about distributed clusters.
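Before moving on, the quorum rule just described can be sketched very simply. This is not Swift's code, just the shape of the logic; as described in the talk, the quorum works out as 2 of 3 (and, as comes up later, 2 of 4):

```python
def quorum_size(replicas):
    # Majority quorum as described in the talk: 2 of 3, and also 2 of 4.
    return (replicas + 1) // 2

def put_object(data, nodes, write_to_node):
    """Attempt to write `data` to every node; report success on a quorum.

    `write_to_node(node, data)` is a stand-in for the real replica write
    to a storage node, returning True on success.
    """
    successes = sum(1 for node in nodes if write_to_node(node, data))
    return successes >= quorum_size(len(nodes))

# One of three replica writes fails, but the PUT still succeeds:
outcomes = iter([True, True, False])
print(put_object(b"hello", ["d1", "d2", "d3"], lambda n, d: next(outcomes)))
```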
So first, what is a geographically distributed cluster? My definition, for the purposes of this talk, is a cluster in which data is stored in multiple physical locations, with those locations typically connected by a wide area network. And importantly, at least for the purposes of my definition, every object stored in a distributed cluster has at least one copy in each of those physical locations. So we might be talking about multiple data centres, say one in London and one in Geneva, connected by a wide area network. And the entire cluster, the global cluster, operates under a single namespace: objects are written and read in any region within the cluster under the same names.

Okay, why would you want to do this? The first reason is increased data durability, and in particular disaster recovery: if some catastrophic event were to take out an entire data centre, you'd still have copies of your data available in another physical location. But it's also useful for achieving data locality. If you have users accessing the same data sets in multiple geographic regions, a distributed cluster can mean that copies of your data are located close to each of those users, and you're able to serve their read requests with low latency. Oh, and just to mention: if you look through the Swift documentation and literature, you'll also see these distributed clusters referred to, for obvious reasons, as global clusters or as multi-region Swift.

Okay, so let's see how that actually works when you map it onto a Swift cluster. What I've done here is add a few more storage servers to my storage pool and group them into two regions. This is really easy to achieve: in Swift, every device in the storage pool is annotated with some metadata, and that metadata associates it with a server and also with a region. So in the same way that the ring was always trying to disperse replicas of an object across disks and across servers, the same principle is extended to include regions. The ring just inherently works to disperse our object replicas across these multiple regions. And that's great, because it means that if I lose an entire region from my cluster, I still have replicas of the object in the other region and it's still available.

Now, you may have noticed, if you're sharp, that I've changed the replica count of the storage policy in my example. Originally I said that deployers might typically use three replicas in a Swift cluster. The reason I've increased it to four is not a requirement for a distributed cluster, but it's a nice number because it gives a symmetry between the regions: we end up with two replicas of every object in each of the two regions. That also means that, as well as being able to survive the loss of an entire region, each region can independently survive the loss of one device and still have the object available. So a four-replica policy is just a nice choice, and a choice that many deployers use when operating multi-region clusters.
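Extending the earlier placement sketch, region becomes just another tier to disperse across: fill distinct regions first, then distinct servers. Again, this is an invented illustration of the principle, not Swift's actual algorithm:

```python
# Devices now tagged with a region as well as a server (names invented).
DEVICES = [
    {"id": 0, "region": 1, "server": "lon-node1", "device": "sdb"},
    {"id": 1, "region": 1, "server": "lon-node2", "device": "sdb"},
    {"id": 2, "region": 1, "server": "lon-node3", "device": "sdb"},
    {"id": 3, "region": 2, "server": "gen-node1", "device": "sdb"},
    {"id": 4, "region": 2, "server": "gen-node2", "device": "sdb"},
    {"id": 5, "region": 2, "server": "gen-node3", "device": "sdb"},
]

def get_nodes(partition, replicas=4):
    """Unique-as-possible placement: spread over regions, then servers."""
    chosen = []
    for _ in range(replicas):
        best = min(
            (d for d in DEVICES if d not in chosen),
            key=lambda d: (
                sum(c["region"] == d["region"] for c in chosen),  # least-used region
                sum(c["server"] == d["server"] for c in chosen),  # least-used server
                (d["id"] - partition) % len(DEVICES),             # stable tiebreak
            ),
        )
        chosen.append(best)
    return chosen

# With 4 replicas and 2 regions, we always end up with 2 replicas per region.
print([(d["region"], d["server"]) for d in get_nodes(partition=42)])
```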
So this is great. It looks like we haven't really had to do much to our Swift cluster to achieve data durability and disaster recovery. What about data locality? Well, obviously I now have copies of my object in both regions, close to any users that might be in those regions. But we need to do a little more work to actually achieve data locality in this cluster, and that's because the ring, as I said, is always trying to load balance: by default, when serving a read request, the ring selects a random replica for each individual read. That means that a read arriving in region one may, by default, actually be directed to a copy of our object in region two. That's not optimal, but fortunately we have an option in Swift to override that behaviour. It's called read affinity. It's just an option set in each of the proxy servers, and all it does is put a bias into the ring's selection so that replicas resident in the local region are preferred when serving reads. So read affinity is basically giving us a means to trade off load balancing for read performance in the case of a distributed cluster. And it's typically recommended; it's a good idea.
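To make read affinity concrete: in proxy-server.conf it's configured with `sorting_method = affinity` and a `read_affinity` spec such as `r1=100` (lower values are preferred, and zones can be named too, e.g. `r1z1=100`). The sketch below shows the kind of node ordering such a spec implies; the parsing is simplified and the node dicts are invented for the example:

```python
def parse_affinity(spec):
    """Parse a spec like 'r1=100, r2=200' into {region: priority}.

    Simplified: real read_affinity specs can also target zones.
    """
    priorities = {}
    for term in spec.split(","):
        region, value = term.strip().split("=")
        priorities[int(region.lstrip("r"))] = int(value)
    return priorities

def sort_nodes_for_read(nodes, spec):
    """Order candidate replica nodes so preferred regions are tried first."""
    priorities = parse_affinity(spec)
    return sorted(nodes, key=lambda n: priorities.get(n["region"], float("inf")))

nodes = [{"region": 2, "ip": "10.2.0.1"}, {"region": 1, "ip": "10.1.0.1"},
         {"region": 2, "ip": "10.2.0.2"}, {"region": 1, "ip": "10.1.0.2"}]
# A proxy in region 1 configured with read_affinity = r1=100:
print(sort_nodes_for_read(nodes, "r1=100"))  # region 1 replicas come first
```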
Okay, so we've seen that, relatively simply, Swift has been able to cope with my servers being distributed. We've annotated the devices with some region information; the ring pretty much operates as it would for a single-site cluster; and we have this read affinity option to optimise for data locality. All sounds good. Let's pause at this moment, though, because we should consider the fundamental difference between this distributed cluster and a single-site cluster, which is that we now have a wide area network component sitting in the middle of our storage pool. With a single-site cluster it's reasonable to assume that all of the nodes in the storage pool are connected by low-latency, high-bandwidth, reliable networking. That assumption may not hold when there's a wide area network connection between two regions of our storage pool.

So let's think about the consequences, first of all, if that wide area connection were to fail, and in particular what happens to our writes. This is where I said earlier that I wasn't telling you the whole truth about the way Swift writes replicas during a write request. In this case, the proxy is temporarily unable to reach the locations intended for two of the four replicas. But it can write two successfully, so it can achieve a quorum; in the case of a four-replica ring, two successful writes constitute a quorum. What I didn't tell you is that Swift actually works harder than that: it doesn't just stop when it reaches quorum. It will look for temporary alternate locations (handoff locations) for the remaining two replicas and write them there. And then again those background asynchronous processes are always working to move misplaced replicas to their correct locations, in this case once the wide area network becomes available again. So despite losing the wide area network, we're still writing fully durable data, albeit into one region.

This gets a little more interesting when we consider an overwrite. Here, just like my previous overwrite example, some data has been written at T1, and then at T2 the wide area network fails. So what happens to the overwrite at T3? Well, as I just said, Swift works really hard and writes down four replicas. Two of them successfully overwrite the older replicas in region one; two are written to temporary locations in region one. But we still have the old replicas down in region two. So again we see the eventual consistency effect: there's a window of time, while the wide area network is unavailable, when reads in the second region may return temporarily inconsistent data. As soon as the network heals, the background processes will fix that; Swift is always working to heal and to bring the cluster back to the most consistent state it can.

So hopefully our wide area network doesn't fail too often. But it may well have lower bandwidth or higher latency than the network within a single site, and it wouldn't be unreasonable to ask: isn't this actually a terrible idea, if every write request into the cluster now has to write data into the remote region? Isn't this going to slow down every PUT request? And the answer is: potentially, yes. So here I've built a small development cluster with storage devices in two regions, and I've artificially slowed down the write time to the storage nodes in one of those regions. The y-axis is the overall completion time for a PUT request. As you'd expect, starting from a fairly responsive baseline, as I increase the time taken to write replicas to the remote region, the overall request completion time increases. It's actually bounded at an upper limit, because there's a final part of the whole truth to tell you: Swift requires a quorum of successful writes, and it will then try really hard to write the remaining replicas, but after a timeout, which we call the post-quorum timeout, it gives up and says: I have a quorum of replicas, I'm going to return to the client and report this write as successful. So that caps the degradation in write performance you'd see as the remote region gets slower and slower. Now, the latencies I've injected into these remote writes are quite extreme, but still, this isn't great. So is there anything we could do to improve write performance when we have distributed regions?

Well, think back to what happened when the WAN failed: the proxy just wrote all four replicas into the local region. So how about we deliberately do that all of the time? That's a mode Swift has called write affinity, by analogy with read affinity. What it does is override the behaviour of the ring: rather than directing all four copies of the object to their intended locations in both regions, it writes the remote copies to temporary locations in the local region. That means the PUT request completes much more quickly, but we still have four copies of our data. It's a fully durable write, albeit with all copies in one region. So this write affinity mode is another trade-off: it temporarily trades the dispersion of our objects for increased write performance. Clearly there's a window of time here where, if the first region became unavailable, we'd have no copy of the object in the second region. But it's a trade-off we can use to improve performance. And as you'd expect, I wouldn't be telling you about this if it didn't work: at the end of this graph you can see that I enabled write affinity in my test cluster, and immediately the overall PUT request completion time dropped back down towards the baseline. So this is great.
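Write affinity is also a proxy-server.conf setting: `write_affinity = r1` selects the local region, and `write_affinity_node_count` (e.g. `2 * replicas`) bounds how many local nodes, primaries plus handoffs, are considered. Here's a rough sketch of the node-selection idea; the node lists are invented and the real logic is more involved:

```python
def select_write_nodes(primaries, handoffs, local_region, replicas=4):
    """Sketch of write affinity: keep the local primaries, then
    substitute local handoff nodes for the remote primaries."""
    selected = [n for n in primaries if n["region"] == local_region]
    local_handoffs = (n for n in handoffs if n["region"] == local_region)
    while len(selected) < replicas:
        selected.append(next(local_handoffs))
    return selected

primaries = [{"region": 1, "ip": "10.1.0.1"}, {"region": 1, "ip": "10.1.0.2"},
             {"region": 2, "ip": "10.2.0.1"}, {"region": 2, "ip": "10.2.0.2"}]
handoffs = [{"region": 1, "ip": "10.1.0.3"}, {"region": 2, "ip": "10.2.0.3"},
            {"region": 1, "ip": "10.1.0.4"}]

# All four copies land in region 1; the two written to handoff nodes are
# moved to their region 2 primaries later by the background replicators.
print(select_write_nodes(primaries, handoffs, local_region=1))
```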
I should point out, though, that with write affinity enabled, although we're writing all of the replicas to one region, the data is still available in the other region. It just means that any read request made during the window of time before those replicas have been asynchronously moved will be served by going back across the wide area network and reading the data from region one. Once the async process completes, those reads will be served from the local region. So we have no loss of availability, we have a temporary loss of dispersion, but we still have full durability in the first region, and our write performance is hugely improved.

And of course, I probably don't need to repeat this, but we're also trading off consistency for write performance. Take my overwrite example again. After the first write at T1, Swift has, in the background, relocated all of the replicas to their correct locations, so we have two replicas of the T1 object in the second region. With write affinity, the overwrite deliberately defers the update of those two replicas in region two, leaving it to the asynchronous background processes, which means there's a window of time when we may be reading stale data in the second region. But that's okay, as long as we clearly understand the eventual consistency contract that Swift offers.

So write affinity is a powerful tool, but you need to use it carefully; it's not always appropriate. First of all, there is, as I say, no free lunch here: at some point, data has to be moved across the wide area network. Those replicas we've written into temporary locations do need to be moved eventually. So if you have a workload with a continuously high write rate, continuously high ingress into your cluster, then those misplaced replicas are likely to back up in your local region, because the asynchronous processes can't keep up with the ingress rate. Also, if you have a use case where clients in the remote region try to read object data immediately after it's written, those reads will end up fetching data back from the remote region, because it hasn't yet been moved to their local region. And if that's happening, you might as well have written the data across the wide area network in the first place rather than reading it back immediately afterwards. So there are some workloads that aren't suitable for write affinity; in fact, if you have a very high sustained ingress rate, it's probably best to just take the hit of whatever latency the remote writes incur, which in turn generates appropriate back-pressure to the clients and governs the ingress rate into your cluster. But there are workloads where write affinity is a really useful tool: bursty traffic, for example, where you want to rapidly accept a lot of writes into the cluster but then have a quiet period during which replicas can be moved asynchronously towards the dispersion goal. In particular, if your remote clients don't need to read data immediately after it's written, write affinity is perhaps a suitable tool, for example if your goal is to replicate archives for delayed access by clients in multiple regions.
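To put rough numbers on the "misplaced replicas back up" point, here's a back-of-envelope check; every figure below is invented purely for illustration:

```python
# With write affinity on, every PUT leaves bytes that must later cross
# the WAN. Sustained ingress has to stay below what the WAN can drain.
ingress_mb_s = 200      # client write rate into region 1 during a burst
remote_copies = 2       # replicas that ultimately belong in region 2
wan_mb_s = 150          # usable WAN bandwidth for async replication

backlog_growth = ingress_mb_s * remote_copies - wan_mb_s   # MB/s
burst_seconds = 600                                        # 10-minute burst
quiet_needed = backlog_growth * burst_seconds / wan_mb_s   # seconds to drain

print(f"backlog grows at {backlog_growth} MB/s during the burst")
print(f"and needs about {quiet_needed:.0f}s of quiet time to drain")
```

If the workload never gives you that quiet time, the backlog only grows, which is exactly the sustained-ingress case where write affinity is the wrong tool.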
OK, so Swift is very well able to support geographically distributed clusters, and I've shown you a couple of the tools and tuning options we have for making trade-offs when doing that. We also have this other really nice feature; we have many nice features, but another one is erasure coding. And I say here, with an exclamation mark: Swift now supports this! What I want to emphasise is that Swift now supports the combination of erasure coding with distributed clusters. Until this last year, those two features didn't work too well together, so I'm going to explain why that was and the steps we've taken to fix it, because there were again some interesting choices we had to make.

Apologies to those of you who already have a really good understanding of erasure coding, but I felt I should very briefly describe what it is. Erasure coding is a very popular technique for storing data durably while using less storage space than a replication policy. It's a mathematical algorithm: in the example I'm showing here, the coding algorithm accepts a blob of data and splits it into a number of fragments. In this case, the data has been split into four data fragments, and the erasure code also calculates a number of parity fragments; in the example I've chosen, two parity fragments. At least in the Swift community, we refer to this as a four-plus-two erasure coding scheme. Each parity fragment is the same size as a data fragment, so in this example we've added two more fragments, 50% more than the size of the original data. The really nice feature of an erasure code, and the reason it's called an erasure code, is that we can erase any two fragments from that set and still reconstruct the original data. Erasing two should sound familiar: with a three-replica storage policy we could lose two of the three replicas and still have a complete copy of our object. Here we can lose two of our fragments and still reconstruct the original object, while using only one and a half times the size of the data. That's why erasure coding is very popular.

Here's how it works out in Swift. Erasure coding is implemented in the Swift proxy, as data is inbound into the cluster. I'm sticking with the same example, four data fragments plus two parity fragments, and we have a couple of open source libraries that we use to implement the algorithms. The ring behaves exactly as it does for replicas; in fact, the ring is completely unaware that it's dealing with erasure code fragments rather than replicas. So, as I said before, the ring is always trying to disperse fragments across the storage pool, and we end up with one fragment on a disk on each of our servers, which means we can lose two disks and still reconstruct our object data, while using roughly half the storage of a three-replica scheme with similar durability.

So that's great. What was the problem with distributed clusters? Isn't this just going to work? Well, let me add the regions back into my storage pool: same erasure coding scheme, four data fragments, two parity fragments. The ring has done its job and dispersed the fragments uniformly across the regions. Unfortunately, though, that leaves only three fragments in each region, and we need four fragments to reconstruct the object. So we don't yet have our disaster recovery, data durability property: if we lose one of our regions, we don't have sufficient fragments in the other region.
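Swift implements the encode and decode steps in the proxy using PyECLib, backed by liberasurecode. If you have PyECLib installed, a four-plus-two scheme looks roughly like this; treat the exact `ec_type` string as an assumption, since available backends depend on how liberasurecode was built:

```python
from pyeclib.ec_iface import ECDriver

# 4 data fragments + 2 parity fragments.
driver = ECDriver(k=4, m=2, ec_type="liberasurecode_rs_vand")

data = b"some object body " * 4096
fragments = driver.encode(data)   # 6 fragments, each ~1/4 the object size
assert len(fragments) == 6

# "Erasure" in action: lose any two fragments, say numbers 1 and 4,
# and the original data can still be reconstructed from the other four.
survivors = [fragments[0], fragments[2], fragments[3], fragments[5]]
assert driver.decode(survivors) == data
```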
So although in the single-site case this four-plus-two erasure coded policy gave us similar durability to three replicas, when we move to the multi-region case it's not quite enough, which is reasonable, because we're storing nowhere near as much data as we were with a replicated policy. So we need some more fragments. How about I change the erasure coding scheme: still four data fragments, but now six parity fragments, giving a total of ten. Why that scheme in particular? Well, it ends up with five fragments in each region. Four fragments are enough to reconstruct my data, so I can lose an entire region and still reconstruct, and in fact each region can also lose a device and still reconstruct. So this scheme is equivalent to four replicas, while using about two and a half times the size of the data, versus four times for the replicas. So this is looking good. This works. It gives us our distributed cluster with disaster recovery, and we have data locality: we can read the data in either region without going across the wide area network.

Are we done? No. Unfortunately we hit another problem, because calculating all of those extra parity fragments introduces a compute burden in the proxy server. I've calculated an example here: the relative compute time to encode a 40 megabyte object using one of the backends we have available. The x-axis is the number of parity fragments, always with four data fragments. As we go from four-plus-two to four-plus-six, we roughly double the compute time to encode the object, and that turns into latency in our write path. So that wasn't great.

Then somebody very clever (not me) made the observation that although we want five fragments in each region, and the fragments within a region need to be unique with respect to each other, the two regions don't need to hold different fragments from each other. So instead of calculating more parity fragments, how about we just duplicate the set of fragments we have and spread the copies across the two regions? That means I can drop back to four data fragments plus just one parity fragment, much less compute burden, then duplicate that set of five fragments and distribute the ten across my cluster, and I have the result I was looking for: still the same amount of data on disk, two and a half times the size of the original data, with the durability and data locality I would have got from a four-replica policy.

Okay, great, it's looking good again; I'm smiling again. But there was one last problem: unfortunately, although the ring does a great job of dispersing fragments throughout the cluster, it makes no guarantees as to which fragment goes to which device or which region. So we can end up in a situation like this, where we have five fragments in each region but not a unique set of fragments in each region. Here, in the second region, I've ended up with two copies of fragment one, two copies of fragment three and one copy of fragment two. That's only three unique fragments, and again that's not enough to reconstruct the data from that region alone.
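Fragment duplication, building on the previous sketch, is then just this. (In swift.conf, the storage policy option that enables it is, as I recall, `ec_duplication_factor`.) Note that the final assertion assumes each region holds one complete copy of fragments 0 to 4, which is precisely the split that, as just described, a single ring cannot guarantee:

```python
from pyeclib.ec_iface import ECDriver

# 4 data + 1 parity = 5 unique fragments; duplicating the whole set
# gives 10 fragments, intended to be 5 unique ones per region.
driver = ECDriver(k=4, m=1, ec_type="liberasurecode_rs_vand")

data = b"some object body " * 4096
unique_fragments = driver.encode(data)                 # fragments 0..4
duplication_factor = 2
all_fragments = unique_fragments * duplication_factor  # 10 fragments to store

# If region 2 holds a complete duplicate set, it can reconstruct the
# object on its own from any 4 of its 5 fragments:
region_2 = all_fragments[5:]
assert driver.decode(region_2[:4]) == data
```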
So there's one final piece in the jigsaw, or the journey if you like, towards getting erasure coding and distributed clusters working together, and that was to add what we call a composite ring. It's actually quite a simple idea. What we do is take two of our regular rings and allocate one ring to each of the regions. So now we have one ring behaving exactly as it normally would, but responsible only for managing dispersion and load balancing in its own region, and another ring doing the same for the other region. And this new composite ring concept guarantees that the duplicate sets of fragments are always spread between the two regions. And that's the end of our solution; that's how it works. We're now guaranteed to always have a unique set of erasure code fragments in each region, we have disaster recovery, we have data locality, and we get all the benefits of the reduced storage requirements that come with erasure coding.

And that's it, that's the end of my talk. Thanks for your attention, and I think we have a few minutes for questions. While you think of questions, I'm just going to pop this up: we welcome anyone who would like to come and contribute.

How do you do rebalancing? Great question. The question was: how do we do rebalancing? And I probably need to add some context. Rebalancing is a process in Swift for when devices are added to or removed from the storage pool; I assume that's what you're referring to. When that happens, we need to move data that was previously resident on a device that's been removed, and we might need to move data to populate a device that's been added. Rebalancing for an erasure-coded scheme actually operates in much the same way as it does for a replicated scheme. At that point the ring recalculates a new data structure that captures the mapping from what we call partitions (virtual subspaces of the hash space in the consistent hashing ring) to devices, and the ring has an algorithm that tries to move as few replicas as possible during rebalancing while still maintaining the properties of dispersion and balance throughout the entire cluster. It's the same with an erasure coding scheme, which is partly why the ring is kept ignorant of the significance of the individual erasure code fragments: it's just treating them like replicas.

Okay, any more questions?
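To pin down the composite ring idea described a moment ago: Swift has real tooling for building composite rings, but the code below is my own simplified illustration of the guarantee they provide, with every name invented for the example:

```python
import hashlib

def region_ring(devices, fragments_per_region):
    """A stand-in for one ordinary ring that manages a single region."""
    def get_nodes(partition):
        # Rotate through this region's devices; a crude stand-in for
        # real per-region dispersion and load balancing.
        start = partition % len(devices)
        return [devices[(start + i) % len(devices)]
                for i in range(fragments_per_region)]
    return get_nodes

def composite_get_nodes(partition, ring1, ring2):
    """One full fragment set always lands in region 1 and its duplicate
    set always in region 2: the guarantee a single ring cannot make."""
    return ring1(partition) + ring2(partition)

region1 = region_ring([f"lon-disk{i}" for i in range(8)], 5)
region2 = region_ring([f"gen-disk{i}" for i in range(8)], 5)

part = int(hashlib.md5(b"/AUTH_test/photos/cat.jpg").hexdigest(), 16) % 256
for index, node in enumerate(composite_get_nodes(part, region1, region2)):
    print(index % 5, node)   # each region holds fragments 0..4 exactly once
```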
Can we do more than two regions? Yes, and I'm aware of production clusters operating over more than two regions. I can think of one example immediately: a three-region cluster where, for their use case, they've chosen to deploy three replicas, so they end up with one replica in each region, because that gives them the durability they require. With erasure coding I would say, cautiously, that at this point in time it might be good to experiment with two regions before jumping in with four or five: depending on your choice of erasure coding scheme, you can end up with a lot of fragments being written to the back-end storage nodes. So it's probably good to experiment with a two-region cluster first. But yes, multiple regions are definitely possible.

Yes? Let me repeat the question to make sure I understood it. I think the question was: if you would like to use erasure coding, can you move to an erasure coding policy without downtime? The answer is yes and no, sorry. I've glossed over a topic that Tiago covered earlier, which is that within a Swift cluster we can have multiple storage policies operating at the same time. So in a single cluster we could have a replication policy running alongside an erasure coding policy, and the client can choose which of those policies to store data in. It is absolutely possible to introduce an erasure coding policy alongside a replication policy. What we don't have at this moment in time is a mechanism to automatically migrate data from the replication policy into the erasure coding policy; there is some work in progress on that, but we don't have it yet. But absolutely, you can add erasure coding to an existing cluster.

Okay, I've been told time's up; sorry, they're going to chase me out. Thank you.
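One footnote to that last answer: the policy choice is made per container, at container creation time, via the X-Storage-Policy header. Here's a minimal sketch using the requests library; the URL, token and policy name are placeholders, and the policy itself must already be defined in the cluster's swift.conf:

```python
import requests

storage_url = "http://proxy.example.com:8080/v1/AUTH_test"  # placeholder
token = "AUTH_tk_placeholder"                               # placeholder

# Create a container bound to an erasure coding policy (name invented):
resp = requests.put(
    f"{storage_url}/ec-archive",
    headers={"X-Auth-Token": token, "X-Storage-Policy": "ec42"},
)
resp.raise_for_status()

# Objects PUT into this container are now erasure coded; other
# containers can keep using the default replication policy.
requests.put(
    f"{storage_url}/ec-archive/backup.tar",
    headers={"X-Auth-Token": token},
    data=b"archive bytes",
)
```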